linux-btrfs.vger.kernel.org archive mirror
* Are nocow files snapshot-aware
@ 2014-02-04 20:52 Kai Krakow
  2014-02-05  1:22 ` Josef Bacik
  0 siblings, 1 reply; 15+ messages in thread
From: Kai Krakow @ 2014-02-04 20:52 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I'm curious... The whole snapshot thing on btrfs is based on its COW design. 
But you can make individual files and directory contents nocow by applying 
the C attribute on it using chattr. This is usually recommended for database 
files and VM images. So far, so good...

But what happens to such files when they are part of a snapshot? Do they 
become duplicated during the snapshot? Do they become unshared (as a whole) 
when written to? Or when the parent snapshot is deleted? Or is the nocow 
attribute simply ignored after a snapshot has been taken?

After all they are nocow and thus would be handled in another way when 
snapshotted.

-- 
Replies to list only preferred.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Are nocow files snapshot-aware
  2014-02-04 20:52 Are nocow files snapshot-aware Kai Krakow
@ 2014-02-05  1:22 ` Josef Bacik
  2014-02-05  2:02   ` David Sterba
  0 siblings, 1 reply; 15+ messages in thread
From: Josef Bacik @ 2014-02-05  1:22 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 02/04/2014 03:52 PM, Kai Krakow wrote:
> Hi!
>
> I'm curious... The whole snapshot thing on btrfs is based on its COW design.
> But you can make individual files and directory contents nocow by applying
> the C attribute on it using chattr. This is usually recommended for database
> files and VM images. So far, so good...
>
> But what happens to such files when they are part of a snapshot? Do they
> become duplicated during the snapshot? Do they become unshared (as a whole)
> when written to? Or when the parent snapshot is deleted? Or maybe
> the nocow attribute is just ignored after a snapshot was taken?
>
> After all they are nocow and thus would be handled in another way when
> snapshotted.
>
When snapshotted, nocow files fall back to normal COW behaviour. Thanks,

Josef


* Re: Are nocow files snapshot-aware
  2014-02-05  1:22 ` Josef Bacik
@ 2014-02-05  2:02   ` David Sterba
  2014-02-05 18:17     ` Kai Krakow
  0 siblings, 1 reply; 15+ messages in thread
From: David Sterba @ 2014-02-05  2:02 UTC (permalink / raw)
  To: Josef Bacik; +Cc: Kai Krakow, linux-btrfs

On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
> On 02/04/2014 03:52 PM, Kai Krakow wrote:
> >Hi!
> >
> >I'm curious... The whole snapshot thing on btrfs is based on its COW design.
> >But you can make individual files and directory contents nocow by applying
> >the C attribute on it using chattr. This is usually recommended for database
> >files and VM images. So far, so good...
> >
> >But what happens to such files when they are part of a snapshot? Do they
> >become duplicated during the snapshot? Do they become unshared (as a whole)
> >when written to? Or when the parent snapshot is deleted? Or maybe
> >the nocow attribute is just ignored after a snapshot was taken?
> >
> >After all they are nocow and thus would be handled in another way when
> >snapshotted.
> >
> When snapshotted, nocow files fall back to normal COW behaviour.

This may seem unclear to people not familiar with the actual
implementation, and I had to think for a second about that sentence. The
file will keep the NOCOW status, but any modified blocks will be newly
allocated on the first write (in a COW manner), then the block location
will not change anymore (unlike ordinary COW).
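
(To make the mechanism concrete, here is a toy Python model of what I just
described. The refcounts and integer block IDs are pure illustration, not
btrfs on-disk structures: a snapshot only bumps reference counts, the first
write to a shared NOCOW block relocates it exactly once, and further writes
then stay in place.)

```python
# Toy model: one-time CoW for NOCOW blocks shared with a snapshot.

class ToyFs:
    def __init__(self):
        self.next_block = 0
        self.refcount = {}          # physical block -> number of references

    def alloc(self):
        b = self.next_block
        self.next_block += 1
        self.refcount[b] = 1
        return b

    def create_nocow_file(self, nblocks):
        return [self.alloc() for _ in range(nblocks)]

    def snapshot(self, file):
        # A snapshot just bumps refcounts; no data is copied.
        for b in file:
            self.refcount[b] += 1
        return list(file)

    def write(self, file, idx):
        b = file[idx]
        if self.refcount[b] > 1:
            # Shared with a snapshot: unshare by allocating a new
            # location (one-time CoW), even though the file is NOCOW.
            self.refcount[b] -= 1
            file[idx] = self.alloc()
        # else: exclusive block, NOCOW overwrite in place.

fs = ToyFs()
f = fs.create_nocow_file(4)
snap = fs.snapshot(f)

old = f[0]
fs.write(f, 0)              # first write after snapshot: relocated
relocated = f[0]
fs.write(f, 0)              # second write: stays put (NOCOW again)
print(old != relocated, f[0] == relocated, snap[0] == old)
```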

HTH


* Re: Are nocow files snapshot-aware
  2014-02-05  2:02   ` David Sterba
@ 2014-02-05 18:17     ` Kai Krakow
  2014-02-06  2:38       ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Kai Krakow @ 2014-02-05 18:17 UTC (permalink / raw)
  To: linux-btrfs

David Sterba <dsterba@suse.cz> schrieb:

> On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
>> On 02/04/2014 03:52 PM, Kai Krakow wrote:
>> >Hi!
>> >
>> >I'm curious... The whole snapshot thing on btrfs is based on its COW
>> >design. But you can make individual files and directory contents nocow
>> >by applying the C attribute on it using chattr. This is usually
>> >recommended for database files and VM images. So far, so good...
>> >
>> >But what happens to such files when they are part of a snapshot? Do they
>> >become duplicated during the snapshot? Do they become unshared (as a
>> >whole) when written to? Or when the parent snapshot is deleted?
>> >Or maybe the nocow attribute is just ignored after a snapshot was taken?
>> >
>> >After all they are nocow and thus would be handled in another way when
>> >snapshotted.
>> >
>> When snapshotted, nocow files fall back to normal COW behaviour.
> 
> This may seem unclear to people not familiar with the actual
> implementation, and I had to think for a second about that sentence. The
> file will keep the NOCOW status, but any modified blocks will be newly
> allocated on the first write (in a COW manner), then the block location
> will not change anymore (unlike ordinary COW).

Ah okay, that makes it clear. So, actually, in the snapshot the file is 
still nocow - just with the exception that blocks being written to become 
unshared and relocated. This may introduce a lot of fragmentation, but it 
won't get worse when the same blocks are rewritten over and over again.

> HTH

Yes, it does. ;-)

-- 
Replies to list only preferred.



* Re: Are nocow files snapshot-aware
  2014-02-05 18:17     ` Kai Krakow
@ 2014-02-06  2:38       ` Duncan
  2014-02-07  0:32         ` Kai Krakow
  0 siblings, 1 reply; 15+ messages in thread
From: Duncan @ 2014-02-06  2:38 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Wed, 05 Feb 2014 19:17:10 +0100 as excerpted:

> David Sterba <dsterba@suse.cz> schrieb:
> 
>> On Tue, Feb 04, 2014 at 08:22:05PM -0500, Josef Bacik wrote:
>>> On 02/04/2014 03:52 PM, Kai Krakow wrote:
>>> >Hi!
>>> >
>>> >I'm curious... The whole snapshot thing on btrfs is based on its COW
>>> >design. But you can make individual files and directory contents
>>> >nocow by applying the C attribute on it using chattr. This is usually
>>> >recommended for database files and VM images. So far, so good...
>>> >
>>> >But what happens to such files when they are part of a snapshot? Do
>>> >they become duplicated during the snapshot? Do they become unshared
>>> >(as a whole) when written to? Or when the parent snapshot is
>>> >deleted?
>>> >Or maybe the nocow attribute is just ignored after a snapshot was
>>> >taken?
>>> >
>>> When snapshotted, nocow files fall back to normal COW behaviour.
>> 
>> This may seem unclear to people not familiar with the actual
>> implementation, and I had to think for a second about that sentence.
>> The file will keep the NOCOW status, but any modified blocks will be
>> newly allocated on the first write (in a COW manner), then the block
>> location will not change anymore (unlike ordinary COW).
> 
> Ah okay, that makes it clear. So, actually, in the snapshot the file is
> still nocow - just for the exception that blocks being written to become
> unshared and relocated. This may introduce a lot of fragmentation but it
> won't become worse when rewriting the same blocks over and over again.

That also explains the report of a NOCOW VM image still triggering the 
snapshot-aware-defrag pathology.  It was a _heavily_ auto-snapshotted 
btrfs (thousands of snapshots, something like one every 30 seconds or 
more frequently, without thinning them down right away), and the 
continuing VM writes would nearly guarantee that many of those snapshots 
held unique blocks, so the effect was nearly as bad as if it wasn't NOCOW 
at all!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Are nocow files snapshot-aware
  2014-02-06  2:38       ` Duncan
@ 2014-02-07  0:32         ` Kai Krakow
  2014-02-07  1:01           ` cwillu
  2014-02-07  7:06           ` Duncan
  0 siblings, 2 replies; 15+ messages in thread
From: Kai Krakow @ 2014-02-07  0:32 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> schrieb:

>> Ah okay, that makes it clear. So, actually, in the snapshot the file is
>> still nocow - just for the exception that blocks being written to become
>> unshared and relocated. This may introduce a lot of fragmentation but it
>> won't become worse when rewriting the same blocks over and over again.
> 
> That also explains the report of a NOCOW VM-image still triggering the
> snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
> snapshotted btrfs (thousands of snapshots, something like every 30
> seconds or more frequent, without thinning them down right away), and the
> continuing VM writes would nearly guarantee that many of those snapshots
> had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
> at all!

The question here is: does it really make sense to take such snapshots of 
disk images that are currently online and running a system? They will 
probably be broken after a rollback anyway - or at least I'd not fully 
trust their contents.

VM images should not live in a subvolume that is snapshotted at short, 
regular intervals. The problem goes away if you follow this rule.

The same probably applies to any kind of file you make nocow - e.g. 
database files. Most of those files implement their own transaction 
protection or COW scheme; look at InnoDB files, for example. They gain 
nothing from IO schedulers (InnoDB internally does its own block sorting 
and prioritizing and knows better; interfering even hurts performance), 
nor from file system COW semantics (it does its own transactions and 
atomic updates and can probably do better for its use case). Similar 
reasoning applies to disk images (imagine ZFS, NTFS, ReFS, or btrfs 
images on btrfs). Snapshots can only do harm here (the only "protection" 
use case would be to have a backup, but snapshots are no backups), and 
COW will probably hurt performance a lot. The only use case is taking 
_controlled_ snapshots - and doing it every 30 seconds is by all means 
NOT controlled; it's completely nondeterministic.

-- 
Replies to list only preferred.



* Re: Are nocow files snapshot-aware
  2014-02-07  0:32         ` Kai Krakow
@ 2014-02-07  1:01           ` cwillu
  2014-02-07  1:28             ` Chris Murphy
  2014-02-07  7:06           ` Duncan
  1 sibling, 1 reply; 15+ messages in thread
From: cwillu @ 2014-02-07  1:01 UTC (permalink / raw)
  To: Kai Krakow; +Cc: linux-btrfs

On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> Duncan <1i5t5.duncan@cox.net> schrieb:
>
>>> Ah okay, that makes it clear. So, actually, in the snapshot the file is
>>> still nocow - just for the exception that blocks being written to become
>>> unshared and relocated. This may introduce a lot of fragmentation but it
>>> won't become worse when rewriting the same blocks over and over again.
>>
>> That also explains the report of a NOCOW VM-image still triggering the
>> snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
>> snapshotted btrfs (thousands of snapshots, something like every 30
>> seconds or more frequent, without thinning them down right away), and the
>> continuing VM writes would nearly guarantee that many of those snapshots
>> had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
>> at all!
>
> The question here is: Does it really make sense to create such snapshots of
> disk images currently online and running a system. They will probably be
> broken anyway after rollback - or at least I'd not fully trust the contents.
>
> VM images should not be part of a subvolume of which snapshots are taken at
> a regular and short interval. The problem will go away if you follow this
> rule.
>
> The same applies to probably any kind of file which you make nocow - e.g.
> database files. Most of those file implement their own way of transaction
> protection or COW system, e.g. look at InnoDB files. Neither they gain
> anything from using IO schedulers (because InnoDB internally does block
> sorting and prioritizing and knows better, doing otherwise even hurts
> performance), nor they gain from file system semantics like COW (because it
> does its own transactions and atomic updates and probably can do better for
> its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
> btrfs images on btrfs). Snapshots can only do harm here (the only
> "protection" use case would be to have a backup, but snapshots are no
> backups), and COW will probably hurt performance a lot. The only use case is
> taking _controlled_ snapshots - and doing it all 30 seconds is by all means
> NOT controlled, it's completely undeterministic.

If the database/virtual machine/whatever is crash safe, then the
atomic state that a snapshot grabs will be useful.


* Re: Are nocow files snapshot-aware
  2014-02-07  1:01           ` cwillu
@ 2014-02-07  1:28             ` Chris Murphy
  2014-02-07 21:07               ` Kai Krakow
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2014-02-07  1:28 UTC (permalink / raw)
  To: cwillu; +Cc: Kai Krakow, linux-btrfs


On Feb 6, 2014, at 6:01 PM, cwillu <cwillu@cwillu.com> wrote:

> On Thu, Feb 6, 2014 at 6:32 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
>> Duncan <1i5t5.duncan@cox.net> schrieb:
>> 
>>>> Ah okay, that makes it clear. So, actually, in the snapshot the file is
>>>> still nocow - just for the exception that blocks being written to become
>>>> unshared and relocated. This may introduce a lot of fragmentation but it
>>>> won't become worse when rewriting the same blocks over and over again.
>>> 
>>> That also explains the report of a NOCOW VM-image still triggering the
>>> snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
>>> snapshotted btrfs (thousands of snapshots, something like every 30
>>> seconds or more frequent, without thinning them down right away), and the
>>> continuing VM writes would nearly guarantee that many of those snapshots
>>> had unique blocks, so the effect was nearly as bad as if it wasn't NOCOW
>>> at all!
>> 
>> The question here is: Does it really make sense to create such snapshots of
>> disk images currently online and running a system. They will probably be
>> broken anyway after rollback - or at least I'd not fully trust the contents.
>> 
>> VM images should not be part of a subvolume of which snapshots are taken at
>> a regular and short interval. The problem will go away if you follow this
>> rule.
>> 
>> The same applies to probably any kind of file which you make nocow - e.g.
>> database files. Most of those file implement their own way of transaction
>> protection or COW system, e.g. look at InnoDB files. Neither they gain
>> anything from using IO schedulers (because InnoDB internally does block
>> sorting and prioritizing and knows better, doing otherwise even hurts
>> performance), nor they gain from file system semantics like COW (because it
>> does its own transactions and atomic updates and probably can do better for
>> its use case). Similar applies to disk images (imagine ZFS, NTFS, ReFS, or
>> btrfs images on btrfs). Snapshots can only do harm here (the only
>> "protection" use case would be to have a backup, but snapshots are no
>> backups), and COW will probably hurt performance a lot. The only use case is
>> taking _controlled_ snapshots - and doing it all 30 seconds is by all means
>> NOT controlled, it's completely undeterministic.
> 
> If the database/virtual machine/whatever is crash safe, then the
> atomic state that a snapshot grabs will be useful.

Loosely speaking, how fast is this state fixed on disk from the time of the snapshot command? I'm curious whether it is under a second, a few seconds, or possibly up to the 30-second default commit interval - and whether it's related to the commit interval at all.

I'm also curious what happens to files that are presently being written. E.g. I'm writing a 1GB file to subvol A and before it completes I snapshot subvol A into A.1. If I go find the file I was writing to in A.1, what's its state? Truncated? Or are in-progress writes permitted to complete if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?


Chris Murphy



* Re: Are nocow files snapshot-aware
  2014-02-07  0:32         ` Kai Krakow
  2014-02-07  1:01           ` cwillu
@ 2014-02-07  7:06           ` Duncan
  2014-02-07 21:58             ` Kai Krakow
  1 sibling, 1 reply; 15+ messages in thread
From: Duncan @ 2014-02-07  7:06 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Fri, 07 Feb 2014 01:32:27 +0100 as excerpted:

> Duncan <1i5t5.duncan@cox.net> schrieb:
> 
>> That also explains the report of a NOCOW VM-image still triggering the
>> snapshot-aware-defrag-related pathology.  It was a _heavily_ auto-
>> snapshotted btrfs (thousands of snapshots, something like every 30
>> seconds or more frequent, without thinning them down right away), and
>> the continuing VM writes would nearly guarantee that many of those
>> snapshots had unique blocks, so the effect was nearly as bad as if it
>> wasn't NOCOW at all!
> 
> The question here is: Does it really make sense to create such snapshots
> of disk images currently online and running a system. They will probably
> be broken anyway after rollback - or at least I'd not fully trust the
> contents.
> 
> VM images should not be part of a subvolume of which snapshots are taken
> at a regular and short interval. The problem will go away if you follow
> this rule.
> 
> The same applies to probably any kind of file which you make nocow -
> e.g. database files. The only use case is taking _controlled_ snapshots
> - and doing it all 30 seconds is by all means NOT controlled, it's
> completely undeterministic.

I'd absolutely agree -- and that wasn't my report, I'm just recalling it, 
as at the time I didn't understand the interaction between NOCOW and 
snapshots and couldn't quite understand how a NOCOW file was still 
triggering the snapshot-aware-defrag pathology, which in fact we were 
just beginning to realize based on such reports.

In fact at the time I assumed it was because the NOCOW had been added 
after the file was originally written, such that btrfs couldn't NOCOW it 
properly.  That still might have been the case, but now that I understand 
the interaction between snapshots and NOCOW, I see that such heavy 
snapshotting on an actively written VM could trigger the same issue, even 
if the NOCOW file was created properly and was indeed NOCOW when content 
was actually first written into it.

But definitely agreed.  30-second snapshotting, with a 30-second commit 
deadline, is pretty much off the deep end regardless of the content.  I'd 
even argue that 1-minute snapshotting without thinning the snapshots down 
to, say, 5- or 10-minute spacing after an hour is too extreme to be 
practical.  Even after a couple days of that, how are you going to manage 
the thousands of snapshots, or know which precise snapshot to roll back 
to if you had to?  That's why in the example I posted here some days ago, 
which I considered toward the extreme end of practical, IIRC I had it do 
1-minute snapshots but thin them down to 5 or 10 minutes after a couple 
hours and to half an hour after a couple days, with something like 90-day 
snapshots out to a decade.  Even that I considered extreme, altho at 
least reasonably so.  The point was that even starting with something as 
extreme as 1-minute snapshots and keeping a decade of history, reasonable 
thinning kept it very manageable: something like 250 snapshots total, 
well below the thousands or tens of thousands we're sometimes seeing in 
reports.  Those are hardly practical no matter how you slice it: how 
likely are you to know the exact minute to roll back to, even a month 
out?  And even if you do - if you can survive a month before detecting 
the problem, how important is rolling back to precisely the last minute 
before it?  At a month out, perhaps the hour, but the minute?
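
For the curious, a thinning schedule along those lines is easy to model.  
The tier boundaries and spacings below are assumed round numbers for 
illustration, not my exact figures:

```python
# Count snapshots retained under a tiered thinning schedule:
# dense when young, sparse when old.

MIN, HOUR, DAY, YEAR = 1, 60, 60 * 24, 60 * 24 * 365

# (keep snapshots up to this age, spaced this far apart) -- in minutes
tiers = [
    (2 * HOUR,  1 * MIN),    # per-minute for the first two hours
    (2 * DAY,   30 * MIN),   # half-hourly out to two days
    (90 * DAY,  1 * DAY),    # daily out to 90 days
    (10 * YEAR, 90 * DAY),   # quarterly out to a decade
]

def retained(tiers):
    total, prev_limit = 0, 0
    for limit, spacing in tiers:
        total += (limit - prev_limit) // spacing
        prev_limit = limit
    return total

per_minute_decade = (10 * YEAR) // MIN   # no thinning at all
print(retained(tiers), per_minute_decade)   # a few hundred vs. millions
```

With these assumed tiers the schedule keeps a few hundred snapshots over 
a full decade, versus over five million at an unthinned per-minute 
cadence.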

But some of the snapshotting scripts out there, and the admins running 
them, seem to have the idea that just because it's possible it must be 
done, and they have snapshots taken every minute or more frequently, with 
no automated snapshot thinning at all.  IMO that's pathology run amok 
even if btrfs /was/ stable and mature and /could/ handle it properly.

That's regardless of the content, so it comes at the problem from a 
different angle than yours...  But if admins can't recognize the problem 
with per-minute snapshots left unthinned for days, weeks, months on end, 
I doubt they'll be any better at recognizing that VMs, databases, etc. 
should have a dedicated subvolume.  
Taking the long view, with a bit of luck we'll get to the point where 
database and VM setup scripts and/or documentation recommend setting NOCOW 
on the directory the VMs/DBs/etc. will live in, but in practice even 
that's pushing it, and will take some time (2-5 years) as btrfs stabilizes 
and mainstreams, taking over from ext4 as the assumed Linux default.  
Other than that, I guess it'll be a case-by-case basis as people report 
problems here.  But with a snapshot-aware defrag that actually scales, 
hopefully there won't be so many people reporting problems.  True, they 
might not have the best-optimized systems and may have some minor 
pathologies in their admin practices, but as long as those remain /minor/ 
pathologies, because btrfs can deal with them better than it does now, 
they won't grow into /major/ ones...


But be that as it may, since such extreme snapshotting /is/ possible, and 
with automation and downloadable snapper scripts somebody WILL be doing 
it, btrfs should scale to it if it is to be considered mature and 
stable.  People don't want a filesystem that's going to fall over on them 
and lose data or simply become unworkably live-locked just because they 
didn't know what they were doing when they set up the snapper script and 
configured 1-minute snaps without any corresponding thinning after an hour 
or a day or whatever.


Anyway, the commit temporarily disabling snapshot-aware defrag is now in 
mainline, committed shortly after 3.14-rc1 so it'll be in rc2, giving the 
devs some breathing room to work out a solution that scales rather better 
than what we had.  So defragging is, hopefully temporarily, no longer 
snapshot-aware ATM, but the pathological snapshot-aware-defrag scaling 
issues are at least confined to a bounded set of kernel releases now, so 
the immediately critical problem should die down to some extent once the 
related commits (the patches apparently needed some backporting rework) 
hit stable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: Are nocow files snapshot-aware
  2014-02-07  1:28             ` Chris Murphy
@ 2014-02-07 21:07               ` Kai Krakow
  2014-02-07 21:31                 ` Chris Murphy
  0 siblings, 1 reply; 15+ messages in thread
From: Kai Krakow @ 2014-02-07 21:07 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy <lists@colorremedies.com> schrieb:

>> If the database/virtual machine/whatever is crash safe, then the
>> atomic state that a snapshot grabs will be useful.
> 
> How fast is this state fixed on disk from the time of the snapshot
> command? Loosely speaking. I'm curious if this is < 1 second; a few
> seconds; or possibly up to the 30 second default commit interval? And also
> if it's even related to the commit interval time at all?

Such constructs can only be crash-safe if write barriers are passed down 
through the COW logic of btrfs to the storage layer. That will probably 
never happen. Atomic and transactional updates cannot happen without write 
barriers or synchronous writes. To make it work without them, you would 
need to design the storage layers from the ground up to operate without 
write barriers - battery-backed write caches, synchronous logical 
file-system layers, etc. Otherwise, database/VM/whatever 
transactional/atomic writes just have undefined status down at the lowest 
storage layer.
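
As an aside, the kind of application-level atomicity that does work on top 
of synchronous writes is the classic fsync-then-rename replace, sketched 
below in Python. This is a generic POSIX pattern, nothing btrfs-specific, 
and the file names are made up for illustration:

```python
# Atomic file replacement: readers see the old or the new content,
# never a torn mix, even across a crash (given a working fsync).

import os
import tempfile

def atomic_replace(path, data):
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)   # temp file on the same filesystem
    try:
        os.write(fd, data)
        os.fsync(fd)                    # make the new content durable first
        os.close(fd)
        os.rename(tmp, path)            # atomic switch-over on POSIX
        dirfd = os.open(d, os.O_RDONLY)
        try:
            os.fsync(dirfd)             # make the rename itself durable
        finally:
            os.close(dirfd)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)              # don't leave temp litter behind
        raise

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "state.db")
atomic_replace(target, b"generation 1")
atomic_replace(target, b"generation 2")
print(open(target, "rb").read())        # b'generation 2'
```

The ordering is the whole point: data fsync before the rename, directory 
fsync after it, so the durable states are "old file" or "new file" and 
nothing in between.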

> I'm also curious what happens to files that are presently writing. e.g.
> I'm writing a 1GB file to subvol A and before it completes I snapshot
> subvol A into A.1. If I go find the file I was writing to, in A.1, what's
> its state? Truncated? Or or are in-progress writes permitted to complete
> if it's a rw snapshot? Any difference in behavior if it's an ro snapshot?

I've wondered that many times, too. What happens to files being written 
to? I suppose that at the time of snapshotting it takes the current state 
of the blocks as they are, ignoring pending writes. This means the file 
being written to is probably in a limbo state.

For example, xfs has an option to freeze the file system in order to take 
atomic snapshots. You can use that feature to take consistent snapshots of 
MySQL InnoDB files and create a hot-copy backup. But you need to instruct 
MySQL to complete its transactions and pause first, then run xfs_freeze, 
and only after that's done can you resume MySQL operations. That clearly 
tells me it is probably not safe to take snapshots of online databases, 
even crash-safe ones (and from what I know, InnoDB is designed to be 
crash-safe).

A solution, probably far in the future, could be for a btrfs snapshot to 
inform all current file writers that they should complete their 
transactions and atomic operations, wait until each one signals a ready 
state, take the snapshot, and then signal the processes to resume. For 
this, the btrfs driver could offer some sort of subscription mechanism, 
similar to what inotify offers: processes subscribe to notification 
broadcasts, and btrfs waits for every subscribed process to report an 
integral file state. If I remember right, reiser4 offered a similar 
feature (approaching the problem from the opposite side): processes were 
given an interface to start and commit transactions within reiser4. If 
btrfs had such information from file writers, it could take consistent 
snapshots of online databases/VMs/whatever (given that, in the VM case, 
the guest could pass this information to the host). Whatever approach is 
taken, however, it will make the time needed to create a snapshot 
nondeterministic - processes may not finish their transactions within a 
reasonable time...
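
For what it's worth, the coordination half of this idea can be sketched 
entirely in user space. Everything below (SnapshotCoordinator, the writer 
loop) is hypothetical - btrfs exposes no such subscription API - it just 
illustrates the quiesce-then-snapshot handshake with threads standing in 
for file writers:

```python
# Quiesce-then-snapshot handshake: writers bracket each transaction,
# snapshot() waits for a moment when no transaction is in flight.

import threading

class SnapshotCoordinator:
    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._open = threading.Event()       # set => new transactions allowed
        self._open.set()
        self._inflight = 0

    def begin_txn(self):
        while True:
            self._open.wait()                # park while a snapshot is pending
            with self._lock:
                if self._open.is_set():      # re-check under the lock
                    self._inflight += 1
                    return

    def end_txn(self):
        with self._lock:
            self._inflight -= 1
            if self._inflight == 0:
                self._idle.notify_all()

    def snapshot(self, take):
        with self._lock:
            self._open.clear()               # stop new transactions
            while self._inflight:            # drain in-flight ones
                self._idle.wait()
            result = take()                  # quiescent point: consistent state
        self._open.set()                     # resume writers
        return result

log = []
coord = SnapshotCoordinator()

def writer(count):
    for i in range(count):
        coord.begin_txn()
        log.append(i)                        # the "transaction body"
        coord.end_txn()

threads = [threading.Thread(target=writer, args=(50,)) for _ in range(4)]
for t in threads:
    t.start()
snap_len = coord.snapshot(lambda: len(log))  # never sees a half-done txn
for t in threads:
    t.join()
print(snap_len, len(log))
```

Note the nondeterminism I mentioned shows up even here: snapshot() blocks 
for as long as the slowest in-flight transaction takes to commit.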

-- 
Replies to list only preferred.



* Re: Are nocow files snapshot-aware
  2014-02-07 21:07               ` Kai Krakow
@ 2014-02-07 21:31                 ` Chris Murphy
  2014-02-07 22:26                   ` Kai Krakow
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2014-02-07 21:31 UTC (permalink / raw)
  To: Btrfs BTRFS


On Feb 7, 2014, at 2:07 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:

> Chris Murphy <lists@colorremedies.com> schrieb:
> 
>>> If the database/virtual machine/whatever is crash safe, then the
>>> atomic state that a snapshot grabs will be useful.
>> 
>> How fast is this state fixed on disk from the time of the snapshot
>> command? Loosely speaking. I'm curious if this is < 1 second; a few
>> seconds; or possibly up to the 30 second default commit interval? And also
>> if it's even related to the commit interval time at all?
> 
> Such constructs can only be crash-safe if write-barriers are passed down 
> through the cow logic of btrfs to the storage layer. That won't probably 
> ever happen. Atomic and transactional updates cannot happen without write-
> barriers or synchronous writes. To make it work, you need to design the 
> storage-layers from the ground up to work without write-barriers, like 
> having battery-backed write-caches, synchronous logical file-system layers 
> etc. Otherwise, database/vm/whatever transactional/atomic writes are just 
> having undefined status down at the lowest storage layer.

This explanation makes sense. But I failed to qualify "state fixed on disk". I'm not concerned about when the bits actually arrive on disk; I'm wondering what state they describe. So assume no crash or power failure, and assume the writes eventually make it onto the media without a problem. What state of the subvolume I'm snapshotting do I end up with? Is there a delay, and how long is it, or is it pretty much instant? The command completes really quickly even when the file system is actively in use, so the feedback suggests the snapshot state is established very fast, but I'm not sure what bearing that has in reality.


Chris Murphy



* Re: Are nocow files snapshot-aware
  2014-02-07  7:06           ` Duncan
@ 2014-02-07 21:58             ` Kai Krakow
  0 siblings, 0 replies; 15+ messages in thread
From: Kai Krakow @ 2014-02-07 21:58 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> schrieb:

>> The question here is: Does it really make sense to create such snapshots
>> of disk images currently online and running a system. They will probably
>> be broken anyway after rollback - or at least I'd not fully trust the
>> contents.
>> 
>> VM images should not be part of a subvolume of which snapshots are taken
>> at a regular and short interval. The problem will go away if you follow
>> this rule.
>> 
>> The same applies to probably any kind of file which you make nocow -
>> e.g. database files. The only use case is taking _controlled_ snapshots
>> - and doing it all 30 seconds is by all means NOT controlled, it's
>> completely undeterministic.
> 
> I'd absolutely agree -- and that wasn't my report, I'm just recalling it,
> as at the time I didn't understand the interaction between NOCOW and
> snapshots and couldn't quite understand how a NOCOW file was still
> triggering the snapshot-aware-defrag pathology, which in fact we were
> just beginning to realize based on such reports.

Sorry, didn't mean to pin it on you. ;-) I just wanted to give people 
stumbling upon this some pointers to rethink such practices.

> But some of the snapshotting scripts out there, and the admins running
> them, seem to have the idea that just because it's possible it must be
> done, and they have snapshots taken every minute or more frequently, with
> no automated snapshot thinning at all.  IMO that's pathology run amok
> even if btrfs /was/ stable and mature and /could/ handle it properly.

Yeah, people should stop such "bullshit practice" (sorry), no matter 
whether there's a technical problem with it. It does not give the 
protection it's intended to give - it's just a false sense of 
security/safety... There _may_ be actual use cases for doing it, but 
generally I'd suggest it's plain wrong.

> That's regardless of the content so it's from a different angle than you
> were attacking the problem from...  But if admins aren't able to
> recognize the problem with per-minute snapshots without any thinning at
> all for days, weeks, months on end, I doubt they'll be any better at
> recognizing that VMs, databases, etc, should have a dedicated subvolume.

True.

> But be that as it may, since such extreme snapshotting /is/ possible, and
> with automation and downloadable snapper scripts somebody WILL be doing
> it, btrfs should scale to it if it is to be considered mature and
> stable.  People don't want a filesystem that's going to fall over on them
> and lose data or simply become unworkably live-locked just because they
> didn't know what they were doing when they setup the snapper script and
> set it to 1 minute snaps without any corresponding thinning after an hour
> or a day or whatever.

Such, uhm, sorry, "bullshit practice" should not be a high priority on the 
fix-list for btrfs. There are other areas. It's a technical problem, yes, 
but I think there are more important ones than brute-forcing problems out of 
btrfs that are never hit by normal usage patterns.

It is good that such "tests" are done, but I don't understand how people 
can expect to need such a "feature" - right now and at once. Such tests are 
not ready to leave the development sandbox yet.

From a normal-use perspective, doing such heavy snapshotting is almost 
always nonsense.

I'd be more interested in how btrfs behaves under highly io-loaded server 
patterns. One interesting use case for me would be btrfs as the building 
block of a system with container virtualization (docker, lxc): achieving a 
high vm density on the machine (with the io load and unpredictable io 
behavior that internet-facing servers apply to their storage layer), using 
btrfs snapshots to instantly create new vms from vm templates living in 
subvolumes (thin provisioning), and spreading btrfs across a larger number 
of disks than the average desktop user or standard server has. I think this 
is one of many very interesting use cases for btrfs and its capabilities. 
And this is how we get back to my initial question: In such a scenario I'd 
like to take ro snapshots of all machines (which probably host nocow files 
for databases), send these to a backup server at low io-priority, then 
remove the snapshots. Apparently, btrfs send/receive is still far from 
being stable and bullet-proof from what I read here, so the destination 
would probably be another btrfs or zfs, using in-place rsync backups and 
snapshotting for backlog.
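
A sketch of that backup cycle, assuming a subvolume /srv/vm1 and a remote 
host that also runs btrfs (all paths, hostnames, and snapshot names here 
are made up for illustration; btrfs send requires a read-only snapshot):

```shell
# Take a read-only snapshot -- required for btrfs send.
btrfs subvolume snapshot -r /srv/vm1 /srv/.snap/vm1-new

# Ship it at idle io-priority; -p sends only the delta against a
# parent snapshot that already exists on both sides.
ionice -c3 btrfs send -p /srv/.snap/vm1-prev /srv/.snap/vm1-new \
  | ssh backuphost btrfs receive /backup/vm1

# Drop the superseded local snapshot once the new one is received.
btrfs subvolume delete /srv/.snap/vm1-prev
```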

-- 
Replies to list only preferred.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Are nocow files snapshot-aware
  2014-02-07 21:31                 ` Chris Murphy
@ 2014-02-07 22:26                   ` Kai Krakow
  2014-02-08  6:34                     ` Duncan
  0 siblings, 1 reply; 15+ messages in thread
From: Kai Krakow @ 2014-02-07 22:26 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy <lists@colorremedies.com> schrieb:

> 
> On Feb 7, 2014, at 2:07 PM, Kai Krakow <hurikhan77+btrfs@gmail.com> wrote:
> 
>> Chris Murphy <lists@colorremedies.com> schrieb:
>> 
>>>> If the database/virtual machine/whatever is crash safe, then the
>>>> atomic state that a snapshot grabs will be useful.
>>> 
>>> How fast is this state fixed on disk from the time of the snapshot
>>> command? Loosely speaking. I'm curious if this is < 1 second; a few
>>> seconds; or possibly up to the 30 second default commit interval? And
>>> also if it's even related to the commit interval time at all?
>> 
>> Such constructs can only be crash-safe if write-barriers are passed down
>> through the cow logic of btrfs to the storage layer. That probably won't
>> ever happen. Atomic and transactional updates cannot happen without
>> write- barriers or synchronous writes. To make it work, you need to
>> design the storage-layers from the ground up to work without
>> write-barriers, like having battery-backed write-caches, synchronous
>> logical file-system layers etc. Otherwise, database/vm/whatever
>> transactional/atomic writes are just having undefined status down at the
>> lowest storage layer.
> 
> This explanation makes sense. But I failed to qualify the "state fixed on
> disk". I'm not concerned about when bits actually arrive on disk. I'm
> wondering what state they describe. So assume no crash or power failure,
> and assume writes eventually make it onto the media without a problem.
> What I'm wondering is, what state of the subvolume I'm snapshotting do I
> end up with? Is there a delay and how long is it, or is it pretty much
> instant? The command completes really quickly even when the file system is
> actively being used, so the feedback is that the snapshot state is
> established very fast but I'm not sure what bearing that has in reality.

I think from that perspective, taking a snapshot is more or less the same 
as cycling the power. For the consistency of the file it means the same, I 
suppose. I got your argument about "state fixed on disk", but I implied 
that from the perspective of the writing process it is the same situation: 
at the moment of the snapshot the data file is in a crashed state. That is 
like cycling the power without having a mechanism to support transactional 
guarantees.

So the question is: Do btrfs snapshots give the same guarantees on the 
filesystem level that write-barriers give on the storage level which exactly 
those processes rely upon? The cleanest solution would be if processes could 
give btrfs hints about what belongs to their transactions so in the moment 
of a snapshot the data file would be in a clean state. I guess snapshots 
are atomic in the sense that pending writes will never reach the snapshot 
just taken, which is good.

But what about the ordering of writes? Maybe some younger write requests 
already made it to the disk while older ones didn't. The file system 
usually only has to care about its own transactional integrity, not that of 
its writing processes, and that is completely unrelated to what the writing 
process expects. Or in other words: a following crash only guarantees that 
the active subvolume being written to is clean from the transactional 
perspective of the process, but the snapshot may be broken. As far as I 
know, user processes cannot tell the filesystem when to issue write-
barriers; they can only issue fsyncs (which hurt performance). Otherwise 
this discussion would be a whole different story.

Did you test how btrfs snapshots perform while running fsync with a lot of 
data to be committed? Could give a clue...
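
The ordering problem above is what the classic write-to-temp, fsync, rename 
pattern works around on the application side: however a crash or snapshot 
lands, readers see either the complete old file or the complete new one, 
never a mix. A minimal sketch (file names are illustrative):

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` so that a crash or snapshot at any
    point leaves either the old or the new content, never a mix."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())   # data is durable before the rename
    os.rename(tmp, path)       # atomic replacement within one filesystem
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)          # make the rename itself durable, too
    finally:
        os.close(dfd)

atomic_write("example.txt", b"generation 2")
```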

-- 
Replies to list only preferred.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Are nocow files snapshot-aware
  2014-02-07 22:26                   ` Kai Krakow
@ 2014-02-08  6:34                     ` Duncan
  2014-02-08  8:50                       ` Kai Krakow
  0 siblings, 1 reply; 15+ messages in thread
From: Duncan @ 2014-02-08  6:34 UTC (permalink / raw)
  To: linux-btrfs

Kai Krakow posted on Fri, 07 Feb 2014 23:26:34 +0100 as excerpted:

> So the question is: Do btrfs snapshots give the same guarantees on the
> filesystem level that write-barriers give on the storage level which
> exactly those processes rely upon? The cleanest solution would be if
> processes could give btrfs hints about what belongs to their
> transactions so in the moment of a snapshot the data file would be in
> clean state. I guess snapshots are atomic in that way, that pending
> writes will never reach the snapshots just taken, which is good.

Keep in mind that btrfs' metadata is COW-based also.  Like reiser4 in 
this way, in theory at least, commits are atomic -- they've either made it 
to disk or they haven't; there's no halfway state.  Commits at the leaf 
level propagate up the tree, and are not finalized until the top-level 
root node is written.  AFAIK if there's dirty data to write, btrfs 
triggers a root node commit every 30 seconds.  Until that root is 
rewritten, it points to the last consistent-state written root node.  
Once it's rewritten, it points to the new one and a new set of writes are 
started, only to be finalized at the next root node write.

And I believe that final write simply updates a pointer to point at the 
latest root node.  There's also a history of root nodes, which is what 
the btrfs-find-root tool uses in combination with btrfs restore, if 
necessary, to find a valid root from the root node pointer log if the 
system crashed in the middle of that final update so the pointer ends up 
pointing at garbage.

Meanwhile, I'm a bit blurry on this but if I understand things correctly, 
between root node writes/full-filesystem-commits there's a log of 
transaction completions at the atomic individual transaction level, such 
that even transactions completed between root node writes can normally be 
replayed.  Of course this is only ~30 seconds worth of activity max, 
since the root node writes should occur every 30 seconds, but this is 
what btrfs-zero-log zeroes out, if/when needed.  You'll lose that few 
seconds of log replay since the last root node write, but if it was 
garbage data due to it being written when the system actually went down, 
dropping those few extra seconds of log can allow the filesystem to mount 
properly from the last full root node commit, where it couldn't, 
otherwise.
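
For reference, the recovery path sketched above looks roughly like this 
with the tools of that era (device, paths, and the byte number are 
illustrative; only run these against an unmounted filesystem):

```shell
# List older tree roots still reachable on the device.
btrfs-find-root /dev/sdb1

# Extract files as of a chosen root (by its byte number from the
# listing above) into a rescue directory, without mounting.
btrfs restore -t 1234567890 /dev/sdb1 /mnt/rescue

# Last resort: drop the (possibly garbage) log tree so the last
# full root-node commit can mount again.
btrfs-zero-log /dev/sdb1
```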

It's actually those metadata trees and the atomic root-node commit 
feature that btrfs snapshots depend on, and why they're normally so fast 
to create.  When a snapshot is taken, btrfs simply keeps a record of the 
current root node instead of letting it recede into history and fall off 
the end of the root node log, labeling that record with the name of the 
snapshot for humans as well as the object-ID that btrfs uses.  That root 
node is by definition a record of the filesystem in a consistent state, 
so any snapshot that's a reference to it is similarly by definition in a 
consistent state.

So normally, files in the process of being written out (created) simply 
wouldn't appear in the snapshot.  Of course preexisting files will appear 
(and fallocated files are simply the blanked-out-special-case of 
preexisting), but again, with normal COW-based files at least, will exist 
in a state either before the latest transaction started, or after it 
finished, which of course is where fsync comes in, since that's how 
userspace apps communicate file transactions to the filesystem.

And of course in addition to COW, btrfs normally does checksumming as 
well, and again, the filesystem including that checksumming will be self-
consistent when a root-node is written, or it won't be written until the 
filesystem /is/ self-consistent.  If for whatever reason there's garbage 
when btrfs attempts to read the data back, which is exactly what btrfs 
defines it as if it doesn't pass checksum, btrfs will refuse to use that 
data.  If there's a second copy somewhere (as with raid1 mode), it'll try 
to restore from that second copy.  If it can't, btrfs will return an 
error and simply won't let you access that file.

So one way or another, a snapshot is deterministic and atomic.  No 
partial transactions, at least on ordinary COW and checksummed files.

Which brings us to NOCOW files, where for btrfs NOCOW also turns off 
checksumming.  Btrfs will write these files in-place, and as a result 
there's not the transaction integrity guarantee on these files that there 
is on ordinary files.

*HOWEVER*, the situation isn't as bad as it might seem, because most 
files where NOCOW is recommended, database files, VM images, pre-
allocated torrent files, etc, are created and managed by applications 
that already have their own data integrity management/verification/repair 
methods, since they're designed to work on filesystems without the data 
integrity guarantees btrfs normally provides.

In fact, it's possible - even likely in case of a crash - that the 
application's own data integrity mechanisms fight with those of btrfs.  
Letting a btrfs scrub restore what it thinks is a good copy can actually 
interfere with the application's own integrity and repair functionality: 
the application often goes to quite some lengths to repair damage, or 
simply reverts to a checkpoint if it has to, but it doesn't expect the 
filesystem to be making such changes underneath it and isn't prepared to 
deal with filesystems that do.  There have in fact been several reports to 
the list of what appears to be exactly that happening!

So in fact it's often /better/ to turn off both COW and checksumming via 
NOCOW, if you know your application manages such things.  That way the 
filesystem doesn't try to repair the damage in case of a crash, which 
leaves the application's own functionality to handle it and repair or 
roll back as it is designed to do.

That's with crashes.  The one quirk that's left to deal with is how 
snapshots deal with NOCOW files.  As explained earlier, snapshots leave a 
NOCOW file as-is initially, but will COW it ONCE, the first time a 
snapshotted NOCOW file-block is written to in that snapshot, thus 
diverging it from the shared version.

A snapshot thus looks much like a crash in terms of NOCOW file integrity 
since the blocks of a NOCOW file are simply snapshotted in-place, and 
there's already no checksumming or file integrity verification on such 
files -- they're simply directly written in-place (with the exception of 
a single COW write when a writable snapshotted NOCOW file diverges from 
the shared snapshot version).

But as I said, the applications themselves are normally designed to 
handle and recover from crashes, and in fact, having btrfs try to manage 
it too only complicates things and can actually make it impossible for 
the app to recover what it would have otherwise recovered just fine.

So it should be with these NOCOW in-place snapshotted files, too.  If a 
NOCOW file is put back into operation from a snapshot, and the file was 
being written to at snapshot time, it'll very likely trigger exactly the 
same response from the application as a crash while writing would have 
triggered, but, the point is, such applications are normally designed to 
deal with just that, and thus, they should recover just as they would 
from a crash.  If they could recover from a crash, it shouldn't be an 
issue.  If they couldn't, well...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Are nocow files snapshot-aware
  2014-02-08  6:34                     ` Duncan
@ 2014-02-08  8:50                       ` Kai Krakow
  0 siblings, 0 replies; 15+ messages in thread
From: Kai Krakow @ 2014-02-08  8:50 UTC (permalink / raw)
  To: linux-btrfs

Duncan <1i5t5.duncan@cox.net> schrieb:

[...]

Difficult to twist your mind around that but well explained. ;-)

> A snapshot thus looks much like a crash in terms of NOCOW file integrity
> since the blocks of a NOCOW file are simply snapshotted in-place, and
> there's already no checksumming or file integrity verification on such
> files -- they're simply directly written in-place (with the exception of
> a single COW write when a writable snapshotted NOCOW file diverges from
> the shared snapshot version).
> 
> But as I said, the applications themselves are normally designed to
> handle and recover from crashes, and in fact, having btrfs try to manage
> it too only complicates things and can actually make it impossible for
> the app to recover what it would have otherwise recovered just fine.
> 
> So it should be with these NOCOW in-place snapshotted files, too.  If a
> NOCOW file is put back into operation from a snapshot, and the file was
> being written to at snapshot time, it'll very likely trigger exactly the
> same response from the application as a crash while writing would have
> triggered, but, the point is, such applications are normally designed to
> deal with just that, and thus, they should recover just as they would
> from a crash.  If they could recover from a crash, it shouldn't be an
> issue.  If they couldn't, well...

So we agree that taking a snapshot looks like a crash from the 
application's perspective. That means if there are facilities to instruct 
the application to suspend its operations first, you should use them - like 
in the InnoDB case:

http://dev.mysql.com/doc/refman/5.1/en/lock-tables.html:

| FLUSH TABLES WITH READ LOCK;
| SHOW MASTER STATUS;
| SYSTEM xfs_freeze -f /var/lib/mysql;
| SYSTEM YOUR_SCRIPT_TO_CREATE_SNAPSHOT.sh;
| SYSTEM xfs_freeze -u /var/lib/mysql;
| UNLOCK TABLES;
| EXIT;

Only that way do you get consistent snapshots and avoid triggering crash-
recovery (which might otherwise throw away unrecoverable transactions or 
harm your data for the sake of consistency). InnoDB is more or less like a 
vm filesystem image on btrfs in this case. So the same approach should be 
taken for vm images if possible. I think VMware has facilities to prepare 
the guest for a snapshot being taken (it is triggered when you take 
snapshots with VMware itself, and btw it usually takes much longer than 
btrfs snapshots do).
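
On btrfs the xfs_freeze step is arguably redundant, since the snapshot 
itself is atomic - it's the read lock that makes the InnoDB state 
consistent. A hedged adaptation of the recipe above (paths and snapshot 
name are made up):

```shell
#!/bin/sh
# The read lock must be held while the snapshot is taken, so the
# whole sequence runs inside one mysql client session.
mysql -u root <<'SQL'
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
SYSTEM btrfs subvolume snapshot -r /var/lib/mysql /var/lib/mysql-snap;
UNLOCK TABLES;
SQL
```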

Take xfs for example: although it is crash-safe, it prefers to zero out 
your files for security reasons during log-replay - because it is crash-
safe only for meta-data. If meta-data has already allocated blocks but 
file-data has not yet been written, a recovered file might otherwise end up 
with wrong content, so it is cleared out. This _IS_NOT_ the situation you 
want with vm images with xfs inside hosted on btrfs when taking a snapshot. 
You should trigger xfs_freeze in the guest before taking the btrfs snapshot 
in the host.

I think the same holds true for most other meta-data-only-journalling file 
systems, which probably do not even zero out files during recovery and just 
silently hand you corrupted files after crash-recovery.

So in case of a crash or snapshot (which look the same from the application 
perspective), btrfs' capabilities won't help you here (at least in the 
nocow case, probably in the cow case too, because the vm guest may write 
blocks out-of-order without having the possibility to pass write-barriers 
down to the btrfs cow mechanism). Taking snapshots of database files or vm 
images without proper preparation only guarantees you crash-like rollback 
situations. Taking snapshots even at short intervals only makes this worse, 
with all the extra downsides this has within btrfs.

I think this is important to understand for people planning to do automated 
snapshots of such file data. Making a file nocow only helps during normal 
operation - after a snapshot, a nocow file is essentially cow again while 
its blocks from the old generation are carried over to the new subvolume 
generation on write.
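
For completeness, nocow is set per file or directory with chattr; note that 
the C flag only takes full effect on files created empty inside a +C 
directory - setting it on a file that already contains data is not reliable 
(paths are illustrative):

```shell
mkdir -p /srv/db
chattr +C /srv/db        # new files in here are created nocow
touch /srv/db/data.ibd   # created empty, so nocow applies cleanly
lsattr -d /srv/db        # should list the 'C' attribute
```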

-- 
Replies to list only preferred.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2014-02-08  9:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-02-04 20:52 Are nocow files snapshot-aware Kai Krakow
2014-02-05  1:22 ` Josef Bacik
2014-02-05  2:02   ` David Sterba
2014-02-05 18:17     ` Kai Krakow
2014-02-06  2:38       ` Duncan
2014-02-07  0:32         ` Kai Krakow
2014-02-07  1:01           ` cwillu
2014-02-07  1:28             ` Chris Murphy
2014-02-07 21:07               ` Kai Krakow
2014-02-07 21:31                 ` Chris Murphy
2014-02-07 22:26                   ` Kai Krakow
2014-02-08  6:34                     ` Duncan
2014-02-08  8:50                       ` Kai Krakow
2014-02-07  7:06           ` Duncan
2014-02-07 21:58             ` Kai Krakow

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).