Fwd: [virt-devel] btrfs NOCOW for VM disk images

linux-btrfs.vger.kernel.org archive mirror

* Fwd: [virt-devel] btrfs NOCOW for VM disk images
       [not found] <20131122142051.GA32192@stefanha-thinkpad.redhat.com>
@ 2013-11-22 16:17 ` John Dulaney
  2013-11-22 21:26   ` Duncan
  2013-11-22 22:12   ` Chris Murphy
  0 siblings, 2 replies; 5+ messages in thread
From: John Dulaney @ 2013-11-22 16:17 UTC
  To: linux-btrfs

----- Forwarded Message -----
From: "Stefan Hajnoczi" <stefanha@redhat.com>
To: "Eric Sandeen" <sandeen@redhat.com>
Cc: virt-devel@redhat.com, "Kevin Wolf" <kwolf@redhat.com>
Sent: Friday, November 22, 2013 9:20:51 AM
Subject: [virt-devel] btrfs NOCOW for VM disk images

Hi,
In upstream QEMU we're discussing patches that set the NOCOW flag on
disk image files.  We're told that this increases btrfs performance
greatly since the file system will modify data in-place like ext4/xfs.

During testing I found that the NOCOW flag prevents file cloning from
working.  cp --reflink fails with EINVAL when the source file has the
NOCOW flag set.

It is not possible to toggle NOCOW back and forth later on since it can
only be set when no data has been allocated for the file yet.

This leaves us with the choice between performance (NOCOW) and snapshots
(default).  Both are important for VM disk images!

Questions:

 * Would it be possible to extend btrfs so that cp --reflink works on
   NOCOW files?  (Clueless idea: quiesce I/O to the NOCOW file and clone
   it, then resume I/O and COW only writes to shared blocks.)

 * Does NOCOW prevent any other functionality besides file-level cloning?

 * Does NOCOW increase risk of data loss/corruption?  (I guess yes since
   overwriting in place puts data at risk of power failure or drive
   failure.)

Thanks,
Stefan

-- 
John Dulaney, RHCE
IRC: handsome_pirate
jdulaney.wordpress.com


* Re: Fwd: [virt-devel] btrfs NOCOW for VM disk images
  2013-11-22 16:17 ` Fwd: [virt-devel] btrfs NOCOW for VM disk images John Dulaney
@ 2013-11-22 21:26   ` Duncan
  2013-11-22 22:00     ` Roman Mamedov
  2013-11-22 22:12   ` Chris Murphy
  1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2013-11-22 21:26 UTC
  To: linux-btrfs

John Dulaney posted on Fri, 22 Nov 2013 11:17:34 -0500 as excerpted:

> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files.  We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.

Indeed.  For VM images and similar large "internally modified" files, 
NOCOW is definitely recommended, since otherwise they can very rapidly 
become extremely heavily fragmented.  This is a use-case that COW-based 
filesystems simply don't deal well with.

> During testing I found that the NOCOW flag prevents file cloning from
> working.  cp --reflink fails with EINVAL when the source file has the
> NOCOW flag set.

That would be expected, since disabling COW means the file will be 
updated in-place, and if reflink-copying was allowed, changing the one 
view in-place would by definition change the other view of the same file, 
since it /is/ the same file data.

If you want both views of the file to change together, why not use a 
normal hardlink?  If you don't want them to change together, then you 
can't set NOCOW and reflink-copy, since by definition NOCOW makes changes 
in-place, and if reflinks were allowed, that'd change both views.

Quoting the cp manpage --reflink discussion:

>>>>

When --reflink[=always] is specified, perform a lightweight copy, where 
the data blocks are copied only when modified.  If this is not possible 
the copy fails, or if --reflink=auto is specified, fall back to a 
standard copy.

<<<<

Since you disabled COW, the data blocks cannot be copied when modified, 
so the copy fails (or with auto falls back to a normal copy).  Defined, 
documented and expected behavior.

> It is not possible to toggle NOCOW back and forth later on since it can
> only be set when no data has been allocated for the file yet.
> 
> This leaves us with the choice between performance (NOCOW) and snapshots
> (default).  Both are important for VM disk images!
> 
> Questions:
> 
>  * Would it be possible to extend btrfs so that cp --reflink works on
>    NOCOW files?  (Clueless idea: quiesce I/O to the NOCOW file and clone
>    it, then resume I/O and COW only writes to shared blocks.)

Of course it's /possible/, but doing so would pervert the definition of 
NOCOW or of reflink or both.  Either reflinks would effectively become 
hardlinks and writing to one view of the NOCOW data would change them 
all, or it would no longer be NOCOW.  Since hardlinks already exist as a 
solution and COW is the default...

>  * Does NOCOW prevent any other functionality besides file-level
>  cloning?

Being a simple btrfs user/sysadmin, I'm not sure about the file-level 
option, but certainly when given as a mount option (nodatacow)[1], both 
data checksumming and file compression are turned off as well.  Given the 
technical requirements, I'd assume the same applies to NOCOW file 
attributes as well.

It's worth noting that there have been several bugs related to this as 
well, where btrfs was doing the wrong thing with "internally changed" 
files in one case or another.  One now fixed bug was triggered most often 
with systemd's journal, where systemd was doing direct-IO and btrfs 
wasn't properly handling checksums.  (Someone else reported a file-
preallocating bittorrent client triggering that same bug, so it wasn't 
/just/ systemd triggering it, but systemd was the most widely deployed 
and thus most common trigger.)   Turning off checksums for this sort of 
"internally changed" image file thus becomes the easiest way to avoid 
such issues and NOCOW is the way this type of file usage pattern is 
conveyed to the filesystem.   Mixing compression and internal-writes is 
another problematic situation, so turning that off for NOCOW files also 
makes sense.

>  * Does NOCOW increase risk of data loss/corruption?  (I guess yes since
>    overwriting in place puts data at risk of power failure or drive
>    failure.)

Absolutely, for that file at least.  The loss of data checksumming means 
loss of that normally important data integrity check as well, tho at the 
same time it's actually safer in some ways since you don't have the 
filesystem checksums fighting and racing with the internal file updates, 
the source of the now fixed systemd journal triggered bug mentioned above.

However, NOCOW on a large and very frequently internally changed file 
arguably makes other data/metadata on the filesystem safer.  The very 
frequent changes are now contained and isolated to their own unchanging 
location on the filesystem, so there is no constantly changing, partially 
shared extent-tracking metadata and no data checksum records to keep 
updated at the same time, and thus no possibility of endangering the 
other files sharing the same metadata records.

[1] Btrfs mount options.  They aren't yet documented in the mount manpage, 
and mount doesn't ship with btrfs-progs so there's no manpage 
documentation for mount options there, so the kernel's btrfs.txt file and 
the wiki are the only good places to look up btrfs-specific mount-options:

https://btrfs.wiki.kernel.org/index.php/Mount_options

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: [virt-devel] btrfs NOCOW for VM disk images
  2013-11-22 21:26   ` Duncan
@ 2013-11-22 22:00     ` Roman Mamedov
  2013-11-23  1:21       ` David Sterba
  0 siblings, 1 reply; 5+ messages in thread
From: Roman Mamedov @ 2013-11-22 22:00 UTC
  To: Duncan; +Cc: linux-btrfs

On Fri, 22 Nov 2013 21:26:16 +0000 (UTC)
Duncan <1i5t5.duncan@cox.net> wrote:

> > During testing I found that the NOCOW flag prevents file cloning from
> > working.  cp --reflink fails with EINVAL when the source file has the
> > NOCOW flag set.
> 
> That would be expected, since disabling COW means the file will be 
> updated in-place, and if reflink-copying was allowed, changing the one 
> view in-place would by definition change the other view of the same file, 
> since it /is/ the same file data.

However snapshotting a subvolume which has NOCOW files *is* allowed.
I'm told data is then COW'ed only once, and only the areas that are changed
after the snapshot has been made (or something along those lines). So since
snapshotting+NOCOW can be combined and everything works automagically as
expected, maybe reflink could be made to work as well?
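
The snapshot case described above can be tried with something like the following. This is only a rough sketch, assuming the images live in a btrfs subvolume; snapshot_images is a hypothetical helper name, and both paths are placeholders chosen by the caller.

```shell
# Sketch: snapshot a subvolume that holds NOCOW image files.
# snapshot_images is a hypothetical helper, not part of btrfs-progs.
snapshot_images() {
    if [ $# -ne 2 ]; then
        echo "usage: snapshot_images SUBVOLUME SNAPSHOT_PATH" >&2
        return 2
    fi
    # List attributes of the files inside; NOCOW files show a 'C' flag.
    lsattr "$1"
    # Take the snapshot of the whole subvolume.
    btrfs subvolume snapshot "$1" "$2"
}
```

Called as snapshot_images /path/to/subvol /path/to/snap, it lists the file attributes first so the NOCOW files can be identified before the snapshot is taken.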

-- 
With respect,
Roman


* Re: [virt-devel] btrfs NOCOW for VM disk images
  2013-11-22 22:00     ` Roman Mamedov
@ 2013-11-23  1:21       ` David Sterba
  0 siblings, 0 replies; 5+ messages in thread
From: David Sterba @ 2013-11-23  1:21 UTC
  To: Roman Mamedov; +Cc: Duncan, linux-btrfs

On Sat, Nov 23, 2013 at 04:00:28AM +0600, Roman Mamedov wrote:
> On Fri, 22 Nov 2013 21:26:16 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
> 
> > > During testing I found that the NOCOW flag prevents file cloning from
> > > working.  cp --reflink fails with EINVAL when the source file has the
> > > NOCOW flag set.
> > 
> > That would be expected, since disabling COW means the file will be 
> > updated in-place, and if reflink-copying was allowed, changing the one 
> > view in-place would by definition change the other view of the same file, 
> > since it /is/ the same file data.
> 
> However snapshotting a subvolume which has NOCOW files *is* allowed.
> I'm told data is then COW'ed only once, and only the areas that are changed
> after the snapshot has been made (or something along those lines).

This is correct.

> So since snapshotting+NOCOW can be combined and everything works
> automagically as expected, maybe reflink could be made to work as
> well?

This works (to my own surprise). The clone ioctl checks if the files
have the same status regarding checksums, so reflink from nocow -> nocow
should work.

What does not work is

  $ cp --reflink=always nocow somefile
  cp: failed to clone ‘somefile’ from ‘nocow’: Invalid argument

because cp creates somefile without nocow status. But after precreating
somefile with chattr +C and then calling the command above, cp does not
complain.
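
The workaround of pre-creating the destination can be wrapped up roughly as follows. This is a minimal sketch, assuming source and destination are on the same btrfs filesystem; reflink_nocow is a hypothetical helper name, not something shipped with coreutils or btrfs-progs.

```shell
# Sketch: reflink-copy a NOCOW file by giving the destination +C first.
# reflink_nocow is a hypothetical helper name.
reflink_nocow() {
    src=$1
    dst=$2
    # +C can only be set while the file has no data, so refuse to reuse
    # an existing destination.
    if [ -e "$dst" ]; then
        echo "refusing to overwrite existing $dst" >&2
        return 1
    fi
    touch "$dst" || return 1
    # Give the empty destination the same nocow status as the source.
    chattr +C "$dst" || { rm -f "$dst"; return 1; }
    # Clone the data; with =always cp fails instead of doing a full copy.
    cp --reflink=always "$src" "$dst"
}
```

For example, reflink_nocow nocow somefile would correspond to the sequence described above.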

Rewriting one file does not modify the other, though the output of
filefrag after the modification does not seem to reflect that the files
no longer share the same blocks:

File size of somefile is 2097152 (512 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     511:   54865266..  54865777:    512:             eof
somefile: 1 extent found

File size of nocow is 2097152 (512 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..     511:   54865266..  54865777:    512:             eof
nocow: 1 extent found

The files were dd'ed with zeros and differ in the first 4k after:

  dd if=/dev/urandom of=somefile bs=4k count=1 conv=notrunc

So there's a bug somewhere, probably in reporting extents through fiemap of
the modified nocow file. This hides the actual position of the new block and
fragmentation.
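
One way to spot this mismatch is to compare the file contents and the reported physical extents side by side. The sketch below assumes the `filefrag -v` column layout shown above (other e2fsprogs versions may print it differently); extent_rows, same_physical_layout and report_divergence are hypothetical helper names.

```shell
# Sketch: detect "contents differ, but filefrag reports the same extents".
# Assumes the colon-separated `filefrag -v` layout shown above.

# Read `filefrag -v` output on stdin and print the physical start..end
# range of each extent row, with the padding spaces removed.
extent_rows() {
    awk -F: '/^ *[0-9]+:/ { gsub(/ /, "", $3); print $3 }'
}

# Succeed when both files report identical physical extent ranges.
same_physical_layout() {
    [ "$(filefrag -v "$1" | extent_rows)" = "$(filefrag -v "$2" | extent_rows)" ]
}

# Print a warning when the contents differ but the layouts look the same.
report_divergence() {
    if ! cmp -s "$1" "$2" && same_physical_layout "$1" "$2"; then
        echo "contents differ but filefrag reports identical physical extents"
    fi
}
```

Running report_divergence nocow somefile after the dd above would be expected to print the warning if the fiemap reporting is indeed wrong.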


david


* Re: [virt-devel] btrfs NOCOW for VM disk images
  2013-11-22 16:17 ` Fwd: [virt-devel] btrfs NOCOW for VM disk images John Dulaney
  2013-11-22 21:26   ` Duncan
@ 2013-11-22 22:12   ` Chris Murphy
  1 sibling, 0 replies; 5+ messages in thread
From: Chris Murphy @ 2013-11-22 22:12 UTC
  To: John Dulaney; +Cc: linux-btrfs

On Nov 22, 2013, at 9:17 AM, John Dulaney <jdulaney@redhat.com> wrote:
> 
> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files.  We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.

The best-performing qemu/kvm results I have are with Btrfs on the host, a preallocated Raw file with the +C file attribute (chattr +C), and Btrfs used in the guest. The benchmark is a Fedora 20 installation, timed using anaconda's time stamping of the start and completion of the installation. The test matrix is 3x3: ext4, XFS, Btrfs. So each fs was used on the host, and in the guest.

By "best performing" we're talking about maybe 20-30 seconds better over a 7-8 minute install time on spinning rust. So with respect to installing an OS (the live image uses rsync), it seems Btrfs on Btrfs is at least no worse off than other file systems.

A 20GB preallocated Raw on Btrfs with +C set has 33 extents, which doesn't ever change.

When I do this with a qcow2 file with preallocated metadata, it starts out with only 5 extents upon creation, but with each successive installation using the same qcow2 file, also with +C set, the extent count grows quite a bit. It's unclear from the testing whether this negatively impacts performance, or whether the extent increase eventually flattens out.

after installation1> fedoratest.img: 1255 extents found
after installation2> fedoratest.img: 1773 extents found
after installation3> fedoratest.img: 2148 extents found
after installation4> fedoratest.img: 2245 extents found

This is a whole lot less, however, than a non-preallocated Raw file without +C, which rapidly ends up with tens of thousands of extents with no end in sight.
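
For anyone wanting to reproduce the preallocated Raw case, the setup can be sketched roughly like this. It is a minimal sketch, assuming util-linux's fallocate and e2fsprogs' chattr and filefrag are available; make_nocow_raw is a hypothetical helper name, and the path and size are whatever the caller picks.

```shell
# Sketch: create a preallocated raw image with +C set.
# make_nocow_raw is a hypothetical helper name.
make_nocow_raw() {
    img=$1
    size=$2
    # +C has to be set while the file is still empty, so never reuse a file.
    if [ -e "$img" ]; then
        echo "$img already exists" >&2
        return 1
    fi
    touch "$img" || return 1
    chattr +C "$img" || { rm -f "$img"; return 1; }
    # Preallocate the full size up front.
    fallocate -l "$size" "$img" || { rm -f "$img"; return 1; }
    # Print the extent count ("NAME: N extents found").
    filefrag "$img"
}
```

Something like make_nocow_raw fedoratest.img 20G would then give a file comparable to the one tested above, and filefrag can be rerun after each installation to track the extent count.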

> This leaves us with the choice between performance (NOCOW) and snapshots
> (default).  Both are important for VM disk images!

Some testing needs to be done with qcow2 on Btrfs with +C long term to see if there's a meaningful performance hit as the qcow2 ages.

It may also be possible to defragment the qcow2 file once extent allocation tapers off.

And another possibility would be for qcow2 to support full preallocation, so that its initial extent count is no worse than Raw.

If you don't need host-based snapshotting, another possibility is using Btrfs in the guest, and snapshotting within the guest. Whether this is preferred depends on the use case, but I think there could be some advantages to snapshotting within the guest. In this case, using Btrfs in the guest, regardless of the backing method used, gives the guest the ability to at least flag fs/data corruption, if not repair it (if a raid 1+ data profile is employed).

Chris Murphy
