* Re: Fwd: [virt-devel] btrfs NOCOW for VM disk images
2013-11-22 16:17 ` Fwd: [virt-devel] btrfs NOCOW for VM disk images John Dulaney
@ 2013-11-22 21:26 ` Duncan
2013-11-22 22:00 ` Roman Mamedov
2013-11-22 22:12 ` Chris Murphy
1 sibling, 1 reply; 5+ messages in thread
From: Duncan @ 2013-11-22 21:26 UTC (permalink / raw)
To: linux-btrfs
John Dulaney posted on Fri, 22 Nov 2013 11:17:34 -0500 as excerpted:
> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files. We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.
Indeed. For VM images and similar large "internally modified" files,
NOCOW is definitely recommended, since otherwise they can very rapidly
become extremely heavily fragmented. This is a use-case that COW-based
filesystems simply don't deal well with.
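For illustration only, here is a rough Python sketch of setting the flag
on a freshly created, still-empty image file, which is the only point at
which it takes effect. The ioctl numbers and flag value are the ones
from <linux/fs.h> for 64-bit Linux (an assumption, other ABIs differ),
and create_nocow_file is a made-up helper name, not anything QEMU or
btrfs-progs provides. From a shell, chattr +C on the empty file does the
same job.

```python
import fcntl
import os
import struct

# Values from <linux/fs.h> on 64-bit Linux (assumption: other ABIs differ).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the inode flag that `chattr +C` sets


def create_nocow_file(path):
    """Create an empty file and try to mark it NOCOW before any data exists.

    Hypothetical helper. Returns True if the flag is set afterwards, False
    if the filesystem refused or does not support the request.
    """
    # O_EXCL guarantees the file is new and therefore still empty.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        try:
            raw = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("I", 0))
            flags = struct.unpack("I", raw)[0]
            fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                        struct.pack("I", flags | FS_NOCOW_FL))
            # Read the flags back rather than trusting the set call.
            raw = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("I", 0))
            return bool(struct.unpack("I", raw)[0] & FS_NOCOW_FL)
        except OSError:
            return False
    finally:
        os.close(fd)
```

On a filesystem without NOCOW support the helper simply returns False
and leaves an ordinary empty file behind, so a caller can carry on
without the flag.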
> During testing I found that the NOCOW flag prevents file cloning from
> working. cp --reflink fails with EINVAL when the source file has the
> NOCOW flag set.
That would be expected, since disabling COW means the file will be
updated in-place, and if reflink-copying were allowed, changing the one
view in-place would by definition change the other view of the same file,
since it /is/ the same file data.
If you want both views of the file to change together, why not use a
normal hardlink? If you don't want them to change together, then you
can't set NOCOW and reflink-copy, since by definition NOCOW makes changes
in-place, and if reflinks were allowed, that'd change both views.
Quoting the cp manpage --reflink discussion:
>>>>
When --reflink[=always] is specified, perform a lightweight copy, where
the data blocks are copied only when modified. If this is not possible
the copy fails, or if --reflink=auto is specified, fall back to a
standard copy.
<<<<
Since you disabled COW, the data blocks cannot be copied when modified,
so the copy fails (or with auto falls back to a normal copy). Defined,
documented and expected behavior.
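To make the =auto behavior concrete, here is a minimal Python sketch of
"try to clone, else do a standard copy". It is not how cp itself is
written, just the same idea under stated assumptions: FICLONE is the
clone ioctl number from <linux/fs.h>, and copy_reflink_auto is a made-up
name for this example.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # clone ioctl number from <linux/fs.h> (Linux only)


def copy_reflink_auto(src, dst):
    """Try a reflink clone of src into dst; fall back to a standard copy.

    Hypothetical helper. Returns "reflink" or "copy" to report which
    path was taken.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the filesystem to share src's extents with dst.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            pass  # cloning not possible here (EINVAL, EOPNOTSUPP, EXDEV, ...)
    # Equivalent of the fallback to a standard copy.
    shutil.copyfile(src, dst)
    return "copy"
```

A strict --reflink=always equivalent would re-raise the OSError instead
of falling back.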
> It is not possible to toggle NOCOW back and forth later on since it can
> only be set when no data has been allocated for the file yet.
>
> This leaves us with the choice between performance (NOCOW) and snapshots
> (default). Both are important for VM disk images!
>
> Questions:
>
> * Would it be possible to extend btrfs so that cp --reflink works on
> NOCOW files? (Clueless idea: quiesce I/O to the NOCOW file and clone
> it, then resume I/O and COW only writes to shared blocks.)
Of course it's /possible/, but doing so would pervert the definition of
NOCOW or of reflink or both. Either reflinks would effectively become
hardlinks and writing to one view of the NOCOW data would change them
all, or it would no longer be NOCOW. Since hardlinks already exist as a
solution and COW is the default...
> * Does NOCOW prevent any other functionality besides file-level
> cloning?
Being a simple btrfs user/sysadmin, I'm not sure about the file-level
option, but certainly when given as a mount option (nodatacow)[1], both
data checksumming and file compression are turned off too. Given the
technical requirements, I'd assume the same applies to the NOCOW file
attribute.
It's worth noting that there have been several bugs related to this as
well, where btrfs was doing the wrong thing with "internally changed"
files in one case or another. One now fixed bug was triggered most often
with systemd's journal, where systemd was doing direct-IO and btrfs
wasn't properly handling checksums. (Someone else reported a file-
preallocating bittorrent client triggering that same bug, so it wasn't
/just/ systemd triggering it, but systemd was the most widely deployed
and thus most common trigger.) Turning off checksums for this sort of
"internally changed" image file thus becomes the easiest way to avoid
such issues and NOCOW is the way this type of file usage pattern is
conveyed to the filesystem. Mixing compression and internal-writes is
another problematic situation, so turning that off for NOCOW files also
makes sense.
> * Does NOCOW increase risk of data loss/corruption? (I guess yes since
> overwriting in place puts data at risk of power failure or drive
> failure.)
Absolutely, for that file at least. The loss of data checksumming means
loss of that normally important data integrity check as well, tho at the
same time it's actually safer in some ways since you don't have the
filesystem checksums fighting and racing with the internal file updates,
the source of the now fixed systemd journal triggered bug mentioned above.
However, NOCOW on a large and very frequently internally changed file
arguably makes other data/metadata on the filesystem safer. The very
frequent changes are now contained and isolated to their own unchanging
location on the filesystem, with no constantly changing, partially
shared extent-tracking metadata and data-checksum records to keep
updated at the same time, and thus no possibility of endangering the
other files sharing the same metadata records.
[1] Btrfs mount options: They aren't yet documented in the mount manpage,
and mount doesn't ship with btrfs-progs, so there's no manpage
documentation for them there either. That leaves the kernel's btrfs.txt
file and the wiki as the only good places to look up btrfs-specific
mount options:
https://btrfs.wiki.kernel.org/index.php/Mount_options
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: [virt-devel] btrfs NOCOW for VM disk images
From: Chris Murphy @ 2013-11-22 22:12 UTC (permalink / raw)
To: John Dulaney; +Cc: linux-btrfs
On Nov 22, 2013, at 9:17 AM, John Dulaney <jdulaney@redhat.com> wrote:
>
> In upstream QEMU we're discussing patches that set the NOCOW flag on
> disk image files. We're told that this increases btrfs performance
> greatly since the file system will modify data in-place like ext4/xfs.
The best-performing qemu/kvm results I have, using a Fedora 20 installation as the benchmark and anaconda's timestamps for the start and completion of the installation, are with Btrfs on the host, a preallocated Raw file with the +C file attribute set, and Btrfs in the guest. The test matrix is 3x3: ext4, XFS, and Btrfs, each used on the host and in the guest.
By "best performing" we're talking about maybe 20-30 seconds better over a 7-8 minute install time on spinning rust. So with respect to installing an OS (the live image uses rsync), it seems Btrfs on Btrfs is at least no worse off than other file systems.
A 20GB preallocated Raw on Btrfs with +C set has 33 extents, and that count never changes.
When I do this with a qcow2 file with preallocated metadata, it starts out with only 5 extents upon creation, but with each successive installation using the same qcow2 file, also with +C set, the extent count grows quite a bit. It's very unclear from the testing whether this negatively impacts performance, or whether the extent increase eventually flattens out.
after installation1> fedoratest.img: 1255 extents found
after installation2> fedoratest.img: 1773 extents found
after installation3> fedoratest.img: 2148 extents found
after installation4> fedoratest.img: 2245 extents found
This is a whole lot less, however, than a non-preallocated Raw file without +C, which rapidly ends up with tens of thousands of extents with no end in sight.
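As a quick way to look at whether the qcow2 growth is tapering, here is a small Python sketch that pulls the counts out of report lines in the format quoted above and prints the increase per installation. The function names are made up for this example; the numbers are only the four data points already given, so it says nothing about longer-term behavior.

```python
import re

# The four report lines quoted in this message.
REPORT = """\
after installation1> fedoratest.img: 1255 extents found
after installation2> fedoratest.img: 1773 extents found
after installation3> fedoratest.img: 2148 extents found
after installation4> fedoratest.img: 2245 extents found
"""


def extent_counts(text):
    """Extract the extent count from each 'N extents found' line."""
    return [int(n) for n in re.findall(r":\s*(\d+) extents? found", text)]


def growth(counts):
    """Increase in extent count from each run to the next."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


counts = extent_counts(REPORT)
print(counts)          # [1255, 1773, 2148, 2245]
print(growth(counts))  # [518, 375, 97]
```

The per-installation increase shrinks from 518 to 375 to 97 over these runs, which fits the idea that allocation tapers off, though four installs are too few to conclude much.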
> This leaves us with the choice between performance (NOCOW) and snapshots
> (default). Both are important for VM disk images!
Some testing needs to be done with qcow2 on Btrfs with +C long term to see if there's a meaningful performance hit as the qcow2 ages.
It may also be possible to defragment the qcow2 file once extent allocation tapers off.
And another possibility would be for qcow2 to support full preallocation, so that its initial extent count is no worse than Raw.
If you don't need host-based snapshotting, another possibility is using Btrfs in the guest, and snapshotting within the guest. Whether this is preferred depends on the use case, but I think there could be some advantages to snapshotting within the guest. In this case, using Btrfs in the guest, regardless of the backing method used, gives the guest the ability to at least flag fs/data corruption, if not repair it (if a raid 1+ data profile is employed).
Chris Murphy