From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: BTRFS as image store for KVM?
Date: Wed, 16 Sep 2015 03:57:12 +0000 (UTC)
Message-ID: <pan$a7c37$760f233d$6b18271c$f9ac2029@cox.net>
In-Reply-To: 55F88ECC.1040604@menke.ac
Gert Menke posted on Tue, 15 Sep 2015 23:34:04 +0200 as excerpted:
> I'm not 100% sure if this is the right place to ask[.]
It is. =:^)
> I want to build a virtualization server to replace my current home
> server. I'm thinking about a Debian system with libvirt/KVM. The system
> will have one or two SSDs and five harddisks with some kind of software
> RAID5 for storage. I'd like to have a filesystem with data checksums, so
> BTRFS seems like the right way to go. However, I read that BTRFS does
> not perform well as storage for KVM disk images.
> (See here: http://www.linux-kvm.org/page/Tuning_KVM )
>
> Is this still true?
>
> I would appreciate any comments and/or tips you might have on this
> topic.
>
> Is anyone using BTRFS as an image store? Are there any special settings
> I should be aware of to make it work well?
Looks like you're doing some solid research before you deploy. =:^)
Here's the deal. The problem is fragmentation, which is much more of an
issue on spinning rust than it typically is on ssds, since ssds have
effectively zero seek-time. If you can put the VMs on those ssds you
mentioned, not on the spinning rust, the fragmentation won't matter so
much, and you may well not have to worry about it.
Any copy-on-write filesystem (which btrfs is) is going to have serious
problems with a file-internal-rewrite write pattern (as contrasted with
appending, or simply rewriting the entire thing sequentially, beginning
to end), because as various blocks are rewritten, they get written
elsewhere, worst-case one at a time, dramatically increasing
fragmentation -- hundreds of thousands of extents are not unheard-of
with files in the multi-GiB size range.[1]
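If you want to see how bad an existing image already is, filefrag (from
e2fsprogs) counts extents on btrfs too; the path below is just an
example, and note that with btrfs compression enabled the numbers are
inflated, since filefrag reports each (up to 128 KiB) compressed extent
separately:

  $ filefrag /var/lib/libvirt/images/guest.img
  /var/lib/libvirt/images/guest.img: 124192 extents found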
The two typical problematic cases are database files and VM images (your
case).
Btrfs has two possible solutions to work around the problem. The first
is the autodefrag mount option, which detects file fragmentation during
the write and queues up the affected file for a defragmenting rewrite
by a lower-priority worker thread. This works best on the small end,
because as file size increases, so does the time needed to rewrite it,
and at some point, depending on the size of the file and how busy the
database/VM is, writes come in faster than the file can be rewritten.
Typically there's no problem under a quarter GiB, with people beginning
to notice performance issues at half to 3/4 GiB, though on fast disks
with not-too-busy VMs/DBs (which may well include your home system,
depending on what you use the VMs for), you might not see problems
until size reaches 2 GiB or so. As such, autodefrag tends to be a very
good option for firefox's sqlite database files, for instance, as they
tend to be small enough not to have issues. But it's not going to work
so well for multi-GiB VM images.
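Since autodefrag is a mount option, enabling it is just a matter of the
mount commandline or fstab; the device and mountpoint below are
placeholders, and note that it applies to the whole filesystem, not to
individual files:

  # mount -o autodefrag /dev/sdb1 /mnt/vms

or, on an already-mounted btrfs:

  # mount -o remount,autodefrag /mnt/vms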
The second solution, more workaround than solution, for larger
internal-rewrite-pattern files, generally 1 GiB plus (so many VM
images), is the NOCOW file attribute (set with chattr +C), which tells
btrfs to rewrite the file in-place instead of using the usual
copy-on-write method. However, you're not going to like the side
effects: btrfs turns off both checksumming and transparent compression
on nocow files, because there are serious write-race issues between a
checksum and the in-place-rewritten data it covers, and of course the
rewritten data may compress better or worse than the old version, so
rewriting a compressed copy in-place is problematic as well.
So setting nocow turns off checksumming, the biggest reason you're
considering btrfs in the first place, likely making this option
effectively unworkable for you. =:^(
Which means btrfs itself likely isn't a particularly good choice,
UNLESS:

(a) your VM images are small (under a GiB, ideally under a quarter-gig,
admittedly a pretty small VM), OR

(b) your VMs are primarily reading, not writing, or aren't likely to be
busy enough for autodefrag to be a problem at their size, OR

(c) you put the VM images (and thus the btrfs containing them) on ssd,
not spinning rust.
Meanwhile, quickly tying up a couple loose ends with nocow in case you do
decide to use it for this or some other use-case:
a) On btrfs, setting nocow on a file that already contains data doesn't
work as expected (cow writes can continue to occur for some time).
Typically the easiest way to ensure that a file is nocow before it gets
data is to set nocow on its containing directory before the file is
created, so new files inherit the attribute. For existing files, set it
on the dir and copy the file in from a different filesystem (or move it
to, say, a tmpfs and back), so the file gets created with the nocow
attribute as it is copied in.
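Something like this, for instance (paths are placeholders; the C in
lsattr's output confirms the attribute took, and --reflink=never forces
a full copy so the new file really is created fresh):

  # mkdir /mnt/vms/images
  # chattr +C /mnt/vms/images
  # cp --reflink=never /oldfs/guest.img /mnt/vms/images/
  # lsattr /mnt/vms/images/guest.img
  ---------------C-- /mnt/vms/images/guest.img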
b) Btrfs' snapshot feature depends on COW, locking the existing version
of the file in place, which forces otherwise-nocow files into what I've
seen described as cow1 -- the first write to a file block cows it to a
new location, because the existing version is locked in place at the
old location. However, the file retains its nocow attribute, and
further writes to the same block rewrite that first-cowed location
instead of forcing further cows... until yet another snapshot locks the
new version in place once again. While this isn't too much of a problem
for the occasional snapshot, it does create problems for high-frequency
scheduled snapshotting, since the otherwise-nocow files will then be
cowing quite a lot anyway, and fragmenting as a result, because the
snapshots lock existing versions in place so often.
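So if you do snapshot a nocow image store, keep the frequency modest.
The snapshot itself is a single command; the paths below are
placeholders, and -r makes the snapshot read-only:

  # btrfs subvolume snapshot -r /mnt/vms /mnt/snapshots/vms-2015-09-16

Daily or weekly is usually tolerable for nocow files; every few minutes
rather defeats the point of nocow.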
Finally, as I said above, fragmentation doesn't affect ssds like it
does spinning rust (though it's still not ideal, since scheduling all
those individual accesses instead of fewer accesses to larger extents
does have a cost, and with sub-erase-block-size fragments there are
wear-leveling and write-cycle issues to consider), so you might not
have to worry about it at all if you put the btrfs, and thus the VMs it
contains, on ssd.
---
[1] Btrfs file blocks are the kernel memory page size, 4 KiB on x86
(32-bit or 64-bit), so there are 256 blocks per MiB and 1024 MiB per
GiB, thus 262,144 blocks per GiB. The theoretical worst-case
fragmentation, each block its own extent, is thus 262,144 extents per
GiB.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman