* Re: BTRFS: Unbelievably slow with kvm/qemu
@ 2010-08-29 19:34 Tomasz Chmielewski
  2010-08-30  0:14 ` Josef Bacik
From: Tomasz Chmielewski @ 2010-08-29 19:34 UTC
  To: linux-kernel, linux-btrfs
  Cc: hch, gg.mariotti, Justin P. Mattock, mjt, josef, tytso

Christoph Hellwig wrote:

> There are a lot of variables when using qemu.
>
> The most important ones are:
>
>  - the cache mode on the device.  The default is cache=writethrough,
>    which is not quite optimal.  You generally do want to use cache=none
>    which uses O_DIRECT in qemu.
>  - if the backing image is sparse or not.
>  - if you use barrier - both in the host and the guest.
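
For illustration, a qemu invocation using the settings Christoph
describes might look like the following (the image path, memory size,
and virtio use are hypothetical); cache=none makes qemu open the image
with O_DIRECT, bypassing the host page cache:

    qemu-system-x86_64 \
        -m 1024 \
        -drive file=/var/lib/kvm/guest.qcow2,if=virtio,cache=none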

I noticed that when btrfs is mounted with default options, writing 
e.g. 10 GB in the KVM guest using a qcow2 image results in 20 GB written 
on the host (as measured with "iostat -m -p").


With ext4 (or btrfs mounted with nodatacow), a 10 GB write in the guest 
produces a 10 GB write on the host.
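
A sketch of the nodatacow mount used for the comparison above (device
and mount point are hypothetical); note that nodatacow also disables
data checksumming and compression, trading btrfs features for the more
predictable write pattern:

    mount -t btrfs -o nodatacow /dev/sdb1 /var/lib/kvm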


-- 
Tomasz Chmielewski
http://wpkg.org



* Re: BTRFS: Unbelievably slow with kvm/qemu
  2010-08-29 19:34 BTRFS: Unbelievably slow with kvm/qemu Tomasz Chmielewski
@ 2010-08-30  0:14 ` Josef Bacik
  2010-08-30 15:59   ` K. Richard Pixley
From: Josef Bacik @ 2010-08-30  0:14 UTC
  To: Tomasz Chmielewski
  Cc: linux-kernel, linux-btrfs, hch, gg.mariotti, Justin P. Mattock,
	mjt, josef, tytso

On Sun, Aug 29, 2010 at 09:34:29PM +0200, Tomasz Chmielewski wrote:
> Christoph Hellwig wrote:
>
>> There are a lot of variables when using qemu.
>>
>> The most important ones are:
>>
>>  - the cache mode on the device.  The default is cache=writethrough,
>>    which is not quite optimal.  You generally do want to use cache=none
>>    which uses O_DIRECT in qemu.
>>  - if the backing image is sparse or not.
>>  - if you use barrier - both in the host and the guest.
>
> I noticed that when btrfs is mounted with default options, writing
> e.g. 10 GB in the KVM guest using a qcow2 image results in 20 GB written
> on the host (as measured with "iostat -m -p").
>
>
> With ext4 (or btrfs mounted with nodatacow), a 10 GB write in the guest
> produces a 10 GB write on the host.
>

Whoa, 20 GB?  That doesn't sound right; COW should just mean we get quite a bit
of fragmentation, not write everything twice.  What exactly is qemu doing?  Thanks,

Josef


* Re: BTRFS: Unbelievably slow with kvm/qemu
  2010-08-30  0:14 ` Josef Bacik
@ 2010-08-30 15:59   ` K. Richard Pixley
  2010-08-31 21:46     ` Mike Fedyk
From: K. Richard Pixley @ 2010-08-30 15:59 UTC
  To: Josef Bacik
  Cc: Tomasz Chmielewski, linux-kernel, linux-btrfs, hch, gg.mariotti,
	Justin P. Mattock, mjt, tytso

  On 8/29/10 17:14 , Josef Bacik wrote:
> On Sun, Aug 29, 2010 at 09:34:29PM +0200, Tomasz Chmielewski wrote:
>> Christoph Hellwig wrote:
>>> There are a lot of variables when using qemu.
>>>
>>> The most important ones are:
>>>
>>>   - the cache mode on the device.  The default is cache=writethrough,
>>>     which is not quite optimal.  You generally do want to use cache=none
>>>     which uses O_DIRECT in qemu.
>>>   - if the backing image is sparse or not.
>>>   - if you use barrier - both in the host and the guest.
>> I noticed that when btrfs is mounted with default options, writing
>> e.g. 10 GB in the KVM guest using a qcow2 image results in 20 GB written
>> on the host (as measured with "iostat -m -p").
>>
>> With ext4 (or btrfs mounted with nodatacow), a 10 GB write in the guest
>> produces a 10 GB write on the host.
> Whoa, 20 GB?  That doesn't sound right; COW should just mean we get quite a bit
> of fragmentation, not write everything twice.  What exactly is qemu doing?  Thanks,
Make sure you build your file system with "mkfs.btrfs -m single -d 
single /dev/whatever".  You may well be writing duplicate copies of 
everything.
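
For reference, a sketch of the commands involved (device and mount point
hypothetical); "btrfs filesystem df" is one way to check which
allocation profiles a file system is actually using:

    # single (non-duplicated) metadata and data, chosen at mkfs time
    mkfs.btrfs -m single -d single /dev/sdb1

    # inspect the resulting data/metadata profiles
    btrfs filesystem df /mnt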

--rich


* Re: BTRFS: Unbelievably slow with kvm/qemu
  2010-08-30 15:59   ` K. Richard Pixley
@ 2010-08-31 21:46     ` Mike Fedyk
  2010-08-31 22:01       ` K. Richard Pixley
       [not found]       ` <4C7D7B14.9020008@noir.com>
From: Mike Fedyk @ 2010-08-31 21:46 UTC
  To: K. Richard Pixley
  Cc: Josef Bacik, Tomasz Chmielewski, linux-kernel, linux-btrfs, hch,
	gg.mariotti, Justin P. Mattock, mjt, tytso

On Mon, Aug 30, 2010 at 8:59 AM, K. Richard Pixley <rich@noir.com> wrote:
> [...]
> Make sure you build your file system with "mkfs.btrfs -m single -d single
> /dev/whatever".  You may well be writing duplicate copies of everything.

There is little reason not to use duplicate metadata.  Only small
files (less than 2 KB) get stored inline in the metadata tree, so there
is no need to worry about disk images being duplicated unless data
duplication was also set at mkfs time.


* Re: BTRFS: Unbelievably slow with kvm/qemu
  2010-08-31 21:46     ` Mike Fedyk
@ 2010-08-31 22:01       ` K. Richard Pixley
       [not found]       ` <4C7D7B14.9020008@noir.com>
From: K. Richard Pixley @ 2010-08-31 22:01 UTC
  To: Mike Fedyk
  Cc: Josef Bacik, Tomasz Chmielewski, linux-kernel, linux-btrfs, hch,
	gg.mariotti, Justin P. Mattock, mjt, tytso

On 20100831 14:46, Mike Fedyk wrote:
 > There is little reason not to use duplicate metadata.  Only small
 > files (less than 2 KB) get stored inline in the metadata tree, so there
 > is no need to worry about disk images being duplicated unless data
 > duplication was also set at mkfs time.

My benchmarks show that for my kinds of data, btrfs is somewhat slower 
than ext4 (which is slightly slower than ext3, which is somewhat slower 
than ext2) when using the defaults (i.e., duplicate metadata).

It's a hair faster than ext2 (the fastest of the ext family) when 
using singleton metadata.  And ext2 isn't even crash-resistant, while 
btrfs has snapshots.

I'm using hardware raid for striping speed.  (I tried btrfs striping; it 
was close, but not as fast on my hardware.)  I want speed, speed, speed. 
My data is only vaguely important (continuous builders), but speed is 
everything.
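
For comparison, btrfs-native striping of the sort mentioned above would
be set up along these lines (device names hypothetical); raid0 stripes
data and metadata across both devices with no redundancy:

    mkfs.btrfs -d raid0 -m raid0 /dev/sdb /dev/sdc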

While the reason to use singleton metadata may be "little", it dominates 
my application.  If I were forced to use duplicate metadata, I'd 
still be arguing with my coworkers about whether the speed costs were 
worth it to buy snapshot functionality.  But the fact that btrfs is 
faster AND provides snapshots (and less metadata overhead, bigger 
file systems, etc.) makes for an easy sale.

Note that nilfs2 has similar performance, but somewhat different 
snapshot characteristics that aren't as useful in my current application.

--rich


* Re: BTRFS: Unbelievably slow with kvm/qemu
       [not found]       ` <4C7D7B14.9020008@noir.com>
@ 2010-09-02  0:18         ` Ted Ts'o
  2010-09-02 16:36           ` K. Richard Pixley
       [not found]           ` <4C7FD2AA.8090302@noir.com>
From: Ted Ts'o @ 2010-09-02  0:18 UTC
  To: K. Richard Pixley
  Cc: Mike Fedyk, Josef Bacik, Tomasz Chmielewski, linux-kernel,
	linux-btrfs, hch, gg.mariotti, Justin P. Mattock, mjt

On Tue, Aug 31, 2010 at 02:58:44PM -0700, K. Richard Pixley wrote:
>  On 20100831 14:46, Mike Fedyk wrote:
> >There is little reason not to use duplicate metadata.  Only small
> >files (less than 2 KB) get stored inline in the metadata tree, so there
> >is no need to worry about disk images being duplicated unless data
> >duplication was also set at mkfs time.
> My benchmarks show that for my kinds of data, btrfs is somewhat
> slower than ext4 (which is slightly slower than ext3, which is
> somewhat slower than ext2) when using the defaults (i.e., duplicate
> metadata).
> 
> It's a hair faster than ext2 (the fastest of the ext family) when
> using singleton metadata.  And ext2 isn't even crash-resistant, while
> btrfs has snapshots.

I'm really, really curious.  Can you describe your data and your
workload in detail?  You mentioned "continuous builders"; is this some
kind of tinderbox setup?

						- Ted


* Re: BTRFS: Unbelievably slow with kvm/qemu
  2010-09-02  0:18         ` Ted Ts'o
@ 2010-09-02 16:36           ` K. Richard Pixley
       [not found]           ` <4C7FD2AA.8090302@noir.com>
From: K. Richard Pixley @ 2010-09-02 16:36 UTC
  To: Ted Ts'o, Mike Fedyk, Josef Bacik, Tomasz Chmielewski,
	linux-kernel, linux-btrfs

  On 9/1/10 17:18 , Ted Ts'o wrote:
> On Tue, Aug 31, 2010 at 02:58:44PM -0700, K. Richard Pixley wrote:
>> [...]
> I'm really, really curious.  Can you describe your data and your
> workload in detail?  You mentioned "continuous builders"; is this some
> kind of tinderbox setup?
I'm not familiar with tinderbox.  Continuous builders tend to be a lot 
like shell scripts - it's usually easier to write a new one than to even 
bother to read someone else's.  :)

Basically, it's an automated system that started out life as a shell 
script loop around a build a few years ago.  The current rendition 
includes a number of extra features.  The basic idea here is to expose 
top-of-tree build errors as fast as possible, which means that these 
machines can take some build shortcuts that would not be appropriate for 
official builds intended as release candidates.  We have a different set 
of builders which build release candidates.

When it starts, it removes as many snapshots as it needs to in order to 
make space for another build.  Initially it creates a snapshot from 
/home, checks out source, and does a full build of top of tree.  Then it 
starts over.  If it already has a build and is not at top of tree, it 
creates a snapshot from the last successful build, updates, and does an 
incremental build.  When it reaches top of tree, it starts taking requests.
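
A minimal sketch of that snapshot cycle (the subvolume paths, snapshot
names, and update command are hypothetical):

    # discard an old snapshot to make room
    btrfs subvolume delete /builds/build-042

    # branch a writable snapshot from the last successful build
    btrfs subvolume snapshot /builds/last-good /builds/build-057

    # update the source and build incrementally inside the new snapshot
    cd /builds/build-057 && ./update-source && make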

We're using openembedded, so the build is largely based on components 
with a global "BOM" (bill of materials) acting as a code-based 
database of which versions of which components are in use for which 
images.  This acts as a funneling point.  Requests are a specification 
of a list of components to change (different versions, etc.).  A 
snapshot is taken from the last successful build, the BOM is changed 
locally, and an incremental build is run.  If everything builds alright, 
the new BOM may be committed and/or the resulting binary packages may be 
published for QA consumption.  But even in the case of failure, this 
snapshot is terminal and never marked as "successful", so it is never reused.

The system acts both as a continuous builder to check top of tree and 
as an automated method for serializing changes (which stands in for 
real, human integration).

We currently have about 20 of these servers, ranging from 2-24 cores, 
4-24 GB of memory, etc.  A single device build takes about 22 GB, so a 
24 GB machine can do an entire build in memory.  The different machines 
run similar builds against different branches or against different 
targets, and the staggering tends to create a lower response time in the 
case of top-of-tree build errors that affect all devices (the most 
common type of error).  And most of the servers are cast-offs, older 
servers that would be discarded otherwise.  Server speed tends to be an 
issue primarily for the full builds.  Once the full build has been 
created, the incrementals tend to be limited to single threading, as the 
build spends most of its time doing dependency rechecking.

The snapshot-based approach is recent, as is our btrfs usage (which is 
currently problematic: polluted file systems, kernel crashes, etc.).  
Previously I was using rsync to back up a copy of a full build and rsync 
to replace it when a build failed.  The working directory was the same 
working directory, and I went to some pains to make it reusable.  I've 
been looking for a snapshotting facility for a couple of years now but 
only discovered btrfs recently.  (I tried lvm-based snapshots, but they 
don't really have the characteristics that I want, nor do nilfs2 snapshots.)

Is that what you were looking for?

--rich


* Re: BTRFS: Unbelievably slow with kvm/qemu
       [not found]           ` <4C7FD2AA.8090302@noir.com>
@ 2010-09-02 16:49             ` K. Richard Pixley
From: K. Richard Pixley @ 2010-09-02 16:49 UTC
  To: Ted Ts'o, Mike Fedyk, Josef Bacik, Tomasz Chmielewski,
	linux-kernel, linux-btrfs

  On 9/2/10 09:36 , K. Richard Pixley wrote:
> [...]
>
> Is that what you were looking for?
I should probably mention times and targets.

A typical 2-core, 4 GB developer workstation can build our entire system 
for one device in about 6-8 hours.  We typically build each device on a 
separate server, and the highest-end servers we're using today (8-24 
cores, 24 GB of memory) can build a single device in a little under an 
hour.  Those are full build times.  A complete cycle of an incremental 
builder (doing nothing but bookkeeping and checking dependencies) can 
take anywhere from about 2-4 minutes.  And a typical single-component 
update usually takes 4-6 minutes.

From a developer's perspective, I'm churning out 8-hour builds every 5 
minutes or so.  What snapshots provide primarily is the ability to 
discard a polluted/broken working directory while retaining the ability 
to reuse its immediate predecessor.  It's also true that snapshots 
leave old working directories lying around where they could be examined 
or debugged, but generally that facility is rarely used because it's too 
much trouble to give developers access to those machines.

The target here is an openembedded-based embedded Linux system.

--rich

