Experiences: Why BTRFS had to yield for ZFS

linux-btrfs.vger.kernel.org archive mirror

* Experiences: Why BTRFS had to yield for ZFS
@ 2012-09-17  8:45 Casper Bang
  2012-09-17  9:15 ` Ralf Hildebrandt
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Casper Bang @ 2012-09-17  8:45 UTC (permalink / raw)
  To: linux-btrfs

Abstract
For database testing purposes, a COW filesystem was needed in order to
facilitate snapshotting and rollback, so as to provide mirrors of
our production database at fixed intervals (every night and on
demand).

Platform
An HP Proliant 380P (2x Intel Xeon E5-2620 with 12 cores for a total
of 24 threads) with built-in Smart Array SAS/SATA (Gen8) controllers
was combined with 10x consumer Samsung 830 512GB SSD (SATAIII, 6Gb/s).
Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP
Tue Aug 28 13:03:31 EDT 2012 and Oracle database standard edition
10.2.0.4 64bit.

Setup
OS was installed on the first disk (sda) and the remaining 9 (sdb - sdj)
were pooled into some 4.4TB for holding Oracle datafiles. An
initial backup of the 1.5TB prod database would get restored as
a (shut down) sync instance on the test server on the COW filesystem.
A script on the test server would then apply Oracle archive files
from the production environment to this Oracle sync database every
10 minutes, effectively keeping it near up-to-date with production.
The most reliable way to do this was with a simple NFS mount (rather
than rsync or samba). The idea was that it would then be very fast
and easy to make a new snapshot of the sync database, start it up, and
voila, you'd have a new instance ready to play with. A desktop machine
with ext4 partitions established a lower bound for applying archivelog
data of around 1200 kb/s - we expected an order of magnitude higher
performance on the server.

BTRFS experiences
We used native BTRFS from the kernel, with atime off and ssd mode. BTRFS
proved to be very fast at reading for a large TRDBMS (2x speedup
compared to a SAN). However, applying archivelog on a BTRFS filesystem
scaled poorly: it started out with a decent apply rate but
eventually dropped to around 400-500 kb/s. BTRFS had to be
abandoned due to this, since the script would never finish
applying one archivelog before new ones arrived. The desktop machine
with traditional spinning drives formatted for BTRFS showed a similar
pattern, so hardware (server, controller and disks) was excluded as a
cause.

ZFS experiences
We then tried using ZFS via custom-built SPL/ZFS 0.6.0-rc10 modules
with recordsize equal to that of the Oracle database (8K); compression
off, quota off, dedup off, checksum on and atime on.
ZFS proved to be on par with a SAN when it comes to reading for a
large TRDBMS. Thankfully, ZFS did not degrade much in archivelog apply
performance, and proved to have a lower bound of 15MB/s.

Conclusion
We had hoped to be able to utilize BTRFS, due to its license and
inclusion in the Linux mainline kernel. However, for practical
purposes, we're not able to make use of BTRFS due to its performance
when writing, especially considering this is even without mixing in
snapshotting. While ZFS doesn't give us quite the boost in read
performance we had expected from SSDs, it seems more optimized for
writing and will allow us to complete our project of getting clones
of a production database environment up and running in a snap.

Take it for what it's worth: a couple of developers' experiences with
BTRFS. We are not likely to go back and change things now that it works,
but we are curious as to why we see such big differences between the
two file-systems. Any comments and/or feedback appreciated.

Regards,
Jesper and Casper


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  8:45 Experiences: Why BTRFS had to yield for ZFS Casper Bang
@ 2012-09-17  9:15 ` Ralf Hildebrandt
  2012-09-17  9:55   ` Casper Bnag
  2012-09-18  5:28 ` Anand Jain
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Ralf Hildebrandt @ 2012-09-17  9:15 UTC (permalink / raw)
  To: linux-btrfs

* Casper Bang <casper.bang@gmail.com>:

> Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP

And the btrfs was that from vanilla 2.6.39 (i.e. over a year old)?

-- 
Ralf Hildebrandt                   Charite Universitätsmedizin Berlin
ralf.hildebrandt@charite.de        Campus Benjamin Franklin
http://www.charite.de              Hindenburgdamm 30, 12203 Berlin
Geschäftsbereich IT, Abt. Netzwerk fon: +49-30-450.570.155


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  9:15 ` Ralf Hildebrandt
@ 2012-09-17  9:55   ` Casper Bnag
  2012-09-17 10:05     ` Avi Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Casper Bnag @ 2012-09-17  9:55 UTC (permalink / raw)
  To: linux-btrfs

Ralf Hildebrandt <Ralf.Hildebrandt <at> charite.de> writes:

> 
> * Casper Bang <casper.bang <at> gmail.com>:
> 
> > Oracle (Unbreakable) Linux x64 2.6.39-200.29.3.el6uek.x86_64 #1 SMP
> 
> And the btrfs was that from vanilla 2.6.39 (i.e. over a year old)?
> 

We're using the latest available kernel for our Oracle Unbreakable 
Linux 6.3 from Aug 28. We have no other option, since the Oracle database
software needs to run on a certified distro. I have no idea how to check 
the version Oracle actually compiles with, only the tools package has 
easy-to-grasp version info.

In any event, I would think it unlikely that the performance differences 
we see are the result of missing performance tweaks - we're talking an order 
of magnitude here and that smells of a design difference.


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  9:55   ` Casper Bnag
@ 2012-09-17 10:05     ` Avi Miller
  2012-09-17 10:47       ` Casper Bnag
  2012-09-18 16:48       ` Andrew McGlashan
  0 siblings, 2 replies; 17+ messages in thread
From: Avi Miller @ 2012-09-17 10:05 UTC (permalink / raw)
  To: Casper Bnag; +Cc: linux-btrfs

Hi,

On 17/09/2012, at 7:55 PM, Casper Bnag <casper.bang@gmail.com> wrote:

> We're using the latest available kernel for our Oracle Unbreakable 
> Linux 6.3 from Aug 28. We have no other option, since the Oracle database
> software needs to run on a certified distro. 

Oracle Database is not certified to run on either btrfs or ZFS on Linux, so if certification is an issue, you can't use either filesystem. Out of interest, have you done a performance benchmark with ASM using ASMlib on the same platform? 

--
Oracle <http://www.oracle.com>
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia

* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17 10:05     ` Avi Miller
@ 2012-09-17 10:47       ` Casper Bnag
  2012-09-17 10:58         ` Avi Miller
  2012-09-18 16:48       ` Andrew McGlashan
  1 sibling, 1 reply; 17+ messages in thread
From: Casper Bnag @ 2012-09-17 10:47 UTC (permalink / raw)
  To: linux-btrfs

> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so if
> certification is an issue, you can't use either filesystem.

Right, I had missed that - only ZFS on Solaris is officially supported I 
suppose. We had to draw the line somewhere, and an Oracle OS with an Oracle 
database with an Oracle filesystem seemed like a good platform. If the BTRFS 
pieces are indeed a year old in the latest official binary kernel from last 
month, that just makes me wonder why Oracle didn't use these latest bits. 
Again, I'm inclined to think we're dealing with a design difference between 
ZFS and BTRFS rather than a missing performance optimization. You'd know that 
better than I. :)

> Out of interest, have you done a performance benchmark with ASM using ASMlib
> on the same platform? 

Sorry, no. Our experience with ASM is limited, we came to the conclusion once
that we like being able to handle the files in a plain mountable file-system.


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17 10:47       ` Casper Bnag
@ 2012-09-17 10:58         ` Avi Miller
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Miller @ 2012-09-17 10:58 UTC (permalink / raw)
  To: Casper Bnag; +Cc: linux-btrfs

Hi,

On 17/09/2012, at 8:47 PM, Casper Bnag <casper.bang@gmail.com> wrote:

> month, that just makes me wonder why Oracle didn't use these latest bits. 

We used the most stable release of btrfs that was available when the development of the UEK was done. Keep in mind that while it's versioned at 2.6.39, it's actually 3.0.16 under the hood. It's just that some userspace doesn't like having a kernel version that doesn't start with "2.6"

>> Out of interest, have you done a performance benchmark with ASM using ASMlib
>> on the same platform? 
> 
> Sorry, no. Our experience with ASM is limited, we came to the conclusion once
> that we like being able to handle the files in a plain mountable file-system.

Perhaps, but ASM would provide all the functionality you require, including snapshots and rollback, at the highest possible performance. Certainly a lot higher than both ZFS and btrfs. And it's fully certified and supported by Oracle.

As an alternative, why not consider using Oracle VM on the machine and creating database VMs instead? You can then use the snapshot capability of Oracle VM while still running supported and certified filesystems inside each guest.

(We should also probably take this discussion off-list, as it has drifted away from btrfs proper). Feel free to reply to me directly if you want.

--
Oracle <http://www.oracle.com>
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  8:45 Experiences: Why BTRFS had to yield for ZFS Casper Bang
  2012-09-17  9:15 ` Ralf Hildebrandt
@ 2012-09-18  5:28 ` Anand Jain
  2012-09-19  7:28   ` Casper Bang
  2012-09-18 23:08 ` Gregory Farnum
  2012-09-19 15:25 ` Chris Mason
  3 siblings, 1 reply; 17+ messages in thread
From: Anand Jain @ 2012-09-18  5:28 UTC (permalink / raw)
  To: Casper Bang; +Cc: linux-btrfs



> A script on the test server, would then apply Oracle archive files
> from the production environment to this Oracle sync database, every
> 10'th minute, effectively making it near up-to-date with production.

> The most reliable way to do this was with a simple NFS mount (rather
> than rsync or samba). The idea then was, that it would be very fast
> and easy to make a new snapshot of the sync database, start it up, and
> voila you'd have a new instance ready to play with. A desktop machine


  archive-log-apply script - if you could, can you share the
  script itself ? or provide more details about the script.
  (It will help to understand the work-load in question).

Thanks, Anand


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17 10:05     ` Avi Miller
  2012-09-17 10:47       ` Casper Bnag
@ 2012-09-18 16:48       ` Andrew McGlashan
  2012-09-18 21:46         ` Avi Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew McGlashan @ 2012-09-18 16:48 UTC (permalink / raw)
  To: Avi Miller; +Cc: Casper Bnag, linux-btrfs

Hi,

On 17/09/2012 8:05 PM, Avi Miller wrote:
> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so if certification is an issue, you can't use either filesystem. Out of interest, have you done a performance benchmark with ASM using ASMlib on the same platform? 

I thought that Oracle considered BTRFS to be production ready.  It
surprises me that running an Oracle database on BTRFS is not a supported
configuration.

Cheers

-- 
Kind Regards
AndrewM

Andrew McGlashan
Broadband Solutions now including VoIP

Current Land Line No: 03 9012 2102
Mobile: 04 2574 1827 Fax: 03 9012 2178

National No: 1300 85 3804

Affinity Vision Australia Pty Ltd
http://affinityvision.com.au
http://securemywireless.com.au
http://adsl2choice.net.au

In Case of Emergency --  http://affinityvision.com.au/ice.html


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-18 16:48       ` Andrew McGlashan
@ 2012-09-18 21:46         ` Avi Miller
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Miller @ 2012-09-18 21:46 UTC (permalink / raw)
  To: Andrew McGlashan; +Cc: Casper Bnag, linux-btrfs

Hi,

On 19/09/2012, at 2:48 AM, Andrew McGlashan <andrew.mcglashan@affinityvision.com.au> wrote:

> On 17/09/2012 8:05 PM, Avi Miller wrote:
>> Oracle Database is not certified to run on either btrfs or ZFS on Linux, so if certification is an issue, you can't use either filesystem. Out of interest, have you done a performance benchmark with ASM using ASMlib on the same platform? 
> 
> I thought that Oracle considered BTRFS to be production ready.  It
> surprises me that running an Oracle database on BTRFS is not a supported
> configuration.

The Oracle Linux team considers btrfs production-ready and we support it for production purposes for customers. However, we have nothing to do with Database and their certification process, and the Database (and other) product teams have not certified it for use with their products yet. This is also why product certification lags: we have nothing to do with individual product certification processes on various operating systems/platforms. 

So, while I'm aware that the database team is planning to certify btrfs "at some point", I suspect with Oracle OpenWorld coming up in a few weeks time, they have other things on their plate right now. :)

--
Oracle <http://www.oracle.com>
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  8:45 Experiences: Why BTRFS had to yield for ZFS Casper Bang
  2012-09-17  9:15 ` Ralf Hildebrandt
  2012-09-18  5:28 ` Anand Jain
@ 2012-09-18 23:08 ` Gregory Farnum
  2012-09-19 15:25 ` Chris Mason
  3 siblings, 0 replies; 17+ messages in thread
From: Gregory Farnum @ 2012-09-18 23:08 UTC (permalink / raw)
  To: Casper Bang; +Cc: linux-btrfs

On Mon, Sep 17, 2012 at 1:45 AM, Casper Bang <casper.bang@gmail.com> wrote:
> BTRFS experiences
> We used native BTRFS from kernel; with atime off, ssd mode. BTRFS
> proved to be very fast at reading for a large TRDBMS (2x speedup
> compared to a SAN). However, applying archivelog on a BTRFS filesystem
> proved to scale poorly, by starting out with a decent apply rate it
> would eventually end down around 400-500 kb/s. BTRFS had to be
> abandoned due to this, since the script would never be able to finish
> applying archivelog as new ones arrived. The desktop machine with
> traditional spinning drives formatted for BTRFS showed a similar
> scenario, so hardware (server, controller and disks) was excluded as a
> cause.

Can you talk more about this decent apply rate ending up down at
400-500kb/s? We've been seeing degrading performance in our workloads
but thought it was due to snapshot abuse. (ie, large writes start out
at say 110MB/s and get slower the longer we run it — though we've
never run it long enough to go slower than about half starting speed.)



* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-18  5:28 ` Anand Jain
@ 2012-09-19  7:28   ` Casper Bang
  2012-09-19  7:36     ` Fajar A. Nugraha
  0 siblings, 1 reply; 17+ messages in thread
From: Casper Bang @ 2012-09-19  7:28 UTC (permalink / raw)
  To: linux-btrfs

> Anand Jain <Anand.Jain <at> oracle.com> writes:
>   archive-log-apply script - if you could, can you share the
>   script itself ? or provide more details about the script.
>   (It will help to understand the work-load in question).

Our setup entails a whole bunch of scripts, but the apply script looks like this 
(orion is the production environment, pandium is the shadow):
http://pastebin.com/k4T7deap

The script invokes rman passing rman_recover_database.rcs:

connect target /
run {
    crosscheck archivelog all;
    delete noprompt expired archivelog all;
    catalog start with '/backup/oracle/flash_recovery_area/FROM_PROD/archivelog' noprompt;
    recover database;
}

We receive a 1GB archivelog roughly every 20 minutes, depending on the 
workload of the production environment. Apply rate starts out fine with btrfs > 
ext4 > zfs, but ends up with ZFS > ext4 > btrfs. The following numbers are from 
our consumer spinning-platter disk test, but they are equally representative of 
the SSD numbers we got.

Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around a 
factor 2.2.

ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around a 
factor 4.4.

Btrfs starts out with a realtime to SCN ratio of about 2.2 and ends down around 
a factor 0.8. This of course means we will never be able to catch up with 
production, as btrfs can't apply these as fast as they're created.

It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime 
work would end up taking some 5h to get applied (factor 0.06), obviously useless 
to us.

I should point out that during this process we also had to move some large 
backup sets around, and several times we saw btrfs eating massive IO without 
ever finishing a simple mv command.

I'm inclined to believe we've found some weak corner, perhaps in combination 
with SSDs - but it led us to compare with ext4 and ZFS, and to choose ZFS over 
btrfs for this, as ZFS solves our problem.
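
The ratios above can be checked with a little arithmetic. A minimal Python
sketch, assuming that "realtime to SCN ratio" means minutes of production
work recovered per minute of apply time (one reading of the numbers, not a
definition given in this thread) and that 1GB is 1024MB; the function names
are made up for illustration:

```python
def apply_ratio(production_minutes, apply_minutes):
    """Minutes of production work recovered per minute spent applying."""
    return production_minutes / apply_minutes


def keeps_up(ratio):
    """The shadow database only catches up if it applies at least as fast
    as production generates archivelog, i.e. ratio >= 1."""
    return ratio >= 1.0


# End-of-run ratios quoted above for the spinning-disk test.
end_ratios = {"ext4": 2.2, "zfs": 4.4, "btrfs": 0.8}

if __name__ == "__main__":
    for fs, ratio in end_ratios.items():
        status = "keeps up" if keeps_up(ratio) else "falls behind"
        print(f"{fs}: {ratio} -> {status}")

    # 10xSSD server: 20 min of production work took some 5 h to apply.
    print(f"ssd btrfs ratio: {apply_ratio(20, 5 * 60):.3f}")

    # Average apply throughput needed to break even, assuming one 1GB
    # archivelog every 20 minutes (1GB taken as 1024MB).
    print(f"break-even rate: {1024 / (20 * 60):.2f} MB/s")
```

Under these assumptions 20 minutes against 5 hours gives roughly 0.06 to 0.07,
in line with the factor quoted above, and the break-even apply rate is a bit
under 1 MB/s on average.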


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-19  7:28   ` Casper Bang
@ 2012-09-19  7:36     ` Fajar A. Nugraha
  2012-09-19  8:09       ` Casper Bang
  0 siblings, 1 reply; 17+ messages in thread
From: Fajar A. Nugraha @ 2012-09-19  7:36 UTC (permalink / raw)
  To: Casper Bang; +Cc: linux-btrfs

On Wed, Sep 19, 2012 at 2:28 PM, Casper Bang <casper.bang@gmail.com> wrote:
>> Anand Jain <Anand.Jain <at> oracle.com> writes:
>>   archive-log-apply script - if you could, can you share the
>>   script itself ? or provide more details about the script.
>>   (It will help to understand the work-load in question).
>
> Our setup entails a whole bunch of scripts, but the apply script looks like this
> (orion is the production environment, pandium is the shadow):
> http://pastebin.com/k4T7deap
>
> The script invokes rman passing rman_recover_database.rcs:

IIRC there were some patches post-3.0 which relate to sync. If oracle
db uses sync writes (or calls sync somewhere, which it should), it
might help to re-run the test with a more recent kernel. The kernel-ml
repository might help.

> Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down around a
> factor 2.2.
>
> ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down around a
> factor 4.4.

So zfsonlinux is actually faster than ext4 for that purpose? coool !

>
> Btrfs starts out with a realtime to SCN ratio of about 2.2 and ends down around
> a factor 0.8. This of course means we will never be able to catch up with
> production, as btrfs can't apply these as fast as they're created.
>
> It was even worse with btrfs on our 10xSSD server, where 20 min. of realtime
> work would end up taking some 5h to get applied (factor 0.06), obviously useless
> to us.

Just wondering, did you use "discard" option by any chance? In my
experience it makes btrfs MUCH slower.

-- 
Fajar


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-19  7:36     ` Fajar A. Nugraha
@ 2012-09-19  8:09       ` Casper Bang
  0 siblings, 0 replies; 17+ messages in thread
From: Casper Bang @ 2012-09-19  8:09 UTC (permalink / raw)
  To: linux-btrfs

> IIRC there were some patches post-3.0 which relates to sync. If oracle
> db uses sync writes (or call sync somewhere, which it should), it
> might help to re-run the test with more recent kernel. kernel-ml
> repository might help.

Yeah, there doesn't seem to be a shortage of patches coming into btrfs
(just looking around the mailing-list) so that doesn't surprise me. 
Indeed, reading about race conditions, deadlocks and locks being held too 
long does not serve to promote btrfs as particularly production-ready.

> > Ext4 starts out with a realtime to SCN ratio of about 3.4 and ends down
> > around a factor 2.2.
> >
> > ZFS starts out with a realtime to SCN ratio of about 7.5 and ends down
> > around a factor 4.4.
> 
> So zfsonlinux is actually faster than ext4 for that purpuse? coool !

Yes, rather amazingly fast - again, seems to us ZFS is optimized for write 
while btrfs is optimized for read.

> Just wondering, did you use "discard" option by any chance? In my
> experience it makes btrfs MUCH slower.

I actually don't remember when we added this (we started out without it), 
but I don't recall seeing a major difference. We should disable it however,
since the stupid fancy HP RAID controller refuses to pass on TRIM and SMART
commands anyway (and the proprietary HP SSD tools refuse to access 
non-enterprise HP SSDs).


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-17  8:45 Experiences: Why BTRFS had to yield for ZFS Casper Bang
                   ` (2 preceding siblings ...)
  2012-09-18 23:08 ` Gregory Farnum
@ 2012-09-19 15:25 ` Chris Mason
  2012-09-19 19:43   ` Casper Bang
  2012-10-08 14:38   ` Casper Bang
  3 siblings, 2 replies; 17+ messages in thread
From: Chris Mason @ 2012-09-19 15:25 UTC (permalink / raw)
  To: Casper Bang; +Cc: linux-btrfs@vger.kernel.org

On Mon, Sep 17, 2012 at 02:45:08AM -0600, Casper Bang wrote:
> Abstract
> For database testing purposes, a COW filesystem was needed in order to
> facilitate snapshotting and rollback, such as to provide mirrors of
> our production database at fixed intervals (every night and by
> demand).

Thanks for taking the time to write this up and follow through the thread.
It's always interesting to hear situations where btrfs doesn't work
well.

There are three basic problems with the database workloads on btrfs.
First is that we have higher latencies on writes because we are feeding
everything through helper threads for crcs.  Usually the extra latencies
don't show up because we have enough work in the pipeline to keep the
drive busy.

I don't believe the UEK kernels have the recent changes to do some of
the crc work inline (without handing off) for smaller synchronous IOs.

Second, on O_SYNC writes btrfs will write both the file metadata and
data into a special tree so we can be crash safe.  For big files this
tends to spend a lot of time looking for the extents in the file that
have changed.

Josef fixed that up and it is queued for the next merge window.

The third problem is that lots of random writes tend to make lots of
metadata.  If this doesn't fit in ram, we can end up doing many reads
that slow things down.  We're working on this now as well, but recent
kernels change how we cache things and should improve the results.
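
The access pattern behind the second and third points (small O_SYNC writes
at random offsets in a big file) can be sketched in a few lines. This is a
minimal Python illustration, assuming 8K blocks to match the database block
size mentioned earlier in the thread; the scratch file and function name are
hypothetical, and it is neither Oracle's I/O path nor a benchmark that was
run for this thread:

```python
import os
import random
import tempfile
import time

BLOCK_SIZE = 8 * 1024  # assumed 8K database block size


def sync_random_writes(path, file_blocks, writes, seed=0):
    """Overwrite random blocks of `path` through an O_SYNC descriptor.

    Each pwrite() returns only once the data has been handed to stable
    storage, so per-write latency in the filesystem shows up directly.
    Returns the elapsed time in seconds.
    """
    rng = random.Random(seed)
    payload = b"\0" * BLOCK_SIZE
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_SYNC, 0o600)
    try:
        # Size the file up front so writes land inside an existing file.
        os.ftruncate(fd, file_blocks * BLOCK_SIZE)
        start = time.perf_counter()
        for _ in range(writes):
            offset = rng.randrange(file_blocks) * BLOCK_SIZE
            os.pwrite(fd, payload, offset)
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch = os.path.join(scratch_dir, "scratch.dat")
        elapsed = sync_random_writes(scratch, file_blocks=1024, writes=256)
        print(f"256 synchronous 8K writes took {elapsed:.3f}s")
```

To compare filesystems, the scratch directory would need to live on the
filesystem under test, and the file would need to be far larger than this
toy size before metadata stops fitting in ram.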

-chris


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-19 15:25 ` Chris Mason
@ 2012-09-19 19:43   ` Casper Bang
  2012-10-08 14:38   ` Casper Bang
  1 sibling, 0 replies; 17+ messages in thread
From: Casper Bang @ 2012-09-19 19:43 UTC (permalink / raw)
  To: linux-btrfs

Chris Mason <chris.mason <at> fusionio.com> writes:
> There are three basic problems with the database workloads on btrfs.
> First is that we have higher latencies on writes because we are feeding
> everything through helper threads for crcs.  Usually the extra latencies
> don't show up because we have enough work in the pipeline to keep the
> drive busy.
> 
> I don't believe the UEK kernels have the recent changes to do some of
> the crc work inline (without handing off) for smaller synchronous IOs.
> 
> Second, on O_SYNC writes btrfs will write both the file metadata and
> data into a special tree so we can be crash safe.  For big files this
> tends to spend a lot of time looking for the extents in the file that
> have changed.
> 
> Josef fixed that up and it is queued for the next merge window.
> 
> The third problem is that lots of random writes tend to make lots of
> metadata.  If this doesn't fit in ram, we can end up doing many reads
> that slow things down.  We're working on this now as well, but recent
> kernels change how we cache things and should improve the results.

That's good to hear - personally I'd rather use btrfs than ZFS, but it seems we 
were a tad bit early to the party with this kind of workload. Interesting nobody 
commented on block-size, I kind of expected that when writing my initial post 
(database using 8KB blocks, tweakable in ZFS but apparently not in btrfs).

/Casper



* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-09-19 15:25 ` Chris Mason
  2012-09-19 19:43   ` Casper Bang
@ 2012-10-08 14:38   ` Casper Bang
  2012-10-08 20:59     ` Avi Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Casper Bang @ 2012-10-08 14:38 UTC (permalink / raw)
  To: linux-btrfs

> Thanks for taking the time to write this up follow through the thread.
> It's always interesting to hear situations where btrfs doesn't work
> well.
> 
> There are three basic problems with the database workloads on btrfs.
> First is that we have higher latencies on writes because we are feeding
> everything through helper threads for crcs.  Usually the extra latencies
> don't show up because we have enough work in the pipeline to keep the
> drive busy.
> 
> I don't believe the UEK kernels have the recent changes to do some of
> the crc work inline (without handing off) for smaller synchronous IOs.
> 
> Second, on O_SYNC writes btrfs will write both the file metadata and
> data into a special tree so we can be crash safe.  For big files this
> tends to spend a lot of time looking for the extents in the file that
> have changed.
> 
> Josef fixed that up and it is queued for the next merge window.
> 
> The third problem is that lots of random writes tend to make lots of
> metadata.  If this doesn't fit in ram, we can end up doing many reads
> that slow things down.  We're working on this now as well, but recent
> kernels change how we cache things and should improve the results.

I feel I should update my previous thread about performance issues using btrfs 
in light of recent findings. We have discovered that, in all likelihood, what we 
experienced and what was described, was not a problem with btrfs per se, but a 
result of a more general issue which btrfs was just really good at exposing 
(using threads more aggressively than zfs?!).

Various benchmarks in Java (thread-pool setup/shutdown) and C (pthreads creation 
and joining) have shown that our Xeon/E5-2620 server with the latest Oracle 
Unbreakable Linux is very slow at serving up new threads (benchmarks 
available upon request).

Java threading benchmark on Xeon/E5-2620 @ 2.0GHz:
Oracle Unbreakable Linux: 1m49s realtime, 3m17s sys-time
Ubuntu:                   5s realtime, 3.9s sys-time
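
The benchmark programs themselves are not included in this thread. As a
rough stand-in, a minimal Python sketch of the same kind of measurement
(timing the creation and joining of short-lived threads); it is not the Java
or C code behind the numbers above, and the function name and thread count
are arbitrary choices for illustration:

```python
import threading
import time


def time_thread_churn(count):
    """Create, start and join `count` no-op threads one after another.

    In CPython on Linux each threading.Thread is a real OS thread, so
    the loop mostly measures thread creation and teardown cost.
    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for _ in range(count):
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    count = 10_000
    elapsed = time_thread_churn(count)
    print(f"{count} threads created and joined in {elapsed:.2f}s")
```

Running the same script under `time` on both systems would also give the
realtime versus sys-time split reported above.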

We are not sure how to continue investigating why the Oracle Linux/Kernel 
performs so poorly (scheduler, kernel config etc?), but it seems pretty obvious 
that this issue should be raised with Oracle rather than the btrfs developers - 
though we'll probably look into using another OS entirely. As such, apologies 
for creating the noise, btrfs was not to blame!

If you do have a suspicion or insight on the matter (perhaps you work for Oracle, 
or know OUK?), of course we'd love a followup off this list.

Kind regards,
Casper


* Re: Experiences: Why BTRFS had to yield for ZFS
  2012-10-08 14:38   ` Casper Bang
@ 2012-10-08 20:59     ` Avi Miller
  0 siblings, 0 replies; 17+ messages in thread
From: Avi Miller @ 2012-10-08 20:59 UTC (permalink / raw)
  To: Casper Bang; +Cc: linux-btrfs

Hi,

On 09/10/2012, at 1:38 AM, Casper Bang <casper.bang@gmail.com> wrote:

> If you do have a suspicion or insight on the matter (perhaps work for Oracle, or 
> know OUK?), of course we'd love a followup offline this list.


I've sent an email to Casper to follow this up offline.

Thanks,
Avi

--
Oracle <http://www.oracle.com>
Avi Miller | Principal Program Manager | +61 (412) 229 687
Oracle Linux and Virtualization
417 St Kilda Road, Melbourne, Victoria 3004 Australia