* ext4 vs btrfs performance on SSD array
@ 2014-08-26 23:39 Nikolai Grigoriev
2014-08-27 7:10 ` Duncan
2014-09-02 0:08 ` Dave Chinner
0 siblings, 2 replies; 14+ messages in thread
From: Nikolai Grigoriev @ 2014-08-26 23:39 UTC (permalink / raw)
To: linux-btrfs
Hi,
This is not exactly a problem - I am trying to understand why BTRFS
demonstrates significantly higher throughput in my environment.
I am observing something that I cannot explain. I am trying to come up
with a good filesystem configuration using HP P420i controller and
SSDs (Intel S3500). Out of curiosity I have tried BTRFS (still
unstable so I can't really expect to be able to use it) and noticed
that the read speed is about 150% of ext4 - while write speed is
comparable.
To be clear, I am using RAID0 with two SSDs and a strip size of 256KB. I have
6 disks, so I created 3 logical disks of 2 SSDs each - just for testing. I
then formatted them with ext4, XFS and BTRFS.
When I write (something like dd if=/dev/zero of=test2 bs=512k
count=20000 conv=fdatasync,fsync) and watch the system with iostat,
I see that both BTRFS and EXT4 are writing at approximately the same
rate, with a similar number of write requests:
(ext4 - writing)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 0.00 0.00 0.00 1791.00 0.00 895.00 1023.43 141.73 78.97 0.56 100.00
(btrfs - writing)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdc 0.00 0.00 0.00 1786.00 0.00 893.00 1024.00 137.87 77.21 0.56 100.10
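(The read test was essentially the same file read back with dd - roughly the
following, with the page cache dropped first so the reads actually hit the
array; the exact command may have differed slightly:)
# echo 3 > /proc/sys/vm/drop_caches
# dd if=test2 of=/dev/null bs=512k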
When I read, I observe a different picture:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
(ext4 - reading)
sdb 0.00 0.00 4782.00 0.00 597.75 0.00 256.00 1.57 0.33 0.18 84.10
(btrfs - reading)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdc 207.00 0.00 1794.00 0.00 886.40 0.00 1011.90 10.59 5.90 0.56 100.00
(xfs - reading)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdd 0.00 0.00 4623.00 0.00 577.88 0.00 256.00 1.71 0.37 0.21 97.00
And this is what I see if I just try to read the block device with dd:
(reading block device)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sdb 132055.00 0.00 4259.00 0.00 532.38 0.00 256.00 1.61 0.38 0.23 99.80
sdc 131750.00 0.00 4250.00 0.00 531.25 0.00 256.00 1.58 0.37 0.24 100.00
sdc 142476.00 0.00 4596.00 0.00 574.50 0.00 256.00 1.61 0.35 0.20 92.40
All settings seem to be identical (I/O scheduler, readahead...) for
all 3 logical volumes.
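(For the record, the per-device settings I compared were along these lines -
a sketch of the checks rather than a full dump, repeated for sdb, sdc and sdd:)
# cat /sys/block/sdb/queue/scheduler       # active I/O scheduler is the one in brackets
# cat /sys/block/sdb/queue/read_ahead_kb   # block-device readahead, in KB
# cat /sys/block/sdb/queue/rotational      # 0 means the kernel sees it as an SSD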
So, everything else being equal, I clearly see that btrfs issues far fewer
read requests per second yet reads more bytes per second. And there is that
rrqm/s column - the number of merged read requests - which is non-zero only
on the device formatted with btrfs.
Kernel: 3.8.13-35.3.5.el6uek.x86_64 #2 SMP Fri Aug 8 21:58:11 PDT 2014
x86_64 x86_64 x86_64 GNU/Linux
Btrfs v0.20-rc1
# btrfs fi show
Label: 'cassandra-data' uuid: 6677e858-7a3e-4c76-861c-32977fd2fff9
Total devices 1 FS bytes used 49.22GB
devid 1 size 1.46TB used 60.02GB path /dev/sdc
Btrfs v0.20-rc1
# btrfs fi df /cassandra-data/disk2
Data: total=59.01GB, used=49.15GB
System: total=4.00MB, used=12.00KB
Metadata: total=1.01GB, used=72.20MB
dmesg:
btrfs: use ssd allocation scheme
btrfs: turning off barriers
btrfs: disk space caching is enabled
Puzzled... What is btrfs doing that ext4 and XFS do not, to make such a
difference?
P.S. No, it is not compression - that is disabled with the mount option
compress=no
--
Nikolai Grigoriev
(514) 772-5178
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-08-26 23:39 ext4 vs btrfs performance on SSD array Nikolai Grigoriev
@ 2014-08-27 7:10 ` Duncan
2014-08-27 21:59 ` Nikolai Grigoriev
2014-09-02 0:08 ` Dave Chinner
1 sibling, 1 reply; 14+ messages in thread
From: Duncan @ 2014-08-27 7:10 UTC (permalink / raw)
To: linux-btrfs
Nikolai Grigoriev posted on Tue, 26 Aug 2014 19:39:08 -0400 as excerpted:
> Kernel: 3.8.13-35.3.5.el6uek.x86_64 #2 SMP Fri Aug 8 21:58:11 PDT 2014
> x86_64 x86_64 x86_64 GNU/Linux
> Btrfs v0.20-rc1
I've no answer for your question, but you know how old both your kernel
and btrfs-progs versions are, for a filesystem under as heavy development
as btrfs is, right?
The normal recommendation is to run the latest stable series kernel,
3.16.x at this time, unless you have a specific reason not to (like the
below, or because you're specifically comparing multiple btrfs kernel-space
versions). Userspace isn't quite as critical, but 3.14.2 is current (with
3.16 soon to be released), and 3.12 was the first of the new versioning
sequence and is currently the minimum recommended. Btrfs-progs v0.20-rc1 is
as ancient as a 3.8 kernel.
There is, though, a currently known btrfs kworker thread lockup bug that
apparently only affects those using the compress mount option. Btrfs
converted from its own private worker threads to generic kworker threads in
3.15, so kernels before that aren't affected, while all current releases in
the 3.15 and 3.16 series (and 3.17 through rc2; rc3 should have the patch)
are. The patch is marked for stable, so it should end up in the 3.16 stable
series too, though probably not 3.15, which as a non-long-term-support
release is already EOL or close to it. (3.14 is an LTS, but as I said the
bug didn't affect it, so no backported patch is necessary.)
That would be a good reason to stay with 3.14 (which, again, is LTS) for the
time being, but anything older than that is older than recommended for
anything btrfs-related, and both kernel 3.8 and userspace v0.20-rc1 are
positively ancient.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-08-27 7:10 ` Duncan
@ 2014-08-27 21:59 ` Nikolai Grigoriev
0 siblings, 0 replies; 14+ messages in thread
From: Nikolai Grigoriev @ 2014-08-27 21:59 UTC (permalink / raw)
To: linux-btrfs
Duncan <1i5t5.duncan <at> cox.net> writes:
>
> Nikolai Grigoriev posted on Tue, 26 Aug 2014 19:39:08 -0400 as excerpted:
>
> > Kernel: 3.8.13-35.3.5.el6uek.x86_64 #2 SMP Fri Aug 8 21:58:11 PDT 2014
> > x86_64 x86_64 x86_64 GNU/Linux
>
> > Btrfs v0.20-rc1
>
> I've no answer for your question, but you know how old both your kernel
> and btrfs-progs versions are, for a filesystem under as heavy development
> as btrfs is, right?
Yes. As much as I'd like to run the latest and greatest, the company sticks
to OEL 6.5, so I have to play within those limits. The primary reason I asked
is that I noticed that "btrfs does it much better :)" and wanted to understand
why - either to understand ext4's limitations versus btrfs, or to find the
issues in my ext4 configuration.
Actually, I found the answer later last night: btrfs has its own readahead
implementation. So I had the idea of disabling it to see whether that makes
reads slower - and indeed, I can confirm that in my specific test scenario
the readahead, with its 4MB (default) buffer, was making a lot of the
difference. I think that was mostly due to the RAID-0 of 2 SSDs, but even on
a single filesystem it makes a difference.
Then I also realized that, since the gain comes from readahead, it won't be a
big game changer for Cassandra, which does lots of random reads.
But thanks anyway for the detailed explanation of BTRFS status. I'll surely
use it as soon as I can.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-08-26 23:39 ext4 vs btrfs performance on SSD array Nikolai Grigoriev
2014-08-27 7:10 ` Duncan
@ 2014-09-02 0:08 ` Dave Chinner
2014-09-02 1:22 ` Christoph Hellwig
1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2014-09-02 0:08 UTC (permalink / raw)
To: Nikolai Grigoriev; +Cc: linux-btrfs
On Tue, Aug 26, 2014 at 07:39:08PM -0400, Nikolai Grigoriev wrote:
> Hi,
>
> This is not exactly a problem - I am trying to understand why BTRFS
> demonstrates significantly higher throughput in my environment.
>
> I am observing something that I cannot explain. I am trying to come up
> with a good filesystem configuration using HP P420i controller and
> SSDs (Intel S3500). Out of curiosity I have tried BTRFS (still
> unstable so I can't really expect to be able to use it) and noticed
> that the read speed is about 150% of ext4 - while write speed is
> comparable.
...
> When I read, I observe a different picture:
>
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
> (ext4 - reading)
> sdb 0.00 0.00 4782.00 0.00 597.75 0.00 256.00 1.57 0.33 0.18 84.10
> (btrfs - reading)
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
> sdc 207.00 0.00 1794.00 0.00 886.40 0.00 1011.90 10.59 5.90 0.56 100.00
> (xfs - reading)
> Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
> sdd 0.00 0.00 4623.00 0.00 577.88 0.00 256.00 1.71 0.37 0.21 97.00
Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4
and XFS are doing 128k IOs because that's the default block
device readahead size. 'blockdev --setra 1024 /dev/sdd' before
mounting the filesystem will probably fix it.
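E.g. (1024 is in 512-byte sectors, i.e. a 512k readahead, to match the IO
size btrfs is already issuing; sdb/sdd are the ext4 and XFS devices from the
iostat output above):
# blockdev --getra /dev/sdb        # default: 256 sectors = 128k
# blockdev --setra 1024 /dev/sdb   # 512k readahead for the ext4 device
# blockdev --setra 1024 /dev/sdd   # and the same for the XFS device
After that, avgrq-sz for the ext4 and XFS reads should climb towards what
btrfs shows.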
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 0:08 ` Dave Chinner
@ 2014-09-02 1:22 ` Christoph Hellwig
2014-09-02 11:31 ` Theodore Ts'o
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-02 1:22 UTC (permalink / raw)
To: Dave Chinner
Cc: Nikolai Grigoriev, linux-btrfs, linux-fsdevel, linux-raid,
linux-mm, Jens Axboe
On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
> Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4
> and XFS are doing 128k IOs because that's the default block
> device readahead size. 'blockdev --setra 1024 /dev/sdd' before
> mounting the filesystem will probably fix it.
Btw, it's really getting time to make Linux storage and filesystems work out
of the box. There are way too many things that are stupid by default and that
we require everyone to fix up manually:
- the ridiculously low max_sectors default
- the very small max readahead size
- replacing cfq with deadline (or noop)
- the too small RAID5 stripe cache size
and probably a few I forgot about. It's time to make things perform
well out of the box..
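For the record, the manual fix-up boils down to a handful of one-liners on
every box - illustrative values only, and very much workload-dependent:
# echo 1024 > /sys/block/sdX/queue/max_sectors_kb    # raise the request size cap (bounded by max_hw_sectors_kb)
# echo 4096 > /sys/block/sdX/queue/read_ahead_kb     # larger readahead (or blockdev --setra)
# echo deadline > /sys/block/sdX/queue/scheduler     # drop cfq
# echo 8192 > /sys/block/mdX/md/stripe_cache_size    # bigger RAID5 stripe cache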
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 1:22 ` Christoph Hellwig
@ 2014-09-02 11:31 ` Theodore Ts'o
2014-09-02 14:20 ` Jan Kara
2014-09-02 12:55 ` Zack Coffey
2014-09-03 0:01 ` NeilBrown
2 siblings, 1 reply; 14+ messages in thread
From: Theodore Ts'o @ 2014-09-02 11:31 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Nikolai Grigoriev, linux-btrfs, linux-fsdevel,
linux-raid, linux-mm, Jens Axboe
> - the very small max readahead size
For things like the readahead size, that's probably something that we
should autotune based on the time it takes to read N sectors. I.e.,
start with N relatively small, such as 128k, and then bump it up based on
how long it takes to do a sequential read of N sectors, until that hits a
given tunable which is specified in milliseconds instead of kilobytes.
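To make the shape of that concrete, a crude userspace sketch of the idea
(the real logic would live in the kernel's readahead code; sdX and the 20ms
budget are placeholders):
budget_ms=20
ra_kb=128                                   # start small, like today's default
while :; do
    t0=$(date +%s%N)
    dd if=/dev/sdX of=/dev/null bs=${ra_kb}k count=1 iflag=direct 2>/dev/null
    ms=$(( ( $(date +%s%N) - t0 ) / 1000000 ))
    [ "$ms" -ge "$budget_ms" ] && break     # one window now costs >= the budget, stop growing
    ra_kb=$(( ra_kb * 2 ))                  # otherwise double the window and try again
done
echo "$ra_kb" > /sys/block/sdX/queue/read_ahead_kb   # cap readahead at the size found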
> - replacing cfq with deadline (or noop)
Unfortunately, that will break ionice and a number of other things...
- Ted
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 1:22 ` Christoph Hellwig
2014-09-02 11:31 ` Theodore Ts'o
@ 2014-09-02 12:55 ` Zack Coffey
2014-09-02 13:40 ` Austin S Hemmelgarn
2014-09-03 0:01 ` NeilBrown
2 siblings, 1 reply; 14+ messages in thread
From: Zack Coffey @ 2014-09-02 12:55 UTC (permalink / raw)
Cc: linux-btrfs, linux-fsdevel, linux-raid, linux-mm
While I'm sure some of those settings were selected with good reason,
maybe there could be a few options (2 or 3) with some basic
intelligence at creation time to pick a saner value.
Some checks to see whether one option or another might be better suited to
the fs - like the RAID5 stripe size. Leave the default as is, but maybe run a
quick speed test to automatically choose from a handful of the most
common values. If that fails or nothing better is found, then apply the
default value just as it would now.
On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
>> Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4
>> and XFS are doing 128k IOs because that's the default block
>> device readahead size. 'blockdev --setra 1024 /dev/sdd' before
>> mounting the filesystem will probably fix it.
>
> Btw, it's really getting time to make Linux storage and filesystems work
> out of the box. There are way too many things that are stupid by default
> and that we require everyone to fix up manually:
>
> - the ridiculously low max_sectors default
> - the very small max readahead size
> - replacing cfq with deadline (or noop)
> - the too small RAID5 stripe cache size
>
> and probably a few I forgot about. It's time to make things perform
> well out of the box..
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 12:55 ` Zack Coffey
@ 2014-09-02 13:40 ` Austin S Hemmelgarn
0 siblings, 0 replies; 14+ messages in thread
From: Austin S Hemmelgarn @ 2014-09-02 13:40 UTC (permalink / raw)
To: Zack Coffey; +Cc: linux-btrfs, linux-fsdevel, linux-raid, linux-mm
I wholeheartedly agree. Of course, getting something other than CFQ as
the default I/O scheduler is going to be a difficult task. Enough
people upstream are convinced that we all NEED I/O priorities, when most
of what I see people doing with them is bandwidth provisioning, which
can be done much more accurately (and flexibly) using cgroups.
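For example, with the blkio controller a per-device read-bandwidth cap is a
couple of writes to the cgroup filesystem (cgroup v1 paths; "throttled",
8:16 and the 100MB/s figure are placeholder values):
# mkdir /sys/fs/cgroup/blkio/throttled
# echo "8:16 104857600" > /sys/fs/cgroup/blkio/throttled/blkio.throttle.read_bps_device   # ~100MB/s cap on device 8:16
# echo $$ > /sys/fs/cgroup/blkio/throttled/tasks   # move the current shell into the group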
Ironically, there have been a lot of in-kernel defaults that I have run
into issues with recently, most of which originated in the DOS era,
when a few MB of RAM was high-end.
On 2014-09-02 08:55, Zack Coffey wrote:
> While I'm sure some of those settings were selected with good reason,
> maybe there could be a few options (2 or 3) with some basic
> intelligence at creation time to pick a saner value.
>
> Some checks to see whether one option or another might be better suited to
> the fs - like the RAID5 stripe size. Leave the default as is, but maybe run a
> quick speed test to automatically choose from a handful of the most
> common values. If that fails or nothing better is found, then apply the
> default value just as it would now.
>
>
> On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
>>> Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4
>>> and XFS are doing 128k IOs because that's the default block
>>> device readahead size. 'blockdev --setra 1024 /dev/sdd' before
>>> mounting the filesystem will probably fix it.
>>
>> Btw, it's really getting time to make Linux storage and filesystems work
>> out of the box. There are way too many things that are stupid by default
>> and that we require everyone to fix up manually:
>>
>> - the ridiculously low max_sectors default
>> - the very small max readahead size
>> - replacing cfq with deadline (or noop)
>> - the too small RAID5 stripe cache size
>>
>> and probably a few I forgot about. It's time to make things perform
>> well out of the box..
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 11:31 ` Theodore Ts'o
@ 2014-09-02 14:20 ` Jan Kara
2014-09-02 14:55 ` Theodore Ts'o
0 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2014-09-02 14:20 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Tue 02-09-14 07:31:04, Ted Tso wrote:
> > - the very small max readahead size
>
> For things like the readahead size, that's probably something that we
> should autotune based on the time it takes to read N sectors. I.e.,
> start with N relatively small, such as 128k, and then bump it up based on
> how long it takes to do a sequential read of N sectors, until that hits a
> given tunable which is specified in milliseconds instead of kilobytes.
Actually, the amount of readahead we do is already autotuned (based on hit
rate). So I would keep the setting in sysfs as the maximum size adaptive
readahead can ever read, and we can bump that up. We could possibly add
another feedback loop into the readahead code to tune the actual readahead
size depending on device speed, but we'd have to research exactly what
algorithm would work best.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 14:20 ` Jan Kara
@ 2014-09-02 14:55 ` Theodore Ts'o
0 siblings, 0 replies; 14+ messages in thread
From: Theodore Ts'o @ 2014-09-02 14:55 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Tue, Sep 02, 2014 at 04:20:24PM +0200, Jan Kara wrote:
> On Tue 02-09-14 07:31:04, Ted Tso wrote:
> > > - the very small max readahead size
> >
> > For things like the readahead size, that's probably something that we
> > should autotune based on the time it takes to read N sectors. I.e.,
> > start with N relatively small, such as 128k, and then bump it up based on
> > how long it takes to do a sequential read of N sectors, until that hits a
> > given tunable which is specified in milliseconds instead of kilobytes.
> Actually the amount of readahead we do is autotuned (based on hit rate).
> So I would keep the setting in sysfs as the maximum size adaptive readahead
> can ever read and we can bump it up. We can possibly add another feedback
> into the readahead code to tune actualy readahead size depending on device
> speed but we'd have to research exactly what algorithm would work best.
I do think we will need to add a time-based cap when bumping up the max
adaptive readahead; otherwise, what could happen is that if we are
streaming off of a slow block device, the readahead could easily grow
to the point where it starts affecting the latency of competing read
requests to that slow block device.
I suppose we could argue that it's not needed, because in most
situations where we might be using slow block devices, the streaming
reader will likely have exclusive use of the device, since no one
would be crazy enough to, say, try to run a live CD-ROM image when USB
sticks are so cheap. :-)
So maybe in practice it won't matter, but I think some kind of time-based
cap would probably be a good idea.
- Ted
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-02 1:22 ` Christoph Hellwig
2014-09-02 11:31 ` Theodore Ts'o
2014-09-02 12:55 ` Zack Coffey
@ 2014-09-03 0:01 ` NeilBrown
2014-09-05 16:08 ` Christoph Hellwig
2 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2014-09-03 0:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Dave Chinner, Nikolai Grigoriev, linux-btrfs, linux-fsdevel,
linux-raid, linux-mm, Jens Axboe
On Mon, 1 Sep 2014 18:22:22 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
> > Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4
> > and XFS are doing 128k IOs because that's the default block
> > device readahead size. 'blockdev --setra 1024 /dev/sdd' before
> > mounting the filesystem will probably fix it.
>
> Btw, it's really getting time to make Linux storage and filesystems work
> out of the box. There are way too many things that are stupid by default
> and that we require everyone to fix up manually:
>
> - the ridiculously low max_sectors default
> - the very small max readahead size
> - replacing cfq with deadline (or noop)
> - the too small RAID5 stripe cache size
>
> and probably a few I forgot about. It's time to make things perform
> well out of the box..
Do we still need maximums at all?
There was a time when the queue limit in the block device (or bdi) was an
important part of the write-throttle strategy. Without a queue limit, all of
memory could be consumed by data under write-back, all queued for some device.
That wasn't healthy.
But since then the write throttling has been completely re-written. I'm not
certain (and should check) but I suspect it doesn't depend on submit_bio
blocking when the queue is full any more.
So can we just remove the limit on max_sectors and the RAID5 stripe cache
size? I'm certainly keen to remove the latter and just use a mempool if the
limit isn't needed.
I have seen reports that a very large raid5 stripe cache size can cause
a reduction in performance. I don't know why but I suspect it is a bug that
should be found and fixed.
Do we need max_sectors ??
NeilBrown
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-03 0:01 ` NeilBrown
@ 2014-09-05 16:08 ` Christoph Hellwig
2014-09-05 16:40 ` Jeff Moyer
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2014-09-05 16:08 UTC (permalink / raw)
To: NeilBrown
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
> Do we still need maximums at all?
I don't think we do. At least on any system I work with I have to
increase them to get good performance without any adverse effect on
throttling.
> So can we just remove the limit on max_sectors and the RAID5 stripe cache
> size? I'm certainly keen to remove the latter and just use a mempool if the
> limit isn't needed.
> I have seen reports that a very large raid5 stripe cache size can cause
> a reduction in performance. I don't know why but I suspect it is a bug that
> should be found and fixed.
>
> Do we need max_sectors ??
I'll send a patch to remove it and watch for the fireworks..
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-05 16:08 ` Christoph Hellwig
@ 2014-09-05 16:40 ` Jeff Moyer
2014-09-05 16:50 ` Jens Axboe
0 siblings, 1 reply; 14+ messages in thread
From: Jeff Moyer @ 2014-09-05 16:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: NeilBrown, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
Christoph Hellwig <hch@infradead.org> writes:
> On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
>> Do we still need maximums at all?
>
> I don't think we do. At least on any system I work with I have to
> increase them to get good performance without any adverse effect on
> throttling.
>
>> So can we just remove the limit on max_sectors and the RAID5 stripe cache
>> size? I'm certainly keen to remove the latter and just use a mempool if the
>> limit isn't needed.
>> I have seen reports that a very large raid5 stripe cache size can cause
>> a reduction in performance. I don't know why but I suspect it is a bug that
>> should be found and fixed.
>>
>> Do we need max_sectors ??
I'm assuming we're talking about max_sectors_kb in
/sys/block/sdX/queue/.
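That is (the hardware limit sits right next to it and bounds how far the
soft cap can be raised):
# cat /sys/block/sdX/queue/max_sectors_kb      # soft cap on I/O size, in KB
# cat /sys/block/sdX/queue/max_hw_sectors_kb   # what the hardware/driver supports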
> I'll send a patch to remove it and watch for the fireworks..
:) I've seen SSDs that actually degrade in performance if I/O sizes
exceed their internal page size (using artificial benchmarks; I never
confirmed that with actual workloads). Bumping the default might not be
bad, but getting rid of the tunable would be a step backwards, in my
opinion.
Are you going to bump up BIO_MAX_PAGES while you're at it?
Cheers,
Jeff
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: ext4 vs btrfs performance on SSD array
2014-09-05 16:40 ` Jeff Moyer
@ 2014-09-05 16:50 ` Jens Axboe
0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2014-09-05 16:50 UTC (permalink / raw)
To: Jeff Moyer, Christoph Hellwig
Cc: NeilBrown, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm
On 09/05/2014 10:40 AM, Jeff Moyer wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>
>> On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
>>> Do we still need maximums at all?
>>
>> I don't think we do. At least on any system I work with I have to
>> increase them to get good performance without any adverse effect on
>> throttling.
>>
>>> So can we just remove the limit on max_sectors and the RAID5 stripe cache
>>> size? I'm certainly keen to remove the latter and just use a mempool if the
>>> limit isn't needed.
>>> I have seen reports that a very large raid5 stripe cache size can cause
>>> a reduction in performance. I don't know why but I suspect it is a bug that
>>> should be found and fixed.
>>>
>>> Do we need max_sectors ??
>
> I'm assuming we're talking about max_sectors_kb in
> /sys/block/sdX/queue/.
>
>> I'll send a patch to remove it and watch for the fireworks..
>
> :) I've seen SSDs that actually degrade in performance if I/O sizes
> exceed their internal page size (using artificial benchmarks; I never
> confirmed that with actual workloads). Bumping the default might not be
> bad, but getting rid of the tunable would be a step backwards, in my
> opinion.
>
> Are you going to bump up BIO_MAX_PAGES while you're at it?
The reason it's 256 right now (and has been since forever, actually) is that
this fits in a single 4kb page. If you go higher, that would require a
higher-order allocation. Not impossible, but it's definitely a potential
issue. It's a lot saner to string bios together at that point, with separate
order-0 allocs.
--
Jens Axboe
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread
Thread overview: 14+ messages
-- links below jump to the message on this page --
2014-08-26 23:39 ext4 vs btrfs performance on SSD array Nikolai Grigoriev
2014-08-27 7:10 ` Duncan
2014-08-27 21:59 ` Nikolai Grigoriev
2014-09-02 0:08 ` Dave Chinner
2014-09-02 1:22 ` Christoph Hellwig
2014-09-02 11:31 ` Theodore Ts'o
2014-09-02 14:20 ` Jan Kara
2014-09-02 14:55 ` Theodore Ts'o
2014-09-02 12:55 ` Zack Coffey
2014-09-02 13:40 ` Austin S Hemmelgarn
2014-09-03 0:01 ` NeilBrown
2014-09-05 16:08 ` Christoph Hellwig
2014-09-05 16:40 ` Jeff Moyer
2014-09-05 16:50 ` Jens Axboe