* Re: ext4 vs btrfs performance on SSD array
From: Theodore Ts'o @ 2014-09-02 11:31 UTC
To: Christoph Hellwig
Cc: Dave Chinner, Nikolai Grigoriev, linux-btrfs, linux-fsdevel,
linux-raid, linux-mm, Jens Axboe
> - the very small max readahead size
For things like the readahead size, that's probably something that we
should autotune, based on the time it takes to read N sectors. i.e.,
start N relatively small, such as 128k, and then bump it up based on
how long it takes to do a sequential read of N sectors until it hits a
given tunable, which is specified in milliseconds instead of kilobytes.
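A rough userspace approximation of that loop, just to illustrate the shape
of it (sdd, the 4M ceiling and the 50ms budget are invented for the example;
the real thing would live in the kernel's readahead code):
  DEV=/dev/sdd; BUDGET_MS=50; ra_kb=128
  while [ $ra_kb -lt 4096 ]; do
      next=$((ra_kb * 2))
      t0=$(date +%s%N)
      dd if=$DEV of=/dev/null bs=${next}k count=1 iflag=direct 2>/dev/null
      ms=$(( ($(date +%s%N) - t0) / 1000000 ))
      [ $ms -gt $BUDGET_MS ] && break   # the larger window busts the time budget
      ra_kb=$next                       # otherwise accept it and keep doubling
  done
  echo $ra_kb > /sys/block/${DEV#/dev/}/queue/read_ahead_kb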
> - replacing cfq with deadline (or noop)
Unfortunately, that will break ionice and a number of other things...
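For example, invocations like these (paths and values are only illustrative)
rely on the CFQ priority classes and silently become no-ops under deadline or
noop:
  ionice -c2 -n7 tar cf /backup/home.tar /home   # best-effort class, lowest priority
  ionice -c3 updatedb                            # idle class: run only when the disk is idle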
- Ted
* Re: ext4 vs btrfs performance on SSD array
From: Jan Kara @ 2014-09-02 14:20 UTC
To: Theodore Ts'o
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Tue 02-09-14 07:31:04, Ted Tso wrote:
> > - the very small max readahead size
>
> For things like the readahead size, that's probably something that we
> should autotune, based on the time it takes to read N sectors. i.e.,
> start N relatively small, such as 128k, and then bump it up based on
> how long it takes to do a sequential read of N sectors until it hits a
> given tunable, which is specified in milliseconds instead of kilobytes.
Actually the amount of readahead we do is autotuned (based on hit rate).
So I would keep the setting in sysfs as the maximum size adaptive readahead
can ever read and we can bump it up. We can possibly add another feedback
into the readahead code to tune the actual readahead size depending on device
speed but we'd have to research exactly what algorithm would work best.
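For concreteness, the sysfs maximum in question is the per-device
read_ahead_kb (sdd below is just an example device):
  cat /sys/block/sdd/queue/read_ahead_kb     # the cap adaptive readahead works under; 128 by default
  echo 1024 > /sys/block/sdd/queue/read_ahead_kb
  # same effect as 'blockdev --setra 2048 /dev/sdd' (setra counts 512-byte sectors)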
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: ext4 vs btrfs performance on SSD array
From: Theodore Ts'o @ 2014-09-02 14:55 UTC
To: Jan Kara
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Tue, Sep 02, 2014 at 04:20:24PM +0200, Jan Kara wrote:
> On Tue 02-09-14 07:31:04, Ted Tso wrote:
> > > - the very small max readahead size
> >
> > For things like the readahead size, that's probably something that we
> > should autotune, based on the time it takes to read N sectors. i.e.,
> > start N relatively small, such as 128k, and then bump it up based on
> > how long it takes to do a sequential read of N sectors until it hits a
> > given tunable, which is specified in milliseconds instead of kilobytes.
> Actually the amount of readahead we do is autotuned (based on hit rate).
> So I would keep the setting in sysfs as the maximum size adaptive readahead
> can ever read and we can bump it up. We can possibly add another feedback
> into the readahead code to tune the actual readahead size depending on device
> speed but we'd have to research exactly what algorithm would work best.
I do think we will need to add a time-based cap when bumping up the max
adaptive readahead; otherwise what could happen is that if we are
streaming off of a slow block device, the readahead could easily grow
to the point where it starts affecting the latency of competing read
requests to the slow block device.
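Back-of-the-envelope, such a cap would just be throughput times the latency
budget; with invented numbers:
  echo $(( 20 * 1024 * 50 / 1000 ))    # 20 MB/s USB stick, 50 ms budget -> 1024 kB of readahead
  echo $(( 500 * 1024 * 50 / 1000 ))   # 500 MB/s SSD, same budget       -> 25600 kB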
I suppose we could make the argument that it's not needed, because in most
situations where we might be using slow block devices, the streaming
reader will likely have exclusive use of the device, since no one
would be crazy enough to, say, try to run a live CD-ROM image when USB
sticks are so cheap. :-)
So maybe in practice it won't matter, but I think some kind of time-based
cap would probably be a good idea.
- Ted
* Re: ext4 vs btrfs performance on SSD array
From: Zack Coffey @ 2014-09-02 12:55 UTC
Cc: linux-btrfs, linux-fsdevel, linux-raid, linux-mm
While I'm sure some of those settings were selected with good reason,
maybe a few of them (2 or 3) could get some basic intelligence at
creation time to pick a saner value: some quick checks to see whether
another option might be better suited to the fs. Take the RAID5 stripe
size: leave the default as is, but do a quick speed test to automatically
choose from a handful of the most common values. If the test fails or
nothing better is found, apply the default value just like it would now.
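A sketch of what such a probe could look like, using the md stripe cache
(which, unlike the on-disk layout, can be retried without recreating the
array); md0, the value list and the 256MB test write are all invented, and
this only makes sense on a brand-new, still-empty array:
  MD=md0
  DEFAULT=$(cat /sys/block/$MD/md/stripe_cache_size)
  best=$DEFAULT; best_ns=999999999999
  for size in 256 1024 4096 8192; do
      echo $size > /sys/block/$MD/md/stripe_cache_size || continue
      t0=$(date +%s%N)
      dd if=/dev/zero of=/dev/$MD bs=1M count=256 oflag=direct 2>/dev/null || continue
      ns=$(( $(date +%s%N) - t0 ))
      if [ $ns -lt $best_ns ]; then best_ns=$ns; best=$size; fi
  done
  echo $best > /sys/block/$MD/md/stripe_cache_size   # default wins if nothing measured better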
On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
>> Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
> and XFS are doing 128k IOs because that's the default block
>> device readahead size. 'blockdev --setra 1024 /dev/sdd' before
>> mounting the filesystem will probably fix it.
>
> Btw, it's really getting time to make Linux storage fs work out of the
> box. There's way too many things that are stupid by default and we
> require everyone to fix up manually:
>
> - the ridiculously low max_sectors default
> - the very small max readahead size
> - replacing cfq with deadline (or noop)
> - the too small RAID5 stripe cache size
>
> and probably a few I forgot about. It's time to make things perform
> well out of the box..
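For reference, the manual fixups that list refers to map to sysfs roughly
like this today (device names and values are only examples):
  echo 1024     > /sys/block/sdd/queue/max_sectors_kb    # raise the request size cap
  echo 1024     > /sys/block/sdd/queue/read_ahead_kb     # bigger readahead window
  echo deadline > /sys/block/sdd/queue/scheduler         # swap out cfq
  echo 8192     > /sys/block/md0/md/stripe_cache_size    # grow the RAID5 stripe cache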
* Re: ext4 vs btrfs performance on SSD array
From: Austin S Hemmelgarn @ 2014-09-02 13:40 UTC
To: Zack Coffey; +Cc: linux-btrfs, linux-fsdevel, linux-raid, linux-mm
I wholeheartedly agree. Of course, getting something other than CFQ as
the default I/O scheduler is going to be a difficult task. Enough
people upstream are convinced that we all NEED I/O priorities, when most
of what I see people doing with them is bandwidth provisioning, which
can be done much more accurately (and flexibly) using cgroups.
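For example, scheduler-independent throttling through the blkio controller
already covers the provisioning case (the group name, device numbers and the
10MB/s rate below are made up):
  mkdir /sys/fs/cgroup/blkio/backup
  echo "8:16 10485760" > /sys/fs/cgroup/blkio/backup/blkio.throttle.read_bps_device
  echo $$ > /sys/fs/cgroup/blkio/backup/tasks   # move this shell into the throttled group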
Ironically, I have run into issues with a lot of in-kernel defaults
recently, most of which originated in the DOS era, when a few MB of RAM
was high-end.
On 2014-09-02 08:55, Zack Coffey wrote:
> While I'm sure some of those settings were selected with good reason,
> maybe a few of them (2 or 3) could get some basic intelligence at
> creation time to pick a saner value: some quick checks to see whether
> another option might be better suited to the fs. Take the RAID5 stripe
> size: leave the default as is, but do a quick speed test to automatically
> choose from a handful of the most common values. If the test fails or
> nothing better is found, apply the default value just like it would now.
>
>
> On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
>>> Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
>> and XFS are doing 128k IOs because that's the default block
>>> device readahead size. 'blockdev --setra 1024 /dev/sdd' before
>>> mounting the filesystem will probably fix it.
>>
>> Btw, it's really getting time to make Linux storage fs work out of the
>> box. There's way too many things that are stupid by default and we
>> require everyone to fix up manually:
>>
>> - the ridiculously low max_sectors default
>> - the very small max readahead size
>> - replacing cfq with deadline (or noop)
>> - the too small RAID5 stripe cache size
>>
>> and probably a few I forgot about. It's time to make things perform
>> well out of the box..
* Re: ext4 vs btrfs performance on SSD array
From: NeilBrown @ 2014-09-03 0:01 UTC
To: Christoph Hellwig
Cc: Dave Chinner, Nikolai Grigoriev, linux-btrfs, linux-fsdevel,
linux-raid, linux-mm, Jens Axboe
On Mon, 1 Sep 2014 18:22:22 -0700 Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
> > Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
> > and XFS are doing 128k IOs because that's the default block
> > device readahead size. 'blockdev --setra 1024 /dev/sdd' before
> > mounting the filesystem will probably fix it.
>
> Btw, it's really getting time to make Linux storage fs work out of the
> box. There's way too many things that are stupid by default and we
> require everyone to fix up manually:
>
> - the ridiculously low max_sectors default
> - the very small max readahead size
> - replacing cfq with deadline (or noop)
> - the too small RAID5 stripe cache size
>
> and probably a few I forgot about. It's time to make things perform
> well out of the box..
Do we still need maximums at all?
There was a time when the queue limit in the block device (or bdi) was an
important part of the write throttle strategy. Without a queue limit, all of
memory could be consumed by pages under write-back, all queued for some device.
This wasn't healthy.
But since then the write throttling has been completely re-written. I'm not
certain (and should check) but I suspect it doesn't depend on submit_bio
blocking when the queue is full any more.
So can we just remove the limit on max_sectors and the RAID5 stripe cache
size? I'm certainly keen to remove the latter and just use a mempool if the
limit isn't needed.
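For anyone wanting to look at the limits in question, both are visible in
sysfs today (sdd/md0 are example devices):
  cat /sys/block/sdd/queue/max_sectors_kb      # the software cap being questioned
  cat /sys/block/sdd/queue/max_hw_sectors_kb   # what the hardware could actually take
  cat /sys/block/md0/md/stripe_cache_size      # RAID5/6 cache limit, 256 stripes by default
  cat /sys/block/md0/md/stripe_cache_active    # how much of it is in use right now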
I have seen reports that a very large raid5 stripe cache size can cause
a reduction in performance. I don't know why but I suspect it is a bug that
should be found and fixed.
Do we need max_sectors ??
NeilBrown
* Re: ext4 vs btrfs performance on SSD array
From: Christoph Hellwig @ 2014-09-05 16:08 UTC
To: NeilBrown
Cc: Christoph Hellwig, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
> Do we still need maximums at all?
I don't think we do. At least on any system I work with I have to
increase them to get good performance without any adverse effect on
throttling.
> So can we just remove the limit on max_sectors and the RAID5 stripe cache
> size? I'm certainly keen to remove the latter and just use a mempool if the
> limit isn't needed.
> I have seen reports that a very large raid5 stripe cache size can cause
> a reduction in performance. I don't know why but I suspect it is a bug that
> should be found and fixed.
>
> Do we need max_sectors ??
I'll send a patch to remove it and watch for the fireworks..
* Re: ext4 vs btrfs performance on SSD array
From: Jeff Moyer @ 2014-09-05 16:40 UTC
To: Christoph Hellwig
Cc: NeilBrown, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm, Jens Axboe
Christoph Hellwig <hch@infradead.org> writes:
> On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
>> Do we still need maximums at all?
>
> I don't think we do. At least on any system I work with I have to
> increase them to get good performance without any adverse effect on
> throttling.
>
>> So can we just remove the limit on max_sectors and the RAID5 stripe cache
>> size? I'm certainly keen to remove the latter and just use a mempool if the
>> limit isn't needed.
>> I have seen reports that a very large raid5 stripe cache size can cause
>> a reduction in performance. I don't know why but I suspect it is a bug that
>> should be found and fixed.
>>
>> Do we need max_sectors ??
I'm assuming we're talking about max_sectors_kb in
/sys/block/sdX/queue/.
> I'll send a patch to remove it and watch for the fireworks..
:) I've seen SSDs that actually degrade in performance if I/O sizes
exceed their internal page size (using artificial benchmarks; I never
confirmed that with actual workloads). Bumping the default might not be
bad, but getting rid of the tunable would be a step backwards, in my
opinion.
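For instance, on a device that really does fall off a cliff past its
internal page size, the per-device tunable is the only recourse (sdd and the
128k value are examples):
  cat /sys/block/sdd/queue/max_hw_sectors_kb      # upper bound the hardware advertises
  echo 128 > /sys/block/sdd/queue/max_sectors_kb  # clamp requests to 128k on this device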
Are you going to bump up BIO_MAX_PAGES while you're at it?
Cheers,
Jeff
* Re: ext4 vs btrfs performance on SSD array
From: Jens Axboe @ 2014-09-05 16:50 UTC
To: Jeff Moyer, Christoph Hellwig
Cc: NeilBrown, Dave Chinner, Nikolai Grigoriev, linux-btrfs,
linux-fsdevel, linux-raid, linux-mm
On 09/05/2014 10:40 AM, Jeff Moyer wrote:
> Christoph Hellwig <hch@infradead.org> writes:
>
>> On Wed, Sep 03, 2014 at 10:01:58AM +1000, NeilBrown wrote:
>>> Do we still need maximums at all?
>>
>> I don't think we do. At least on any system I work with I have to
>> increase them to get good performance without any adverse effect on
>> throttling.
>>
>>> So can we just remove the limit on max_sectors and the RAID5 stripe cache
>>> size? I'm certainly keen to remove the latter and just use a mempool if the
>>> limit isn't needed.
>>> I have seen reports that a very large raid5 stripe cache size can cause
>>> a reduction in performance. I don't know why but I suspect it is a bug that
>>> should be found and fixed.
>>>
>>> Do we need max_sectors ??
>
> I'm assuming we're talking about max_sectors_kb in
> /sys/block/sdX/queue/.
>
>> I'll send a patch to remove it and watch for the fireworks..
>
> :) I've seen SSDs that actually degrade in performance if I/O sizes
> exceed their internal page size (using artificial benchmarks; I never
> confirmed that with actual workloads). Bumping the default might not be
> bad, but getting rid of the tunable would be a step backwards, in my
> opinion.
>
> Are you going to bump up BIO_MAX_PAGES while you're at it?
The reason it's 256 right now (or since forever, actually) is that the
bio_vec array then fits in one single 4kb page. If you go higher, that would
require a higher order allocation. Not impossible, but it's definitely a
potential issue. It's a lot saner to string bios together at that point, with
separate 0 order allocs.
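The arithmetic behind that, assuming the usual 16-byte bio_vec on 64-bit (a
page pointer plus 32-bit len and offset):
  echo $(( 4096 / 16 ))   # 256 bio_vecs fit in one 4kB page
  echo $(( 256 * 4 ))     # so a single bio covers at most 256 x 4kB = 1024 kB of data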
--
Jens Axboe