* 3.2 and 3.1 filesystem scalability measurements
@ 2012-01-30  4:09 Eric Whitney
  2012-01-30 15:13 ` aziro.linux.adm
  2012-01-30 15:36 ` Cédric Villemain
  0 siblings, 2 replies; 9+ messages in thread

From: Eric Whitney @ 2012-01-30 4:09 UTC (permalink / raw)
To: Ext4 Developers List, linux-fsdevel

I've posted the results of some 3.2 and 3.1 ext4 scalability
measurements and comparisons on a 48 core x86-64 server at:

http://free.linux.hp.com/~enw/ext4/3.2

This includes throughput and CPU efficiency graphs for five simple
workloads, the raw data for same, plus lockstats on ext4 filesystems
with and without journals.  The data have been useful in improving
ext4 scalability as a function of core and thread count in the past.

For reference, ext3, xfs, and btrfs data are also included.

The most notable improvement in 3.2 is a big scalability gain for
journaled ext4 when running the large_file_creates workload.  This
bisects cleanly to Wu Fengguang's IO-less balance_dirty_pages() patch,
which was included in the 3.2 merge window.

(Please note that the test system's hardware and firmware configuration
has changed since my last posting, so this data set cannot be directly
compared with my older sets.)

Thanks,
Eric

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
@ 2012-01-30 15:13 ` aziro.linux.adm
  2012-01-30 20:30   ` Andreas Dilger
  2012-01-30 15:36 ` Cédric Villemain
  1 sibling, 1 reply; 9+ messages in thread

From: aziro.linux.adm @ 2012-01-30 15:13 UTC (permalink / raw)
To: Eric Whitney; +Cc: Ext4 Developers List, linux-fsdevel

Hello List,

Is it fair to say that XFS shows the best average results over the
test?

Regards,
George

On 1/30/2012 06:09, Eric Whitney wrote:
> I've posted the results of some 3.2 and 3.1 ext4 scalability
> measurements and comparisons on a 48 core x86-64 server at:
>
> http://free.linux.hp.com/~enw/ext4/3.2
>
> [...]

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30 15:13 ` aziro.linux.adm
@ 2012-01-30 20:30   ` Andreas Dilger
  2012-01-31  0:14     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread

From: Andreas Dilger @ 2012-01-30 20:30 UTC (permalink / raw)
To: aziro.linux.adm; +Cc: Eric Whitney, Ext4 Developers List, linux-fsdevel

On 2012-01-30, at 8:13 AM, aziro.linux.adm wrote:
> Is it fair to say that XFS shows the best average results over the
> test?

Actually, I'm pleasantly surprised that ext4 does so much better than
XFS in the large file creates workload for 48 and 192 threads.  I would
have thought that this is XFS's bread-and-butter workload that justifies
its added code complexity (many threads writing to a multi-disk RAID
array), but XFS is about 25% slower in that case.  Conversely, XFS is
about 25% faster in the large file reads in the 192 thread case, but
only 15% faster in the 48 thread case.  Other tests show much smaller
differences, so in summary I'd say it is about even for these
benchmarks.

It is also interesting to see the ext4-nojournal performance as a
baseline to show what performance is achievable on the hardware by any
filesystem, but I don't think it is necessarily a fair comparison with
the other test configurations, since this mode is not usable for most
real systems.  It gives both ext4-journal and XFS a target for
improvement, by reducing the overhead of metadata consistency.

> [...]

Cheers, Andreas

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30 20:30   ` Andreas Dilger
@ 2012-01-31  0:14     ` Dave Chinner
  2012-01-31 10:53       ` Jan Kara
       [not found]       ` <20120131112726.GC3867@localhost>
  0 siblings, 2 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 0:14 UTC (permalink / raw)
To: Andreas Dilger
Cc: aziro.linux.adm, Eric Whitney, Ext4 Developers List, linux-fsdevel

On Mon, Jan 30, 2012 at 01:30:09PM -0700, Andreas Dilger wrote:
> Actually, I'm pleasantly surprised that ext4 does so much better than
> XFS in the large file creates workload for 48 and 192 threads.
> [...]
> Other tests show much smaller differences, so in summary I'd say it is
> about even for these benchmarks.

It appears to me from running the test locally that XFS is driving
deeper block device queues, and has a lot more writeback pages and
dirty inodes outstanding at any given point in time.  That indicates
to me that the storage array is the limiting factor, not the XFS code.

Typical BDI writeback state for ext4 is this:

BdiWriteback:            73344 kB
BdiReclaimable:         568960 kB
BdiDirtyThresh:         764400 kB
DirtyThresh:            764400 kB
BackgroundThresh:       382200 kB
BdiDirtied:          295613696 kB
BdiWritten:          294971648 kB
BdiWriteBandwidth:      690008 kBps
b_dirty:                    27
b_io:                       21
b_more_io:                   0
bdi_list:                    1
state:                      34

And for XFS:

BdiWriteback:           104960 kB
BdiReclaimable:         592384 kB
BdiDirtyThresh:         768876 kB
DirtyThresh:            768876 kB
BackgroundThresh:       384436 kB
BdiDirtied:          396727424 kB
BdiWritten:          396029568 kB
BdiWriteBandwidth:      668168 kBps
b_dirty:                    43
b_io:                       53
b_more_io:                   0
bdi_list:                    1
state:                      34

So XFS has substantially more pages under writeback at any given point
in time and more inodes dirty, but has slower throughput.  I ran some
traces on the writeback code and confirmed that the number of writeback
pages is different - ext4 is at 16-20,000, XFS is at 25-30,000 for the
entire traces.

I also found this oddity on both XFS and ext4:

 flush-253:32-3400 [001] 1936151.384563: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-898403 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [005] 1936151.455845: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-911663 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [006] 1936151.596298: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-931332 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [006] 1936151.719074: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-951001 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

That's indicating that work->nr_pages is starting out extremely
negative, which should not be the case.  The highest I saw was around
-2m.  Something is not working right there, as writeback is supposed to
terminate if work->nr_pages < 0....
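
For anyone reading along, the loop that is supposed to enforce that
termination looks roughly like this - a from-memory sketch of the ~3.2
fs/fs-writeback.c, heavily trimmed and not a verbatim copy - and it is
what makes the negative values so odd, because the tracepoint fires
only a few lines after the nr_pages check:

	/* Sketch of the wb_writeback() main loop, fs/fs-writeback.c (~3.2) */
	static long wb_writeback(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
	{
		long nr_pages = work->nr_pages;
		long progress;

		for (;;) {
			/* Stop writeback when nr_pages has been consumed. */
			if (work->nr_pages <= 0)
				break;

			/* ... background/kupdate termination checks ... */

			/* The tracepoint fires just after the check above. */
			trace_writeback_start(wb->bdi, work);

			if (list_empty(&wb->b_io))
				queue_io(wb, work);
			if (work->sb)
				progress = writeback_sb_inodes(work->sb, wb, work);
			else
				progress = __writeback_inodes_wb(wb, work);

			/* writeback_sb_inodes() decrements work->nr_pages
			 * by the number of pages it wrote. */
			trace_writeback_written(wb->bdi, work);

			/* ... progress/livelock handling ... */
		}

		return nr_pages - work->nr_pages;
	}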
As it is, writeback is being done in chunks of roughly 6400-7000 pages
per inode, which is relatively large chunks and probably all the dirty
pages on the inode, because wbc->nr_to_write == 24576 is being passed
to .writepage.  ext4 is slightly higher than XFS, which is no surprise
if there are fewer dirty inodes in memory than for XFS.

So why is there a difference in performance?  Well, ext4 is simply
interleaving allocations based on the next file that is written back.
i.e:

	+------------+-------------+-------------+--- ...
	| A {0,24M}  | B {0, 24M}  | C {0, 24M}  | D ....
	+------------+-------------+-------------+--- ...

And as it moves along, we end up with:

	... +-------------+--------------+--------------+--- ...
	... | A {24M,24M} | B {24M, 24M} | C {24M, 24M} | D ....
	... +-------------+--------------+--------------+--- ...

The result is that ext4 is averaging 41 extents per 1GB file, but
writes are effectively sequential.  That's good for bandwidth, not so
good for keeping fragmentation under control.

XFS is behaving differently.  It is using speculative preallocation to
form extents larger than a single writeback instance.  That results in
some interleaving of extents, but files tend to look like this:

datafile1:
 EXT: FILE-OFFSET         BLOCK-RANGE         AG AG-OFFSET             TOTAL FLAGS
   0: [0..65535]:         546520..612055       0 (546520..612055)      65536 00000
   1: [65536..131071]:    1906392..1971927     0 (1906392..1971927)    65536 00000
   2: [131072..262143]:   5445336..5576407     0 (5445336..5576407)   131072 00000
   3: [262144..524287]:   14948056..15210199   0 (14948056..15210199) 262144 00000
   4: [524288..1048575]:  34084568..34608855   0 (34084568..34608855) 524288 00000
   5: [1048576..1877407]: 68163288..68992119   0 (68163288..68992119) 828832 00000

(32MB, 32MB, 64MB, 128MB, 256MB, 420MB sized extents at sample time)

and the average number of extents per file is 6.3.  Hence there is more
seeking during XFS writes, because it is not allocating space according
to the exact writeback pattern that is being driven by the VFS.

On my test setup, the difference in throughput was negligible, with
ffsb reporting 683MB/s for ext4 and 672MB/s for XFS at 48 threads.
However, I tested on a machine with only 4GB of RAM, which means that
writeback is being done in much smaller chunks per file than in Eric's
results.  That means that XFS will be doing much larger speculative
preallocation per file before writeback begins, so will be allocating
much larger extents from the start.  This will separate the per-file
writeback regions further than in my test, increasing seek distances,
and so should show more of a seek cost on larger RAM machines given the
same storage.

Therefore, on a machine with 256GB RAM, the differential between
sequential allocation per writeback call (i.e. interleaving across
inodes) as ext4 does and the minimal-fragmentation approach XFS takes
will be more significant.  We can see that from Eric's results, too.
However, given a large enough storage subsystem, this seek penalty is
effectively non-existent, so it is a fair tradeoff for a filesystem
that is expected to be used on machines with hundreds of drives behind
the filesystem.  The seek penalty is also non-existent on SSDs, so the
lower allocation and metadata overhead of creating larger extents is a
win there as well...

Of course, the obvious measurable difference as a result of these
writeback patterns is when it comes to reading back the files.  XFS
will have all 6-7 extents in-line in the inode, so requires no
additional IO to read the extent list.  The XFS files are more
contiguous than ext4's, so sequential reads will seek less.  Hence the
concurrent read loads perform better on XFS than on ext4, as also seen
in Eric's tests.
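
As an aside, the extent counts above come from xfs_bmap, which is
XFS-only.  The same number can be sampled on any filesystem with the
FIEMAP ioctl - a minimal sketch, not part of the test harness, with
error handling mostly omitted:

	/* count_extents.c: print the number of extents backing a file
	 * via the FIEMAP ioctl.  Minimal sketch. */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>
	#include <linux/fiemap.h>

	int main(int argc, char **argv)
	{
		struct fiemap fm;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);
		if (fd < 0)
			return 1;

		memset(&fm, 0, sizeof(fm));
		fm.fm_start = 0;
		fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
		fm.fm_flags = FIEMAP_FLAG_SYNC;    /* flush delalloc first */
		fm.fm_extent_count = 0;            /* count only, no records */

		if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
			return 1;

		printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
		close(fd);
		return 0;
	}

Run over each data file, this prints the same kind of per-file extent
count that the averages above are based on.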
> It is also interesting to see the ext4-nojournal performance as a
> baseline to show what performance is achievable on the hardware by any
> filesystem, but I don't think it is necessarily a fair comparison with
> the other test configurations, since this mode is not usable for most
> real systems.  It gives both ext4-journal and XFS a target for
> improvement, by reducing the overhead of metadata consistency.

Maximum write bandwidth is not necessarily the goal we want to achieve.
Good write bandwidth, definitely, but experience has shown that
preventing writeback starvation and excessive fragmentation helps
ensure we can maintain that level of performance over the life of the
filesystem.  That's just as important (if not more important) than
maximising ultimate write speed for most production deployments....

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31  0:14     ` Dave Chinner
@ 2012-01-31 10:53       ` Jan Kara
  2012-01-31 12:55         ` Wu Fengguang
  2012-01-31 20:27         ` Dave Chinner
  1 sibling, 2 replies; 9+ messages in thread

From: Jan Kara @ 2012-01-31 10:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> I also found this oddity on both XFS and ext4:
>
>  flush-253:32-3400 [001] 1936151.384563: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-898403 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [005] 1936151.455845: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-911663 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [006] 1936151.596298: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-931332 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [006] 1936151.719074: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-951001 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>
> That's indicating that work->nr_pages is starting out extremely
> negative, which should not be the case.  The highest I saw was around
> -2m.  Something is not working right there, as writeback is supposed to
> terminate if work->nr_pages < 0....

  Ugh, what kernel is this?  The tracepoint is just a couple of lines
after

	if (work->nr_pages <= 0)
		break;

so I really don't see how that could happen.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31 10:53       ` Jan Kara
@ 2012-01-31 12:55         ` Wu Fengguang
  2012-01-31 20:27         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread

From: Wu Fengguang @ 2012-01-31 12:55 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue, Jan 31, 2012 at 11:53:53AM +0100, Jan Kara wrote:
> On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> > I also found this oddity on both XFS and ext4:
> >
> > [...]
> >
> > That's indicating that work->nr_pages is starting out extremely
> > negative, which should not be the case.  The highest I saw was around
> > -2m.  Something is not working right there, as writeback is supposed
> > to terminate if work->nr_pages < 0....
>
>   Ugh, what kernel is this?  The tracepoint is just a couple of lines
> after
>
> 	if (work->nr_pages <= 0)
> 		break;
>
> so I really don't see how that could happen.

It should be a recent kernel, judging from the "reason=background"
field.  I cannot find any such "writeback_start.*nr_pages=-" pattern in
my huge pile of saved tracing logs.  Since the background work is only
started by wb_check_background_flush() with .nr_pages = LONG_MAX, I can
only find patterns like these:

 flush-0:27-5216 [005] .... 472.119012: writeback_start: bdi 0:27: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-0:27-5216 [005] .... 472.119076: writeback_start: bdi 0:27: sb_dev 0:0 nr_pages=9223372036854775803 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-9:0-5176  [025] .... 475.578426: writeback_start: bdi 9:0: sb_dev 0:0 nr_pages=361160 sync_mode=0 kupdate=1 range_cyclic=1 background=0 reason=periodic
 flush-9:0-5176  [025] .... 475.710138: writeback_start: bdi 9:0: sb_dev 0:0 nr_pages=328392 sync_mode=0 kupdate=1 range_cyclic=1 background=0 reason=periodic

Thanks,
Fengguang

^ permalink raw reply  [flat|nested] 9+ messages in thread
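For context: the background work Fengguang refers to is queued with
.nr_pages = LONG_MAX, which is where the 9223372036854775807 values in
his traces come from.  Roughly, from fs/fs-writeback.c circa 3.2 - a
paraphrase from memory, not a verbatim copy:

	static long wb_check_background_flush(struct bdi_writeback *wb)
	{
		if (over_bground_thresh()) {
			struct wb_writeback_work work = {
				.nr_pages	= LONG_MAX,
				.sync_mode	= WB_SYNC_NONE,
				.for_background	= 1,
				.range_cyclic	= 1,
				.reason		= WB_REASON_BACKGROUND,
			};

			/* Background writeback runs until it falls back
			 * under the dirty threshold, not until nr_pages
			 * is exhausted - hence the LONG_MAX budget. */
			return wb_writeback(wb, &work);
		}

		return 0;
	}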
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31 10:53       ` Jan Kara
  2012-01-31 12:55         ` Wu Fengguang
@ 2012-01-31 20:27         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 20:27 UTC (permalink / raw)
To: Jan Kara
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue, Jan 31, 2012 at 11:53:53AM +0100, Jan Kara wrote:
> On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> > I also found this oddity on both XFS and ext4:
> >
> > [...]
> >
> > That's indicating that work->nr_pages is starting out extremely
> > negative, which should not be the case.  The highest I saw was around
> > -2m.  Something is not working right there, as writeback is supposed
> > to terminate if work->nr_pages < 0....
>
>   Ugh, what kernel is this?  The tracepoint is just a couple of lines
> after
>
> 	if (work->nr_pages <= 0)
> 		break;
>
> so I really don't see how that could happen.

Neither can I - it is the latest Linus kernel (3.3-rc1+).  I even
upgraded trace-cmd to the latest in case that was the problem, but it
didn't make any difference.  I haven't worked it out yet.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
[parent not found: <20120131112726.GC3867@localhost>]
* Re: 3.2 and 3.1 filesystem scalability measurements
       [not found]       ` <20120131112726.GC3867@localhost>
@ 2012-01-31 20:40         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 20:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel, Jan Kara

On Tue, Jan 31, 2012 at 07:27:26PM +0800, Wu Fengguang wrote:
> On Tue, Jan 31, 2012 at 11:14:15AM +1100, Dave Chinner wrote:
> > It appears to me from running the test locally that XFS is driving
> > deeper block device queues, and has a lot more writeback pages and
> > dirty inodes outstanding at any given point in time.
> >
> > [BDI writeback state figures for ext4 and XFS elided; see upthread]
> >
> > So XFS has substantially more pages under writeback at any given
> > point in time and more inodes dirty, but has slower throughput.  I
> > ran some traces on the writeback code and confirmed that the number
> > of writeback pages is different - ext4 is at 16-20,000, XFS is at
> > 25-30,000 for the entire traces.
>
> Attached are two nr_writeback (the green line) graphs for test cases
>
>	xfs-1dd-1-3.2.0-rc3
>	ext4-1dd-1-3.2.0-rc3

The above numbers came from a 48-thread IO workload, not a single
thread, so I'm not really sure how much these will reflect the
behaviour of the workload in question.

> Where I notice that the lower nr_writeback segments of XFS equal the
> highest points of ext4, which should be decided by the block queue
> size.
>
> XFS seems to clear PG_writeback long after (up to 0.5s) IO completion,
> in big batches.  This is one of the reasons why XFS has a higher
> nr_writeback on average.

Hmmm, that implies a possible workqueue starvation to me.  IO
completions are processed in a workqueue, but that shouldn't be held
off for that long....

> The other two graphs show the writeback chunk size: ext4 is
> consistently 128MB while XFS is mostly 32MB.
> So it is a somewhat unfair comparison: ext4 has code to force 128MB in
> its write_cache_pages(), while XFS uses the smaller generic size "0.5s
> worth of data" computed in writeback_chunk_size().

Right - I noticed that too, but didn't really think it mattered all
that much, because the actual number of pages being written per inode
in the traces from my workload was effectively identical.  i.e. the
windup was irrelevant because the wbc->nr_to_write budget being sent by
the VFS writeback was nowhere near exhausted.

BTW, the ext4 comment about why it does this seems a bit out of date,
too - especially the bit about "XFS does this".  The only time XFS did
this was as a temporary workaround for other writeback issues (IIRC
between .32 and .35) that have long since been fixed.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
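For reference, the generic computation being discussed -
writeback_chunk_size() in fs/fs-writeback.c - looked approximately like
this around 3.2 (a paraphrase from memory, not a verbatim copy); the
bandwidth/2 term is the "0.5s worth of data" Fengguang mentions:

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
			/* Sync modes write everything in one pass. */
			pages = LONG_MAX;
		} else {
			/*
			 * Limit each chunk to about 0.5s worth of data at
			 * the estimated write bandwidth, bounded by the
			 * global dirty limits, then round to a multiple of
			 * the minimum writeback chunk.
			 */
			pages = min(bdi->avg_write_bandwidth / 2,
				    global_dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			pages = round_down(pages + MIN_WRITEBACK_PAGES,
					   MIN_WRITEBACK_PAGES);
		}

		return pages;
	}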
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
  2012-01-30 15:13 ` aziro.linux.adm
@ 2012-01-30 15:36 ` Cédric Villemain
  1 sibling, 0 replies; 9+ messages in thread

From: Cédric Villemain @ 2012-01-30 15:36 UTC (permalink / raw)
To: Eric Whitney; +Cc: Ext4 Developers List, linux-fsdevel

On 30 January 2012 05:09, Eric Whitney <eric.whitney@hp.com> wrote:
> I've posted the results of some 3.2 and 3.1 ext4 scalability
> measurements and comparisons on a 48 core x86-64 server at:
>
> http://free.linux.hp.com/~enw/ext4/3.2
>
> This includes throughput and CPU efficiency graphs for five simple
> workloads, the raw data for same, plus lockstats on ext4 filesystems
> with and without journals.  The data have been useful in improving
> ext4 scalability as a function of core and thread count in the past.
>
> For reference, ext3, xfs, and btrfs data are also included.

Interesting to have all of them, yes.

> [...]

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: 24x7 Support - Development, Expertise and Training

^ permalink raw reply  [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-01-31 20:40 UTC | newest]

Thread overview: 9+ messages (-- links below jump to the message on this page --)
2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
2012-01-30 15:13 ` aziro.linux.adm
2012-01-30 20:30   ` Andreas Dilger
2012-01-31  0:14     ` Dave Chinner
2012-01-31 10:53       ` Jan Kara
2012-01-31 12:55         ` Wu Fengguang
2012-01-31 20:27         ` Dave Chinner
     [not found]       ` <20120131112726.GC3867@localhost>
2012-01-31 20:40         ` Dave Chinner
2012-01-30 15:36 ` Cédric Villemain