* 3.2 and 3.1 filesystem scalability measurements
@ 2012-01-30  4:09 Eric Whitney
  2012-01-30 15:13 ` aziro.linux.adm
  2012-01-30 15:36 ` Cédric Villemain
  0 siblings, 2 replies; 9+ messages in thread

From: Eric Whitney @ 2012-01-30 4:09 UTC (permalink / raw)
To: Ext4 Developers List, linux-fsdevel

I've posted the results of some 3.2 and 3.1 ext4 scalability
measurements and comparisons on a 48 core x86-64 server at:

http://free.linux.hp.com/~enw/ext4/3.2

This includes throughput and CPU efficiency graphs for five simple
workloads, the raw data for same, plus lockstats on ext4 filesystems
with and without journals.  The data have been useful in improving
ext4 scalability as a function of core and thread count in the past.

For reference, ext3, xfs, and btrfs data are also included.

The most notable improvement in 3.2 is a big scalability gain for
journaled ext4 when running the large_file_creates workload.  This
bisects cleanly to Wu Fengguang's IO-less balance_dirty_pages() patch,
which was included in the 3.2 merge window.

(Please note that the test system's hardware and firmware configuration
has changed since my last posting, so this data set cannot be directly
compared with my older sets.)

Thanks,
Eric

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
@ 2012-01-30 15:13 ` aziro.linux.adm
  2012-01-30 20:30   ` Andreas Dilger
  2012-01-30 15:36 ` Cédric Villemain
  1 sibling, 1 reply; 9+ messages in thread

From: aziro.linux.adm @ 2012-01-30 15:13 UTC (permalink / raw)
To: Eric Whitney; +Cc: Ext4 Developers List, linux-fsdevel

Hello List,

Is it fair to say that XFS shows the best average results over the
test?

Regards,
George

On 1/30/2012 06:09, Eric Whitney wrote:
> I've posted the results of some 3.2 and 3.1 ext4 scalability
> measurements and comparisons on a 48 core x86-64 server at:
>
> http://free.linux.hp.com/~enw/ext4/3.2
>
> [...]

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30 15:13 ` aziro.linux.adm
@ 2012-01-30 20:30   ` Andreas Dilger
  2012-01-31  0:14     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread

From: Andreas Dilger @ 2012-01-30 20:30 UTC (permalink / raw)
To: aziro.linux.adm; +Cc: Eric Whitney, Ext4 Developers List, linux-fsdevel

On 2012-01-30, at 8:13 AM, aziro.linux.adm wrote:
> Is it fair to say that XFS shows the best average results over the
> test?

Actually, I'm pleasantly surprised that ext4 does so much better than
XFS in the large file creates workload for 48 and 192 threads.  I would
have thought that this is XFS's bread-and-butter workload that justifies
its added code complexity (many threads writing to a multi-disk RAID
array), but XFS is about 25% slower in that case.  Conversely, XFS is
about 25% faster in the large file reads in the 192 thread case, but
only 15% faster in the 48 thread case.  Other tests show much smaller
differences, so in summary I'd say it is about even for these
benchmarks.

It is also interesting to see the ext4-nojournal performance as a
baseline to show what performance is achievable on the hardware by any
filesystem, but I don't think it is necessarily a fair comparison with
the other test configurations, since this mode is not usable for most
real systems.  It gives both ext4-journal and XFS a target for
improvement, by reducing the overhead of metadata consistency.

> [...]

Cheers, Andreas

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30 20:30   ` Andreas Dilger
@ 2012-01-31  0:14     ` Dave Chinner
  2012-01-31 10:53       ` Jan Kara
       [not found]       ` <20120131112726.GC3867@localhost>
  0 siblings, 2 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 0:14 UTC (permalink / raw)
To: Andreas Dilger
Cc: aziro.linux.adm, Eric Whitney, Ext4 Developers List, linux-fsdevel

On Mon, Jan 30, 2012 at 01:30:09PM -0700, Andreas Dilger wrote:
> Actually, I'm pleasantly surprised that ext4 does so much better than
> XFS in the large file creates workload for 48 and 192 threads.
> [...]
> Other tests show much smaller differences, so in summary I'd say it is
> about even for these benchmarks.

It appears to me from running the test locally that XFS is driving
deeper block device queues, and has a lot more writeback pages and
dirty inodes outstanding at any given point in time.  That indicates
to me that the storage array is the limiting factor, not the XFS code.

Typical BDI writeback state for ext4 is this:

BdiWriteback:            73344 kB
BdiReclaimable:         568960 kB
BdiDirtyThresh:         764400 kB
DirtyThresh:            764400 kB
BackgroundThresh:       382200 kB
BdiDirtied:          295613696 kB
BdiWritten:          294971648 kB
BdiWriteBandwidth:      690008 kBps
b_dirty:                    27
b_io:                       21
b_more_io:                   0
bdi_list:                    1
state:                      34

And for XFS:

BdiWriteback:           104960 kB
BdiReclaimable:         592384 kB
BdiDirtyThresh:         768876 kB
DirtyThresh:            768876 kB
BackgroundThresh:       384436 kB
BdiDirtied:          396727424 kB
BdiWritten:          396029568 kB
BdiWriteBandwidth:      668168 kBps
b_dirty:                    43
b_io:                       53
b_more_io:                   0
bdi_list:                    1
state:                      34

So XFS has substantially more pages under writeback at any given point
in time and more inodes dirty, but has slower throughput.  I ran some
traces on the writeback code and confirmed that the number of writeback
pages is different - ext4 is at 16-20,000, XFS is at 25-30,000 for the
entire traces.

I also found this oddity on both XFS and ext4:

 flush-253:32-3400 [001] 1936151.384563: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-898403 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [005] 1936151.455845: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-911663 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [006] 1936151.596298: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-931332 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-253:32-3400 [006] 1936151.719074: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-951001 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background

That's indicating that work->nr_pages is starting out extremely
negative, which should not be the case.  The highest I saw was around
-2m.  Something is not working right there, as writeback is supposed to
terminate if work->nr_pages < 0....
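
For anyone reading along, the loop that is supposed to enforce that
termination looks roughly like this - a from-memory sketch of the ~3.2
fs/fs-writeback.c, heavily trimmed and not a verbatim copy - and it is
what makes the negative values so odd, because the tracepoint fires
only a few lines after the nr_pages check:

	/* Sketch of the wb_writeback() main loop, fs/fs-writeback.c (~3.2) */
	static long wb_writeback(struct bdi_writeback *wb,
				 struct wb_writeback_work *work)
	{
		long nr_pages = work->nr_pages;
		long progress;

		for (;;) {
			/* Stop writeback when nr_pages has been consumed. */
			if (work->nr_pages <= 0)
				break;

			/* ... background/kupdate termination checks ... */

			/* The tracepoint fires just after the check above. */
			trace_writeback_start(wb->bdi, work);

			if (list_empty(&wb->b_io))
				queue_io(wb, work);
			if (work->sb)
				progress = writeback_sb_inodes(work->sb, wb, work);
			else
				progress = __writeback_inodes_wb(wb, work);

			/* writeback_sb_inodes() decrements work->nr_pages
			 * by the number of pages it wrote. */
			trace_writeback_written(wb->bdi, work);

			/* ... progress/livelock handling ... */
		}

		return nr_pages - work->nr_pages;
	}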
As it is, writeback is being done in chunks of roughly 6400-7000 pages
per inode, which is relatively large chunks and probably all the dirty
pages on the inode, because wbc->nr_to_write == 24576 is being passed
to .writepage.  ext4 is slightly higher than XFS, which is no surprise
if there are fewer dirty inodes in memory than for XFS.

So why is there a difference in performance?  Well, ext4 is simply
interleaving allocations based on the next file that is written back.
i.e:

	+------------+-------------+-------------+--- ...
	| A {0,24M}  | B {0, 24M}  | C {0, 24M}  | D ....
	+------------+-------------+-------------+--- ...

And as it moves along, we end up with:

	... +-------------+--------------+--------------+--- ...
	... | A {24M,24M} | B {24M, 24M} | C {24M, 24M} | D ....
	... +-------------+--------------+--------------+--- ...

The result is that ext4 is averaging 41 extents per 1GB file, but
writes are effectively sequential.  That's good for bandwidth, not so
good for keeping fragmentation under control.

XFS is behaving differently.  It is using speculative preallocation to
form extents larger than a single writeback instance.  That results in
some interleaving of extents, but files tend to look like this:

datafile1:
 EXT: FILE-OFFSET         BLOCK-RANGE         AG AG-OFFSET             TOTAL FLAGS
   0: [0..65535]:         546520..612055       0 (546520..612055)      65536 00000
   1: [65536..131071]:    1906392..1971927     0 (1906392..1971927)    65536 00000
   2: [131072..262143]:   5445336..5576407     0 (5445336..5576407)   131072 00000
   3: [262144..524287]:   14948056..15210199   0 (14948056..15210199) 262144 00000
   4: [524288..1048575]:  34084568..34608855   0 (34084568..34608855) 524288 00000
   5: [1048576..1877407]: 68163288..68992119   0 (68163288..68992119) 828832 00000

(32MB, 32MB, 64MB, 128MB, 256MB, 420MB sized extents at sample time)

and the average number of extents per file is 6.3.  Hence there is more
seeking during XFS writes, because it is not allocating space according
to the exact writeback pattern that is being driven by the VFS.

On my test setup, the difference in throughput was negligible, with
ffsb reporting 683MB/s for ext4 and 672MB/s for XFS at 48 threads.
However, I tested on a machine with only 4GB of RAM, which means that
writeback is being done in much smaller chunks per file than in Eric's
results.  That means that XFS will be doing much larger speculative
preallocation per file before writeback begins, so will be allocating
much larger extents from the start.  This will separate the per-file
writeback regions further than in my test, increasing seek distances,
and so should show more of a seek cost on larger RAM machines given the
same storage.

Therefore, on a machine with 256GB RAM, the differential between
sequential allocation per writeback call (i.e. interleaving across
inodes) as ext4 does and the minimal-fragmentation approach XFS takes
will be more significant.  We can see that from Eric's results, too.
However, given a large enough storage subsystem, this seek penalty is
effectively non-existent, so it is a fair tradeoff for a filesystem
that is expected to be used on machines with hundreds of drives behind
the filesystem.  The seek penalty is also non-existent on SSDs, so the
lower allocation and metadata overhead of creating larger extents is a
win there as well...

Of course, the obvious measurable difference as a result of these
writeback patterns is when it comes to reading back the files.  XFS
will have all 6-7 extents in-line in the inode, so requires no
additional IO to read the extent list.  The XFS files are more
contiguous than ext4's, so sequential reads will seek less.  Hence the
concurrent read loads perform better on XFS than on ext4, as also seen
in Eric's tests.
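
As an aside, the extent counts above come from xfs_bmap, which is
XFS-only.  The same number can be sampled on any filesystem with the
FIEMAP ioctl - a minimal sketch, not part of the test harness, with
error handling mostly omitted:

	/* count_extents.c: print the number of extents backing a file
	 * via the FIEMAP ioctl.  Minimal sketch. */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>
	#include <linux/fiemap.h>

	int main(int argc, char **argv)
	{
		struct fiemap fm;
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);
		if (fd < 0)
			return 1;

		memset(&fm, 0, sizeof(fm));
		fm.fm_start = 0;
		fm.fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
		fm.fm_flags = FIEMAP_FLAG_SYNC;    /* flush delalloc first */
		fm.fm_extent_count = 0;            /* count only, no records */

		if (ioctl(fd, FS_IOC_FIEMAP, &fm) < 0)
			return 1;

		printf("%s: %u extents\n", argv[1], fm.fm_mapped_extents);
		close(fd);
		return 0;
	}

Run over each data file, this prints the same kind of per-file extent
count that the averages above are based on.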
> It is also interesting to see the ext4-nojournal performance as a
> baseline to show what performance is achievable on the hardware by any
> filesystem, but I don't think it is necessarily a fair comparison with
> the other test configurations, since this mode is not usable for most
> real systems.  It gives both ext4-journal and XFS a target for
> improvement, by reducing the overhead of metadata consistency.

Maximum write bandwidth is not necessarily the goal we want to achieve.
Good write bandwidth, definitely, but experience has shown that
preventing writeback starvation and excessive fragmentation helps
ensure we can maintain that level of performance over the life of the
filesystem.  That's just as important (if not more important) than
maximising ultimate write speed for most production deployments....

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31  0:14     ` Dave Chinner
@ 2012-01-31 10:53       ` Jan Kara
  2012-01-31 12:55         ` Wu Fengguang
  2012-01-31 20:27         ` Dave Chinner
  1 sibling, 2 replies; 9+ messages in thread

From: Jan Kara @ 2012-01-31 10:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> I also found this oddity on both XFS and ext4:
>
>  flush-253:32-3400 [001] 1936151.384563: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-898403 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [005] 1936151.455845: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-911663 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [006] 1936151.596298: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-931332 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>  flush-253:32-3400 [006] 1936151.719074: writeback_start: bdi 253:32: sb_dev 0:0 nr_pages=-951001 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
>
> That's indicating that work->nr_pages is starting out extremely
> negative, which should not be the case.  The highest I saw was around
> -2m.  Something is not working right there, as writeback is supposed to
> terminate if work->nr_pages < 0....

  Ugh, what kernel is this?  The tracepoint is just a couple of lines
after

	if (work->nr_pages <= 0)
		break;

so I really don't see how that could happen.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply  [flat|nested] 9+ messages in thread
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31 10:53       ` Jan Kara
@ 2012-01-31 12:55         ` Wu Fengguang
  2012-01-31 20:27         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread

From: Wu Fengguang @ 2012-01-31 12:55 UTC (permalink / raw)
To: Jan Kara
Cc: Dave Chinner, Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue, Jan 31, 2012 at 11:53:53AM +0100, Jan Kara wrote:
> On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> > I also found this oddity on both XFS and ext4:
> >
> > [...]
> >
> > That's indicating that work->nr_pages is starting out extremely
> > negative, which should not be the case.  The highest I saw was around
> > -2m.  Something is not working right there, as writeback is supposed
> > to terminate if work->nr_pages < 0....
>
>   Ugh, what kernel is this?  The tracepoint is just a couple of lines
> after
>
> 	if (work->nr_pages <= 0)
> 		break;
>
> so I really don't see how that could happen.

It should be a recent kernel, judging from the "reason=background"
field.  I cannot find any such "writeback_start.*nr_pages=-" pattern in
my huge pile of saved tracing logs.  Since the background work is only
started by wb_check_background_flush() with .nr_pages = LONG_MAX, I can
only find patterns like these:

 flush-0:27-5216 [005] .... 472.119012: writeback_start: bdi 0:27: sb_dev 0:0 nr_pages=9223372036854775807 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-0:27-5216 [005] .... 472.119076: writeback_start: bdi 0:27: sb_dev 0:0 nr_pages=9223372036854775803 sync_mode=0 kupdate=0 range_cyclic=1 background=1 reason=background
 flush-9:0-5176  [025] .... 475.578426: writeback_start: bdi 9:0: sb_dev 0:0 nr_pages=361160 sync_mode=0 kupdate=1 range_cyclic=1 background=0 reason=periodic
 flush-9:0-5176  [025] .... 475.710138: writeback_start: bdi 9:0: sb_dev 0:0 nr_pages=328392 sync_mode=0 kupdate=1 range_cyclic=1 background=0 reason=periodic

Thanks,
Fengguang

^ permalink raw reply  [flat|nested] 9+ messages in thread
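For context: the background work Fengguang refers to is queued with
.nr_pages = LONG_MAX, which is where the 9223372036854775807 values in
his traces come from.  Roughly, from fs/fs-writeback.c circa 3.2 - a
paraphrase from memory, not a verbatim copy:

	static long wb_check_background_flush(struct bdi_writeback *wb)
	{
		if (over_bground_thresh()) {
			struct wb_writeback_work work = {
				.nr_pages	= LONG_MAX,
				.sync_mode	= WB_SYNC_NONE,
				.for_background	= 1,
				.range_cyclic	= 1,
				.reason		= WB_REASON_BACKGROUND,
			};

			/* Background writeback runs until it falls back
			 * under the dirty threshold, not until nr_pages
			 * is exhausted - hence the LONG_MAX budget. */
			return wb_writeback(wb, &work);
		}

		return 0;
	}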
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-31 10:53       ` Jan Kara
  2012-01-31 12:55         ` Wu Fengguang
@ 2012-01-31 20:27         ` Dave Chinner
  1 sibling, 0 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 20:27 UTC (permalink / raw)
To: Jan Kara
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel

On Tue, Jan 31, 2012 at 11:53:53AM +0100, Jan Kara wrote:
> On Tue 31-01-12 11:14:15, Dave Chinner wrote:
> > I also found this oddity on both XFS and ext4:
> >
> > [...]
> >
> > That's indicating that work->nr_pages is starting out extremely
> > negative, which should not be the case.  The highest I saw was around
> > -2m.  Something is not working right there, as writeback is supposed
> > to terminate if work->nr_pages < 0....
>
>   Ugh, what kernel is this?  The tracepoint is just a couple of lines
> after
>
> 	if (work->nr_pages <= 0)
> 		break;
>
> so I really don't see how that could happen.

Neither can I - it is the latest Linus kernel (3.3-rc1+).  I even
upgraded trace-cmd to the latest in case that was the problem, but it
didn't make any difference.  I haven't worked it out yet.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
[parent not found: <20120131112726.GC3867@localhost>]
* Re: 3.2 and 3.1 filesystem scalability measurements
       [not found]       ` <20120131112726.GC3867@localhost>
@ 2012-01-31 20:40         ` Dave Chinner
  0 siblings, 0 replies; 9+ messages in thread

From: Dave Chinner @ 2012-01-31 20:40 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andreas Dilger, aziro.linux.adm, Eric Whitney,
    Ext4 Developers List, linux-fsdevel, Jan Kara

On Tue, Jan 31, 2012 at 07:27:26PM +0800, Wu Fengguang wrote:
> On Tue, Jan 31, 2012 at 11:14:15AM +1100, Dave Chinner wrote:
> > It appears to me from running the test locally that XFS is driving
> > deeper block device queues, and has a lot more writeback pages and
> > dirty inodes outstanding at any given point in time.
> >
> > [BDI writeback state figures for ext4 and XFS elided; see upthread]
> >
> > So XFS has substantially more pages under writeback at any given
> > point in time and more inodes dirty, but has slower throughput.  I
> > ran some traces on the writeback code and confirmed that the number
> > of writeback pages is different - ext4 is at 16-20,000, XFS is at
> > 25-30,000 for the entire traces.
>
> Attached are two nr_writeback (the green line) graphs for test cases
>
>	xfs-1dd-1-3.2.0-rc3
>	ext4-1dd-1-3.2.0-rc3

The above numbers came from a 48-thread IO workload, not a single
thread, so I'm not really sure how much these will reflect the
behaviour of the workload in question.

> Where I notice that the lower nr_writeback segments of XFS equal the
> highest points of ext4, which should be decided by the block queue
> size.
>
> XFS seems to clear PG_writeback long after (up to 0.5s) IO completion,
> in big batches.  This is one of the reasons why XFS has a higher
> nr_writeback on average.

Hmmm, that implies a possible workqueue starvation to me.  IO
completions are processed in a workqueue, but that shouldn't be held
off for that long....

> The other two graphs show the writeback chunk size: ext4 is
> consistently 128MB while XFS is mostly 32MB.
> So it is a somewhat unfair comparison: ext4 has code to force 128MB in
> its write_cache_pages(), while XFS uses the smaller generic size "0.5s
> worth of data" computed in writeback_chunk_size().

Right - I noticed that too, but didn't really think it mattered all
that much, because the actual number of pages being written per inode
in the traces from my workload was effectively identical.  i.e. the
windup was irrelevant because the wbc->nr_to_write budget being sent by
the VFS writeback was nowhere near exhausted.

BTW, the ext4 comment about why it does this seems a bit out of date,
too - especially the bit about "XFS does this".  The only time XFS did
this was as a temporary workaround for other writeback issues (IIRC
between .32 and .35) that have long since been fixed.

Cheers,
Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply  [flat|nested] 9+ messages in thread
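For reference, the generic computation being discussed -
writeback_chunk_size() in fs/fs-writeback.c - looked approximately like
this around 3.2 (a paraphrase from memory, not a verbatim copy); the
bandwidth/2 term is the "0.5s worth of data" Fengguang mentions:

	static long writeback_chunk_size(struct backing_dev_info *bdi,
					 struct wb_writeback_work *work)
	{
		long pages;

		if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) {
			/* Sync modes write everything in one pass. */
			pages = LONG_MAX;
		} else {
			/*
			 * Limit each chunk to about 0.5s worth of data at
			 * the estimated write bandwidth, bounded by the
			 * global dirty limits, then round to a multiple of
			 * the minimum writeback chunk.
			 */
			pages = min(bdi->avg_write_bandwidth / 2,
				    global_dirty_limit / DIRTY_SCOPE);
			pages = min(pages, work->nr_pages);
			pages = round_down(pages + MIN_WRITEBACK_PAGES,
					   MIN_WRITEBACK_PAGES);
		}

		return pages;
	}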
* Re: 3.2 and 3.1 filesystem scalability measurements
  2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
  2012-01-30 15:13 ` aziro.linux.adm
@ 2012-01-30 15:36 ` Cédric Villemain
  1 sibling, 0 replies; 9+ messages in thread

From: Cédric Villemain @ 2012-01-30 15:36 UTC (permalink / raw)
To: Eric Whitney; +Cc: Ext4 Developers List, linux-fsdevel

On 30 January 2012 05:09, Eric Whitney <eric.whitney@hp.com> wrote:
> I've posted the results of some 3.2 and 3.1 ext4 scalability
> measurements and comparisons on a 48 core x86-64 server at:
>
> http://free.linux.hp.com/~enw/ext4/3.2
>
> This includes throughput and CPU efficiency graphs for five simple
> workloads, the raw data for same, plus lockstats on ext4 filesystems
> with and without journals.  The data have been useful in improving
> ext4 scalability as a function of core and thread count in the past.
>
> For reference, ext3, xfs, and btrfs data are also included.

Interesting to have all of them, yes.

> [...]

-- 
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: 24x7 Support - Development, Expertise and Training

^ permalink raw reply  [flat|nested] 9+ messages in thread
end of thread, other threads:[~2012-01-31 20:40 UTC | newest]

Thread overview: 9+ messages (-- links below jump to the message on this page --)
2012-01-30  4:09 3.2 and 3.1 filesystem scalability measurements Eric Whitney
2012-01-30 15:13 ` aziro.linux.adm
2012-01-30 20:30   ` Andreas Dilger
2012-01-31  0:14     ` Dave Chinner
2012-01-31 10:53       ` Jan Kara
2012-01-31 12:55         ` Wu Fengguang
2012-01-31 20:27         ` Dave Chinner
     [not found]       ` <20120131112726.GC3867@localhost>
2012-01-31 20:40         ` Dave Chinner
2012-01-30 15:36 ` Cédric Villemain