Re: MD write performance issue - found Catalyst patches

linux-raid.vger.kernel.org archive mirror

* Re: MD write performance issue - found Catalyst patches
@ 2009-10-18 10:00 mark delfman
  2009-10-18 22:39 ` NeilBrown
  2009-10-29  6:41 ` Neil Brown
  0 siblings, 2 replies; 18+ messages in thread
From: mark delfman @ 2009-10-18 10:00 UTC (permalink / raw)
  To: Mattias Hellström, Linux RAID Mailing List, NeilBrown

[-- Attachment #1: Type: text/plain, Size: 2020 bytes --]

We have tracked the performance drop to the attached two commits in
2.6.28.6. The performance never fully recovers in later kernels, so
I am presuming that the change in the write cache is still affecting MD
today.

The problem for us is that although we have slowly tracked it down, we
have no understanding of Linux at this level and simply wouldn’t know
where to go from this point.

Considering this seems to affect only MD and not hardware-based RAID
(in our tests), I thought that this would be an appropriate place to
post these patches and findings.

There are 2 patches which impact MD performance via a filesystem:

a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
nr_to_write counter
b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
writeback thinko, causing Berkeley DB slowdown


1) No patches applied to the 2.6.28.5 kernel: write speed is 1.1 GB/s via XFS
2) Both patches applied to the 2.6.28.5 kernel: XFS drops to circa
680 MB/s (as in kernel 2.6.28.6 and later)
3) Only one patch applied, 66c85494570396661479ba51e17964b2c82b6f39
(write-back: fix nr_to_write counter): performance goes down to circa
780 MB/s
4) Only one patch applied, fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 (Fix
page writeback thinko): the performance is good, 1.1 GB/s (on XFS)

Changelog for 2.6.28.6:
ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.28.6


Hopefully this helps to resolve this....


Mark

2009/10/17 Mattias Hellström <hellstrom.mattias@gmail.com>:
> (If I were you) I would further test the revisions between the
> following and then look at the changelog for the culprit. Looks like
> versions after this are just trying to regain the missing speed.
>
> Linux linux-tlfp 2.6.27.14-vanilla #1 SMP Fri Oct 16 00:56:25 BST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> RAW: 1.1 GB/s
> XFS: 1.1 GB/s
>
> Linux linux-tlfp 2.6.27.20-vanilla #1 SMP Thu Oct 15 23:59:32 BST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> RAW: 1.1 GB/s
> XFS: 487 MB/s
>

[-- Attachment #2: fa76ac6cbeb58256cf7de97a75d5d7f838a80b32.patch --]
[-- Type: application/octet-stream, Size: 2047 bytes --]

commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32
Author: Nick Piggin <npiggin@suse.de>
Date:   Thu Feb 12 04:34:23 2009 +0100

    Fix page writeback thinko, causing Berkeley DB slowdown
    
    commit 3a4c6800f31ea8395628af5e7e490270ee5d0585 upstream.
    
    A bug was introduced into write_cache_pages cyclic writeout by commit
    31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic
    fix").  The intention (and comments) is that we should cycle back and
    look for more dirty pages at the beginning of the file if there is no
    more work to be done.
    
    But the !done condition was dropped from the test.  This means that any
    time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
    will set index to 0, then goto again.  This will set done_index to
    index, then find done is set, so will proceed to the end of the
    function.  When updating mapping->writeback_index for cyclic writeout,
    we now use done_index == 0, so we're always cycling back to 0.
    
    This seemed to be causing random mmap writes (slapadd and iozone) to
    start writing more pages from the LRU and writeout would slowdown, and
    caused bugzilla entry
    
    	http://bugzilla.kernel.org/show_bug.cgi?id=12604
    
    about Berkeley DB slowing down dramatically.
    
    With this patch, iozone random write performance is increased nearly
    5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).
    
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Reported-and-tested-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 08d2b96..11400ed 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -997,7 +997,7 @@ continue_unlock:
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-	if (!cycled) {
+	if (!cycled && !done) {
 		/*
 		 * range_cyclic:
 		 * We hit the last page and there is more work to be done: wrap

[-- Attachment #3: 66c85494570396661479ba51e17964b2c82b6f39.patch --]
[-- Type: application/octet-stream, Size: 2347 bytes --]

commit 66c85494570396661479ba51e17964b2c82b6f39
Author: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Date:   Mon Feb 2 18:33:49 2009 +0200

    write-back: fix nr_to_write counter
    
    commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa upstream.
    
    Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some
    @wbc->nr_to_write breakage.
    
    It made the following changes:
     1. Decrement wbc->nr_to_write instead of nr_to_write
     2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
     3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
        WB_SYNC_NONE, otherwise keep going.
    
    However, according to the commit message, the intention was to only make
    change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
    and it breaks UBIFS expectations, so if needed, it should be done
    separately later.  And change 2 does not seem to be documented in the
    commit message.
    
    This patch does the following:
     1. Undo changes 1 and 2
     2. Add a comment explaining change 3 (it very useful to have comments
        in _code_, not only in the commit).
    
    Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
    Acked-by: Nick Piggin <npiggin@suse.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 11400ed..0c4100e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -981,13 +981,22 @@ continue_unlock:
 				}
  			}
 
-			if (wbc->sync_mode == WB_SYNC_NONE) {
-				wbc->nr_to_write--;
-				if (wbc->nr_to_write <= 0) {
-					done = 1;
-					break;
-				}
+			if (nr_to_write > 0)
+				nr_to_write--;
+			else if (wbc->sync_mode == WB_SYNC_NONE) {
+				/*
+				 * We stop writing back only if we are not
+				 * doing integrity sync. In case of integrity
+				 * sync we have to keep going because someone
+				 * may be concurrently dirtying pages, and we
+				 * might have synced a lot of newly appeared
+				 * dirty pages, but have not synced all of the
+				 * old dirty pages.
+				 */
+				done = 1;
+				break;
 			}
+
 			if (wbc->nonblocking && bdi_write_congested(bdi)) {
 				wbc->encountered_congestion = 1;
 				done = 1;


* Re: MD write performance issue - found Catalyst patches
  2009-10-18 10:00 MD write performance issue - found Catalyst patches mark delfman
@ 2009-10-18 22:39 ` NeilBrown
  2009-10-29  6:41 ` Neil Brown
  1 sibling, 0 replies; 18+ messages in thread
From: NeilBrown @ 2009-10-18 22:39 UTC (permalink / raw)
  To: mark delfman; +Cc: Mattias Hellström, Linux RAID Mailing List

On Sun, October 18, 2009 9:00 pm, mark delfman wrote:
> We have tracked the performance drop to the attached two commits in
> 2.6.28.6. The performance never fully recovers in later kernels, so
> I am presuming that the change in the write cache is still affecting MD
> today.
>
> The problem for us is that although we have slowly tracked it down, we
> have no understanding of Linux at this level and simply wouldn’t know
> where to go from this point.
>
> Considering this seems to affect only MD and not hardware-based RAID
> (in our tests), I thought that this would be an appropriate place to
> post these patches and findings.
>
> There are 2 patches which impact MD performance via a filesystem:
>
> a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
> nr_to_write counter
> b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
> writeback thinko, causing Berkeley DB slowdown
>
>
> 1) no patches applied into 2.6.28.5 kernel: write speed is 1.1 GB/s via
> xfs
> 2) both patches are applied into 2.6.28.5 kernel: xfs drops to circa:
> 680 MB/s (like in kernel 2.6.28.6 and later)
> 3) put only one patch: 66c85494570396661479ba51e17964b2c82b6f39
> (write-back: fix nr_to_write counter) - performance goes down to circa
> 780 MB/s
> 4) put only one patch: fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 (Fix
> page writeback thinko) - the performance is good: 1.1 GB/s (on XFS)
>
> change log for 28.6
> ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.28.6
>
>
> Hopefully this helps to resolve this....

Hopefully it will...
Thanks for tracking this down.  It is certainly easier to work out what
is happening when you have a small patch that makes the difference.

I'll see what I (or others) can discover.

Thanks,
NeilBrown

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: MD write performance issue - found Catalyst patches
  2009-10-18 10:00 MD write performance issue - found Catalyst patches mark delfman
  2009-10-18 22:39 ` NeilBrown
@ 2009-10-29  6:41 ` Neil Brown
  2009-10-29  6:48   ` Thomas Fjellstrom
  2009-10-29  8:08   ` Asdo
  1 sibling, 2 replies; 18+ messages in thread
From: Neil Brown @ 2009-10-29  6:41 UTC (permalink / raw)
  To: mark delfman; +Cc: Mattias Hellström, Linux RAID Mailing List, npiggin

[-- Attachment #1: Type: text/plain; charset=unknown, Size: 2293 bytes --]

On Sunday October 18, markdelfman@googlemail.com wrote:
> We have tracked the performance drop to the attached two commits in
> 2.6.28.6. The performance never fully recovers in later kernels, so
> I am presuming that the change in the write cache is still affecting MD
> today.
>
> The problem for us is that although we have slowly tracked it down, we
> have no understanding of Linux at this level and simply wouldn’t know
> where to go from this point.
>
> Considering this seems to affect only MD and not hardware-based RAID
> (in our tests), I thought that this would be an appropriate place to
> post these patches and findings.
> 
> There are 2 patches which impact MD performance via a filesystem:
> 
> a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
> nr_to_write counter
> b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
> writeback thinko, causing Berkeley DB slowdown
> 

I've had a look at this and asked around and I'm afraid there doesn't
seem to be an easy answer.

The most likely difference between 'before' and 'after' those patches
is that more pages are being written per call to generic_writepages in
the 'before' case.  This would generally improve throughput,
particularly with RAID5 which would get more full stripes.
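
To make the full-stripe argument concrete, here is a rough sketch (an
illustration, not something measured in this thread) that counts how many
complete RAID5 stripes a stripe-aligned, contiguous run of dirty pages
covers. The 4 KiB page size, the chunk size and the member count are
assumptions, and full_stripes is a hypothetical helper rather than a
kernel function.

```c
#include <assert.h>

#define PAGE_BYTES 4096UL /* assumed page size */

/*
 * Hypothetical helper: number of complete stripes covered by a
 * stripe-aligned, contiguous write of `pages` pages on a RAID5 array
 * with `disks` members (one chunk per stripe holds parity) and a chunk
 * size of `chunk_bytes`.
 */
static unsigned long full_stripes(unsigned long pages, unsigned int disks,
				  unsigned long chunk_bytes)
{
	assert(disks >= 3);      /* RAID5 needs at least three members */
	assert(chunk_bytes > 0);

	unsigned long data_bytes_per_stripe = (disks - 1) * chunk_bytes;

	return (pages * PAGE_BYTES) / data_bytes_per_stripe;
}
```

Under those assumptions, with 5 members and 64 KiB chunks, a 16-page run
covers no complete stripe, so parity has to be updated using data read
back from the disks, while an 800-page run covers 12 complete stripes
whose parity can be computed from the new data alone.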

However that is largely a guess as the bugs which were fixed by the
patch could interact in interesting ways with XFS (which decrements
->nr_to_write itself) and it isn't immediately clear to me that more
pages would be written... 

In any case, the 'after' code is clearly correct, so if throughput can
really be increased, the change should be somewhere else.

What might be useful would be to instrument write_cache_pages to count
how many pages were written each time it is called.  You could either
print this number out every time or, if that creates too much noise,
print out an average every 512 calls or similar.
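
The averaging idea could look roughly like the following. This is a
userspace sketch of the bookkeeping only, not a kernel patch: wb_stats
and wb_stats_record are hypothetical names, and inside
write_cache_pages the counters would be static variables and the printf
would be a printk.

```c
#include <assert.h>
#include <stdio.h>

#define REPORT_EVERY 512 /* calls between reports, as suggested above */

/* Hypothetical accumulator for the per-call page counts. */
struct wb_stats {
	unsigned long calls; /* calls since the last report */
	unsigned long pages; /* pages written since the last report */
};

/*
 * Record one call that wrote `pages_written` pages. Returns 1 and stores
 * the mean in *mean once REPORT_EVERY calls have accumulated (the
 * counters are then reset); otherwise returns 0.
 */
static int wb_stats_record(struct wb_stats *s, unsigned long pages_written,
			   unsigned long *mean)
{
	assert(s != NULL && mean != NULL);

	s->calls++;
	s->pages += pages_written;
	if (s->calls < REPORT_EVERY)
		return 0;

	*mean = s->pages / s->calls;
	printf("write_cache_pages: mean=%lu pages over %lu calls\n",
	       *mean, s->calls);
	s->calls = 0;
	s->pages = 0;
	return 1;
}
```

Reporting a mean over a fixed number of calls keeps the log volume
bounded no matter how often the function is invoked.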

Seeing how this differs with and without the patches in question could
help understand what is going on and provide hints for how to fix it.

NeilBrown

* Re: MD write performance issue - found Catalyst patches
  2009-10-29  6:41 ` Neil Brown
@ 2009-10-29  6:48   ` Thomas Fjellstrom
  2009-10-29  7:32     ` Thomas Fjellstrom
  2009-10-29  8:08   ` Asdo
  1 sibling, 1 reply; 18+ messages in thread
From: Thomas Fjellstrom @ 2009-10-29  6:48 UTC (permalink / raw)
  To: Neil Brown
  Cc: mark delfman, Mattias Hellström, Linux RAID Mailing List,
	npiggin

On Thu October 29 2009, Neil Brown wrote:
> On Sunday October 18, markdelfman@googlemail.com wrote:
> > We have tracked the performance drop to the attached two commits in
> > 2.6.28.6. The performance never fully recovers in later kernels, so
> > I am presuming that the change in the write cache is still affecting MD
> > today.
> >
> > The problem for us is that although we have slowly tracked it down, we
> > have no understanding of Linux at this level and simply wouldn’t know
> > where to go from this point.
> >
> > Considering this seems to affect only MD and not hardware-based RAID
> > (in our tests), I thought that this would be an appropriate place to
> > post these patches and findings.
> >
> > There are 2 patches which impact MD performance via a filesystem:
> >
> > a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
> > nr_to_write counter
> > b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
> > writeback thinko, causing Berkeley DB slowdown
> 
> I've had a look at this and asked around and I'm afraid there doesn't
> seem to be an easy answer.
> 
> The most likely difference between 'before' and 'after' those patches
> is that more pages are being written per call to generic_writepages in
> the 'before' case.  This would generally improve throughput,
> particularly with RAID5 which would get more full stripes.
> 
> However that is largely a guess as the bugs which were fixed by the
> patch could interact in interesting ways with XFS (which decrements
> ->nr_to_write itself) and it isn't immediately clear to me that more
> pages would be written...
> 
> In any case, the 'after' code is clearly correct, so if throughput can
> really be increased, the change should be somewhere else.
> 
> What might be useful would be to instrument write_cache_pages to count
> how many pages were written each time it is called.  You could either
> print this number out every time or, if that creates too much noise,
> print out an average every 512 calls or similar.
> 
> Seeing how this differs with and without the patches in question could
> help understand what is going on and provide hints for how to fix it.
>

I don't suppose this causes "bursty" writeout like I've been seeing lately? 
For some reason writes go full speed for a short while and then just stop 
for a short time, which averages out to 2-4x slower than what the array 
should be capable of.

> NeilBrown


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca

* Re: MD write performance issue - found Catalyst patches
  2009-10-29  6:48   ` Thomas Fjellstrom
@ 2009-10-29  7:32     ` Thomas Fjellstrom
  0 siblings, 0 replies; 18+ messages in thread
From: Thomas Fjellstrom @ 2009-10-29  7:32 UTC (permalink / raw)
  To: Neil Brown
  Cc: mark delfman, Mattias Hellström, Linux RAID Mailing List,
	npiggin

On Thu October 29 2009, Thomas Fjellstrom wrote:
> On Thu October 29 2009, Neil Brown wrote:
> > On Sunday October 18, markdelfman@googlemail.com wrote:
> > > We have tracked the performance drop to the attached two commits in
> > > 2.6.28.6. The performance never fully recovers in later kernels, so
> > > I am presuming that the change in the write cache is still affecting MD
> > > today.
> > >
> > > The problem for us is that although we have slowly tracked it down,
> > > we have no understanding of Linux at this level and simply wouldn’t
> > > know where to go from this point.
> > >
> > > Considering this seems to affect only MD and not hardware-based RAID
> > > (in our tests), I thought that this would be an appropriate place to
> > > post these patches and findings.
> > >
> > > There are 2 patches which impact MD performance via a filesystem:
> > >
> > > a) commit 66c85494570396661479ba51e17964b2c82b6f39 - write-back: fix
> > > nr_to_write counter
> > > b) commit fa76ac6cbeb58256cf7de97a75d5d7f838a80b32 - Fix page
> > > writeback thinko, causing Berkeley DB slowdown
> >
> > I've had a look at this and asked around and I'm afraid there doesn't
> > seem to be an easy answer.
> >
> > The most likely difference between 'before' and 'after' those patches
> > is that more pages are being written per call to generic_writepages in
> > the 'before' case.  This would generally improve throughput,
> > particularly with RAID5 which would get more full stripes.
> >
> > However that is largely a guess as the bugs which were fixed by the
> > patch could interact in interesting ways with XFS (which decrements
> > ->nr_to_write itself) and it isn't immediately clear to me that more
> > pages would be written...
> >
> > In any case, the 'after' code is clearly correct, so if throughput can
> > really be increased, the change should be somewhere else.
> >
> > What might be useful would be to instrument write_cache_pages to count
> > how many pages were written each time it is called.  You could either
> > print this number out every time or, if that creates too much noise,
> > print out an average every 512 calls or similar.
> >
> > Seeing how this differs with and without the patches in question could
> > help understand what is going on and provide hints for how to fix it.
> 
> I don't suppose this causes "bursty" writeout like I've been seeing
>  lately? For some reason writes go full speed for a short while and then
>  just stop for a short time, which averages out to 2-4x slower than what
>  the array should be capable of.

At the very least, 2.6.26 doesn't have this issue. Speeds are lower than I
was expecting (350MB/s write, 450MB/s read), but nowhere near as bad as
later kernels, and there is no "bursty" behaviour. Speeds are fairly
constant throughout testing.

> > NeilBrown
> 


-- 
Thomas Fjellstrom
tfjellstrom@shaw.ca

* Re: MD write performance issue - found Catalyst patches
  2009-10-29  6:41 ` Neil Brown
  2009-10-29  6:48   ` Thomas Fjellstrom
@ 2009-10-29  8:08   ` Asdo
  2009-10-31 10:51     ` mark delfman
  1 sibling, 1 reply; 18+ messages in thread
From: Asdo @ 2009-10-29  8:08 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> I've had a look at this and asked around and I'm afraid there doesn't
> seem to be an easy answer.
>
> The most likely difference between 'before' and 'after' those patches
> is that more pages are being written per call to generic_writepages in
> the 'before' case.  This would generally improve throughput,
> particularly with RAID5 which would get more full stripes.
>
> However that is largely a guess as the bugs which were fixed by the
> patch could interact in interesting ways with XFS (which decrements
> ->nr_to_write itself) and it isn't immediately clear to me that more
> pages would be written... 
>
> In any case, the 'after' code is clearly correct, so if throughput can
> really be increased, the change should be somewhere else.
>   
Thank you Neil for looking into this.

How can "writing fewer pages" be more correct than "writing more pages"?
I can see the first as an optimization of the second; however, if this
reduces throughput then the optimization doesn't work...
Isn't it possible to "fix" it so as to write more pages and still be
semantically correct?


Thomas Fjellstrom wrote:
> I don't suppose this causes "bursty" writeout like I've been seeing lately? 
> For some reason writes go full speed for a short while and then just stop 
> for a short time, which averages out to 2-4x slower than what the array 
> should be capable of.
>   
I have definitely seen this bursty behaviour on 2.6.31.

It would be interesting to know what the CPUs are doing or waiting for
during the pauses. But I am not a kernel expert :-( How could one check
this?

Thank you


* Re: MD write performance issue - found Catalyst patches
  2009-10-29  8:08   ` Asdo
@ 2009-10-31 10:51     ` mark delfman
  2009-11-03  4:58       ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: mark delfman @ 2009-10-31 10:51 UTC (permalink / raw)
  To: Asdo; +Cc: Neil Brown, linux-raid

Thank you Neil... if the commits improve the overall stability of
Linux then they are obviously important, and hopefully there is another
way to achieve the same results... as you say, if we can see that there
is an opportunity for a significant performance gain (I think 600 MB/s
extra is significant), it’s maybe worth some thought.

I would very much like to contribute to this and indeed to ongoing
developments, but the only hurdle I have is that I am storage focused,
which means I work solely in storage environments (which is good) but
also means I know very little about Linux and dramatically less about
programming (I have never even touched C, for example). I come from
hardware-based storage into the world of Linux, so I lag behind greatly
in many ways.

I do currently have access to equipment on which we can reach the
performance needed to see the effect of the commits (up to 600 MB/s
difference), but I lack the ability to implement the ideas
which you have suggested.

I am hopeful that you or another member of this group could offer some
advice / patch to implement the print options you suggested... if so I
would happily allocate resources and time to do what I can to help
with this.

I appreciate that this group is generally aimed at those with Linux
experience, but hopefully I can still add some value, whether simply
with test equipment, comparisons or real-life introduction for
feedback etc.

The print options you suggested... are these simple to introduce?
Could someone maybe offer an ABC of how to add this?

On Thu, Oct 29, 2009 at 9:08 AM, Asdo <asdo@shiftmail.org> wrote:
> Neil Brown wrote:
>>
>> I've had a look at this and asked around and I'm afraid there doesn't
>> seem to be an easy answer.
>>
>> The most likely difference between 'before' and 'after' those patches
>> is that more pages are being written per call to generic_writepages in
>> the 'before' case.  This would generally improve throughput,
>> particularly with RAID5 which would get more full stripes.
>>
>> However that is largely a guess as the bugs which were fixed by the
>> patch could interact in interesting ways with XFS (which decrements
>> ->nr_to_write itself) and it isn't immediately clear to me that more
>> pages would be written...
>> In any case, the 'after' code is clearly correct, so if throughput can
>> really be increased, the change should be somewhere else.
>>
>
> Thank you Neil for looking into this.
>
> How can "writing fewer pages" be more correct than "writing more pages"?
> I can see the first as an optimization of the second; however, if this
> reduces throughput then the optimization doesn't work...
> Isn't it possible to "fix" it so as to write more pages and still be
> semantically correct?
>
>
> Thomas Fjellstrom wrote:
>>
>> I don't suppose this causes "bursty" writeout like I've been seeing
>> lately? For some reason writes go full speed for a short while and then just
>> stop for a short time, which averages out to 2-4x slower than what the array
>> should be capable of.
>>
>
> I have definitely seen this bursty behaviour on 2.6.31.
>
> It would be interesting to know what the CPUs are doing or waiting for
> during the pauses. But I am not a kernel expert :-( How could one check this?
>
> Thank you

* Re: MD write performance issue - found Catalyst patches
  2009-10-31 10:51     ` mark delfman
@ 2009-11-03  4:58       ` Neil Brown
  2009-11-03 12:11         ` mark delfman
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2009-11-03  4:58 UTC (permalink / raw)
  To: mark delfman; +Cc: Asdo, linux-raid

On Saturday October 31, markdelfman@googlemail.com wrote:
> 
> I am hopeful that you or another member of this group could offer some
> advice / patch to implement the print options you suggested... if so I
> would happily allocate resources and time to do what I can to help
> with this.


I've spent a little while exploring this.
It appears to very definitely be an XFS problem, interacting in
interesting ways with the VM.

I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
2.6.28.6 using each of xfs and ext2.

ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
xfs gives 86MB/sec on .5 and only 51MB/sec on .6


When write_cache_pages is called it calls 'writepage' some number of
times.  On ext2, writepage will write at most one page.
On xfs writepage will sometimes write multiple pages.
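
As a rough illustration of why a writepage that touches ->nr_to_write
itself matters, the sketch below models only the two loop-exit tests from
the "write-back: fix nr_to_write counter" diff attached to the first
message in this thread, assuming WB_SYNC_NONE. It is a toy model: toy_wbc,
toy_writepage and the `extra` parameter are hypothetical stand-ins, and it
does not attempt to reproduce the measured numbers.

```c
#include <assert.h>

/* Toy stand-in for struct writeback_control: only the shared budget. */
struct toy_wbc {
	long nr_to_write;
};

/*
 * Hypothetical writepage: pretends to write the page it was given plus
 * `extra` neighbouring pages, and charges the neighbours to the shared
 * budget itself, as a clustering filesystem might.
 */
static void toy_writepage(struct toy_wbc *wbc, long extra)
{
	assert(extra >= 0);
	wbc->nr_to_write -= extra;
}

/* Exit test as in the removed ('-') lines: the shared counter decides. */
static long calls_shared_counter(long budget, long dirty, long extra)
{
	struct toy_wbc wbc = { budget };
	long calls = 0;

	while (calls < dirty) {
		toy_writepage(&wbc, extra);
		calls++;
		wbc.nr_to_write--;
		if (wbc.nr_to_write <= 0)
			break;
	}
	return calls;
}

/* Exit test as in the added ('+') lines: a private local counter decides. */
static long calls_local_counter(long budget, long dirty, long extra)
{
	struct toy_wbc wbc = { budget };
	long nr_to_write = wbc.nr_to_write;
	long calls = 0;

	while (calls < dirty) {
		toy_writepage(&wbc, extra);
		calls++;
		if (nr_to_write > 0)
			nr_to_write--;
		else
			break;
	}
	return calls;
}
```

With a budget of 1024 and plenty of dirty pages, a callback that writes
one page per call (extra = 0) gives 1024 calls with the shared counter and
1025 with the local one. A callback that writes 800 pages per call
(extra = 799) stops the shared-counter loop after 2 calls, while the
local-counter loop still makes 1025. The model ignores congestion, the end
of the dirty range, and everything else the real loop checks.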

I created a patch as below that prints (in a fairly cryptic way)
the number of 'writepage' calls and the number of pages that XFS
actually wrote.
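
For anyone collecting these summary lines from the kernel log, a small
parser can turn them back into numbers. The sketch below relies only on
the printk format string in the patch at the end of this message;
wpc_line and parse_wpc_line are hypothetical names, and skipping a log
prefix before the tag is an assumption about how the log is captured.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields of one summary line, named after the printk format string. */
struct wpc_line {
	int sum, cnt, max, mean; /* writepage calls per write_cache_pages call */
	int sum2, max2, mean2;   /* drop in wbc->nr_to_write seen across writepage */
};

/*
 * Parse one log line. Any prefix (such as a timestamp) before the
 * "write_page_cache:" tag is skipped. Returns 1 on success, 0 otherwise.
 */
static int parse_wpc_line(const char *line, struct wpc_line *out)
{
	assert(line != NULL && out != NULL);

	const char *p = strstr(line, "write_page_cache:");
	if (p == NULL)
		return 0;

	int n = sscanf(p,
		       "write_page_cache: sum=%d cnt=%d max=%d mean=%d "
		       "sum2=%d max2=%d mean2=%d",
		       &out->sum, &out->cnt, &out->max, &out->mean,
		       &out->sum2, &out->max2, &out->mean2);
	return n == 7;
}
```

Comparing mean (calls per invocation) against mean2 (budget consumed
inside writepage) is what separates the ext2-style and xfs-style
behaviour described here.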

For ext2, the number of writepage calls is at most 1536 and averages
around 140.

For xfs with .5, there is usually only one call to writepage and it
writes around 800 pages.
For .6 there are about 200 calls to writepage, but they achieve
an average of about 700 pages together.

So as you can see, there is very different behaviour.

I notice a more recent patch in XFS in mainline which looks like a
dirty hack to try to address this problem.

I suggest you try that patch and/or take this to the XFS developers.

NeilBrown



diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 08d2b96..aa4bccc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
 	int cycled;
 	int range_whole = 0;
 	long nr_to_write = wbc->nr_to_write;
+	long hidden_writes = 0;
+	long clear_writes = 0;
 
 	if (wbc->nonblocking && bdi_write_congested(bdi)) {
 		wbc->encountered_congestion = 1;
@@ -961,7 +963,11 @@ continue_unlock:
 			if (!clear_page_dirty_for_io(page))
 				goto continue_unlock;
 
+			{ int orig_nr_to_write = wbc->nr_to_write;
 			ret = (*writepage)(page, wbc, data);
+			hidden_writes += orig_nr_to_write - wbc->nr_to_write;
+			clear_writes ++;
+			}
 			if (unlikely(ret)) {
 				if (ret == AOP_WRITEPAGE_ACTIVATE) {
 					unlock_page(page);
@@ -1008,12 +1014,37 @@ continue_unlock:
 		end = writeback_index - 1;
 		goto retry;
 	}
+
 	if (!wbc->no_nrwrite_index_update) {
 		if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
 			mapping->writeback_index = done_index;
 		wbc->nr_to_write = nr_to_write;
 	}
 
+	{ static int sum, cnt, max;
+	static unsigned long previous;
+	static int sum2, max2;
+	
+	sum += clear_writes;
+	cnt += 1;
+
+	if (max < clear_writes) max = clear_writes;
+
+	sum2 += hidden_writes;
+	if (max2 < hidden_writes) max2 = hidden_writes;
+
+	if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
+		printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
+		       sum, cnt, max, sum/cnt,
+		       sum2, max2, sum2/cnt);
+		sum = 0;
+		cnt = 0;
+		max = 0;
+		max2 = 0;
+		sum2 = 0;
+		previous = jiffies;
+	}
+	}
 	return ret;
 }
 EXPORT_SYMBOL(write_cache_pages);


------------------------------------------------------
From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
From: Eric Sandeen <sandeen@sandeen.net>
Date: Fri, 31 Jul 2009 00:02:17 -0500
Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage

VM calculation for nr_to_write seems off.  Bump it way
up, this gets simple streaming writes zippy again.
To be reviewed again after Jens' writeback changes.

Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Cc: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 7ec89fc..aecf251 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1268,6 +1268,14 @@ xfs_vm_writepage(
 	if (!page_has_buffers(page))
 		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
 
+
+	/*
+	 *  VM calculation for nr_to_write seems off.  Bump it way
+	 *  up, this gets simple streaming writes zippy again.
+	 *  To be reviewed again after Jens' writeback changes.
+	 */
+	wbc->nr_to_write *= 4;
+
 	/*
 	 * Convert delayed allocate, unwritten or unmapped space
 	 * to real space and flush out to disk.
-- 
1.6.4.3



* Re: MD write performance issue - found Catalyst patches
  2009-11-03  4:58       ` Neil Brown
@ 2009-11-03 12:11         ` mark delfman
  2009-11-04 17:15           ` mark delfman
  0 siblings, 1 reply; 18+ messages in thread
From: mark delfman @ 2009-11-03 12:11 UTC (permalink / raw)
  To: Neil Brown; +Cc: Asdo, linux-raid

Thanks Neil,

I seem to recall that I tried this on EXT3 and saw the same results as
XFS, but with your code and suggestions I think it is well worth me
trying some more tests and reporting back....


Mark

On Tue, Nov 3, 2009 at 4:58 AM, Neil Brown <neilb@suse.de> wrote:
> On Saturday October 31, markdelfman@googlemail.com wrote:
>>
>> I am hopeful that you or another member of this group could offer some
>> advice / patch to implement the print options you suggested... if so I
>> would happily allocate resources and time to do what I can to help
>> with this.
>
>
> I've spent a little while exploring this.
> It appears to very definitely be an XFS problem, interacting in
> interesting ways with the VM.
>
> I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
> 2.6.28.6 using each of xfs and ext2.
>
> ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
> xfs gives 86MB/sec on .5 and only 51MB/sec on .6
>
>
> When write_cache_pages is called it calls 'writepage' some number of
> times.  On ext2, writepage will write at most one page.
> On xfs writepage will sometimes write multiple pages.
>
> I created a patch as below that prints (in a fairly cryptic way)
> the number of 'writepage' calls and the number of pages that XFS
> actually wrote.
>
> For ext2, the number of writepage calls is at most 1536 and averages
> around 140
>
> For xfs with .5, there is usually only one call to writepage and it
> writes around 800 pages.
> For .6 there are about 200 calls to writepage, but they achieve
> an average of about 700 pages together.
>
> So as you can see, there is very different behaviour.
>
> I notice a more recent patch in XFS in mainline which looks like a
> dirty hack to try to address this problem.
>
> I suggest you try that patch and/or take this to the XFS developers.
>
> NeilBrown
>
>
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 08d2b96..aa4bccc 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
>        int cycled;
>        int range_whole = 0;
>        long nr_to_write = wbc->nr_to_write;
> +       long hidden_writes = 0;
> +       long clear_writes = 0;
>
>        if (wbc->nonblocking && bdi_write_congested(bdi)) {
>                wbc->encountered_congestion = 1;
> @@ -961,7 +963,11 @@ continue_unlock:
>                        if (!clear_page_dirty_for_io(page))
>                                goto continue_unlock;
>
> +                       { int orig_nr_to_write = wbc->nr_to_write;
>                        ret = (*writepage)(page, wbc, data);
> +                       hidden_writes += orig_nr_to_write - wbc->nr_to_write;
> +                       clear_writes ++;
> +                       }
>                        if (unlikely(ret)) {
>                                if (ret == AOP_WRITEPAGE_ACTIVATE) {
>                                        unlock_page(page);
> @@ -1008,12 +1014,37 @@ continue_unlock:
>                end = writeback_index - 1;
>                goto retry;
>        }
> +
>        if (!wbc->no_nrwrite_index_update) {
>                if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
>                        mapping->writeback_index = done_index;
>                wbc->nr_to_write = nr_to_write;
>        }
>
> +       { static int sum, cnt, max;
> +       static unsigned long previous;
> +       static int sum2, max2;
> +
> +       sum += clear_writes;
> +       cnt += 1;
> +
> +       if (max < clear_writes) max = clear_writes;
> +
> +       sum2 += hidden_writes;
> +       if (max2 < hidden_writes) max2 = hidden_writes;
> +
> +       if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
> +               printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
> +                      sum, cnt, max, sum/cnt,
> +                      sum2, max2, sum2/cnt);
> +               sum = 0;
> +               cnt = 0;
> +               max = 0;
> +               max2 = 0;
> +               sum2 = 0;
> +               previous = jiffies;
> +       }
> +       }
>        return ret;
>  }
>  EXPORT_SYMBOL(write_cache_pages);
>
>
> ------------------------------------------------------
> From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
> From: Eric Sandeen <sandeen@sandeen.net>
> Date: Fri, 31 Jul 2009 00:02:17 -0500
> Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage
>
> VM calculation for nr_to_write seems off.  Bump it way
> up, this gets simple streaming writes zippy again.
> To be reviewed again after Jens' writeback changes.
>
> Signed-off-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
> Cc: Chris Mason <chris.mason@oracle.com>
> Reviewed-by: Felix Blyakher <felixb@sgi.com>
> Signed-off-by: Felix Blyakher <felixb@sgi.com>
> ---
>  fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
> index 7ec89fc..aecf251 100644
> --- a/fs/xfs/linux-2.6/xfs_aops.c
> +++ b/fs/xfs/linux-2.6/xfs_aops.c
> @@ -1268,6 +1268,14 @@ xfs_vm_writepage(
>        if (!page_has_buffers(page))
>                create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>
> +
> +       /*
> +        *  VM calculation for nr_to_write seems off.  Bump it way
> +        *  up, this gets simple streaming writes zippy again.
> +        *  To be reviewed again after Jens' writeback changes.
> +        */
> +       wbc->nr_to_write *= 4;
> +
>        /*
>         * Convert delayed allocate, unwritten or unmapped space
>         * to real space and flush out to disk.
> --
> 1.6.4.3
>
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-03 12:11         ` mark delfman
@ 2009-11-04 17:15           ` mark delfman
  2009-11-04 17:25             ` Asdo
  2009-11-04 19:05             ` Steve Cousins
  0 siblings, 2 replies; 18+ messages in thread
From: mark delfman @ 2009-11-04 17:15 UTC (permalink / raw)
  To: Neil Brown; +Cc: Asdo, linux-raid

[-- Attachment #1: Type: text/plain, Size: 6258 bytes --]

Some FS comparisons attached in pdf

not sure what to make of them as yet, but worth posting


On Tue, Nov 3, 2009 at 12:11 PM, mark delfman
<markdelfman@googlemail.com> wrote:
> Thanks Neil,
>
> I seem to recall that I tried this on EXT3 and saw the same results as
> XFS, but with your code and suggestions I think it is well worth me
> trying some more tests and reporting back....
>
>
> Mark
>
> On Tue, Nov 3, 2009 at 4:58 AM, Neil Brown <neilb@suse.de> wrote:
>> On Saturday October 31, markdelfman@googlemail.com wrote:
>>>
>>> I am hopeful that you or another member of this group could offer some
>>> advice / patch to implement the print options you suggested... if so i
>>> would happily allocate resource and time to do what I can to help
>>> with this.
>>
>>
>> I've spent a little while exploring this.
>> It appears to very definitely be an XFS problem, interacting in
>> interesting ways with the VM.
>>
>> I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
>> 2.6.28.6 using each of xfs and ext2.
>>
>> ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
>> xfs gives 86MB/sec on .5 and only 51MB/sec on .6
>>
>>
>> When write_cache_pages is called it calls 'writepage' some number of
>> times.  On ext2, writepage will write at most one page.
>> On xfs writepage will sometimes write multiple pages.
>>
>> I created a patch as below that prints (in a fairly cryptic way)
>> the number of 'writepage' calls and the number of pages that XFS
>> actually wrote.
>>
>> For ext2, the number of writepage calls is at most 1536 and averages
>> around 140
>>
>> For xfs with .5, there is usually only one call to writepage and it
>> writes around 800 pages.
>> For .6 there are about 200 calls to writepage, but they achieve
>> an average of about 700 pages together.
>>
>> So as you can see, there is very different behaviour.
>>
>> I notice a more recent patch in XFS in mainline which looks like a
>> dirty hack to try to address this problem.
>>
>> I suggest you try that patch and/or take this to the XFS developers.
>>
>> NeilBrown
>>
>>
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 08d2b96..aa4bccc 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
>>        int cycled;
>>        int range_whole = 0;
>>        long nr_to_write = wbc->nr_to_write;
>> +       long hidden_writes = 0;
>> +       long clear_writes = 0;
>>
>>        if (wbc->nonblocking && bdi_write_congested(bdi)) {
>>                wbc->encountered_congestion = 1;
>> @@ -961,7 +963,11 @@ continue_unlock:
>>                        if (!clear_page_dirty_for_io(page))
>>                                goto continue_unlock;
>>
>> +                       { int orig_nr_to_write = wbc->nr_to_write;
>>                        ret = (*writepage)(page, wbc, data);
>> +                       hidden_writes += orig_nr_to_write - wbc->nr_to_write;
>> +                       clear_writes ++;
>> +                       }
>>                        if (unlikely(ret)) {
>>                                if (ret == AOP_WRITEPAGE_ACTIVATE) {
>>                                        unlock_page(page);
>> @@ -1008,12 +1014,37 @@ continue_unlock:
>>                end = writeback_index - 1;
>>                goto retry;
>>        }
>> +
>>        if (!wbc->no_nrwrite_index_update) {
>>                if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
>>                        mapping->writeback_index = done_index;
>>                wbc->nr_to_write = nr_to_write;
>>        }
>>
>> +       { static int sum, cnt, max;
>> +       static unsigned long previous;
>> +       static int sum2, max2;
>> +
>> +       sum += clear_writes;
>> +       cnt += 1;
>> +
>> +       if (max < clear_writes) max = clear_writes;
>> +
>> +       sum2 += hidden_writes;
>> +       if (max2 < hidden_writes) max2 = hidden_writes;
>> +
>> +       if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
>> +               printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
>> +                      sum, cnt, max, sum/cnt,
>> +                      sum2, max2, sum2/cnt);
>> +               sum = 0;
>> +               cnt = 0;
>> +               max = 0;
>> +               max2 = 0;
>> +               sum2 = 0;
>> +               previous = jiffies;
>> +       }
>> +       }
>>        return ret;
>>  }
>>  EXPORT_SYMBOL(write_cache_pages);
>>
>>
>> ------------------------------------------------------
>> From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
>> From: Eric Sandeen <sandeen@sandeen.net>
>> Date: Fri, 31 Jul 2009 00:02:17 -0500
>> Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage
>>
>> VM calculation for nr_to_write seems off.  Bump it way
>> up, this gets simple streaming writes zippy again.
>> To be reviewed again after Jens' writeback changes.
>>
>> Signed-off-by: Christoph Hellwig <hch@infradead.org>
>> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
>> Cc: Chris Mason <chris.mason@oracle.com>
>> Reviewed-by: Felix Blyakher <felixb@sgi.com>
>> Signed-off-by: Felix Blyakher <felixb@sgi.com>
>> ---
>>  fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
>>  1 files changed, 8 insertions(+), 0 deletions(-)
>>
>> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>> index 7ec89fc..aecf251 100644
>> --- a/fs/xfs/linux-2.6/xfs_aops.c
>> +++ b/fs/xfs/linux-2.6/xfs_aops.c
>> @@ -1268,6 +1268,14 @@ xfs_vm_writepage(
>>        if (!page_has_buffers(page))
>>                create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>>
>> +
>> +       /*
>> +        *  VM calculation for nr_to_write seems off.  Bump it way
>> +        *  up, this gets simple streaming writes zippy again.
>> +        *  To be reviewed again after Jens' writeback changes.
>> +        */
>> +       wbc->nr_to_write *= 4;
>> +
>>        /*
>>         * Convert delayed allocate, unwritten or unmapped space
>>         * to real space and flush out to disk.
>> --
>> 1.6.4.3
>>
>>
>

[-- Attachment #2: FS test.pdf --]
[-- Type: application/pdf, Size: 53707 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-04 17:15           ` mark delfman
@ 2009-11-04 17:25             ` Asdo
       [not found]               ` <66781b10911050904m407d14d6t7d3bec12578d6500@mail.gmail.com>
  2009-11-04 19:05             ` Steve Cousins
  1 sibling, 1 reply; 18+ messages in thread
From: Asdo @ 2009-11-04 17:25 UTC (permalink / raw)
  To: mark delfman; +Cc: Neil Brown, linux-raid

Hey, great job Neil and Mark.
Mark, your benchmarks seem to confirm Neil's analysis: ext2 and ext3 
are not slowed down between 2.6.28.5 and 2.6.28.6.
Mark, why don't you try applying the patch below by Eric Sandeen 
(found by Neil) to 2.6.28.6 to see if the xfs write performance comes 
back?
Thank you for your efforts
Asdo

mark delfman wrote:
> Some FS comparisons attached in pdf
>
> not sure what to make of them as yet, but worth posting
>
>
> On Tue, Nov 3, 2009 at 12:11 PM, mark delfman
> <markdelfman@googlemail.com> wrote:
>   
>> Thanks Neil,
>>
>> I seem to recall that I tried this on EXT3 and saw the same results as
>> XFS, but with your code and suggestions I think it is well worth me
>> trying some more tests and reporting back....
>>
>>
>> Mark
>>
>> On Tue, Nov 3, 2009 at 4:58 AM, Neil Brown <neilb@suse.de> wrote:
>>     
>>> On Saturday October 31, markdelfman@googlemail.com wrote:
>>>       
>>>> I am hopeful that you or another member of this group could offer some
>>>> advice / patch to implement the print options you suggested... if so i
>>>> would happily allocate resource and time to do what I can to help
>>>> with this.
>>>>         
>>> I've spent a little while exploring this.
>>> It appears to very definitely be an XFS problem, interacting in
>>> interesting ways with the VM.
>>>
>>> I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
>>> 2.6.28.6 using each of xfs and ext2.
>>>
>>> ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
>>> xfs gives 86MB/sec on .5 and only 51MB/sec on .6
>>>
>>>
>>> When write_cache_pages is called it calls 'writepage' some number of
>>> times.  On ext2, writepage will write at most one page.
>>> On xfs writepage will sometimes write multiple pages.
>>>
>>> I created a patch as below that prints (in a fairly cryptic way)
>>> the number of 'writepage' calls and the number of pages that XFS
>>> actually wrote.
>>>
>>> For ext2, the number of writepage calls is at most 1536 and averages
>>> around 140
>>>
>>> For xfs with .5, there is usually only one call to writepage and it
>>> writes around 800 pages.
>>> For .6 there are about 200 calls to writepage, but they achieve
>>> an average of about 700 pages together.
>>>
>>> So as you can see, there is very different behaviour.
>>>
>>> I notice a more recent patch in XFS in mainline which looks like a
>>> dirty hack to try to address this problem.
>>>
>>> I suggest you try that patch and/or take this to the XFS developers.
>>>
>>> NeilBrown
>>>
>>>
>>>
>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>> index 08d2b96..aa4bccc 100644
>>> --- a/mm/page-writeback.c
>>> +++ b/mm/page-writeback.c
>>> @@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
>>>        int cycled;
>>>        int range_whole = 0;
>>>        long nr_to_write = wbc->nr_to_write;
>>> +       long hidden_writes = 0;
>>> +       long clear_writes = 0;
>>>
>>>        if (wbc->nonblocking && bdi_write_congested(bdi)) {
>>>                wbc->encountered_congestion = 1;
>>> @@ -961,7 +963,11 @@ continue_unlock:
>>>                        if (!clear_page_dirty_for_io(page))
>>>                                goto continue_unlock;
>>>
>>> +                       { int orig_nr_to_write = wbc->nr_to_write;
>>>                        ret = (*writepage)(page, wbc, data);
>>> +                       hidden_writes += orig_nr_to_write - wbc->nr_to_write;
>>> +                       clear_writes ++;
>>> +                       }
>>>                        if (unlikely(ret)) {
>>>                                if (ret == AOP_WRITEPAGE_ACTIVATE) {
>>>                                        unlock_page(page);
>>> @@ -1008,12 +1014,37 @@ continue_unlock:
>>>                end = writeback_index - 1;
>>>                goto retry;
>>>        }
>>> +
>>>        if (!wbc->no_nrwrite_index_update) {
>>>                if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
>>>                        mapping->writeback_index = done_index;
>>>                wbc->nr_to_write = nr_to_write;
>>>        }
>>>
>>> +       { static int sum, cnt, max;
>>> +       static unsigned long previous;
>>> +       static int sum2, max2;
>>> +
>>> +       sum += clear_writes;
>>> +       cnt += 1;
>>> +
>>> +       if (max < clear_writes) max = clear_writes;
>>> +
>>> +       sum2 += hidden_writes;
>>> +       if (max2 < hidden_writes) max2 = hidden_writes;
>>> +
>>> +       if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
>>> +               printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
>>> +                      sum, cnt, max, sum/cnt,
>>> +                      sum2, max2, sum2/cnt);
>>> +               sum = 0;
>>> +               cnt = 0;
>>> +               max = 0;
>>> +               max2 = 0;
>>> +               sum2 = 0;
>>> +               previous = jiffies;
>>> +       }
>>> +       }
>>>        return ret;
>>>  }
>>>  EXPORT_SYMBOL(write_cache_pages);
>>>
>>>
>>> ------------------------------------------------------
>>> From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
>>> From: Eric Sandeen <sandeen@sandeen.net>
>>> Date: Fri, 31 Jul 2009 00:02:17 -0500
>>> Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage
>>>
>>> VM calculation for nr_to_write seems off.  Bump it way
>>> up, this gets simple streaming writes zippy again.
>>> To be reviewed again after Jens' writeback changes.
>>>
>>> Signed-off-by: Christoph Hellwig <hch@infradead.org>
>>> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
>>> Cc: Chris Mason <chris.mason@oracle.com>
>>> Reviewed-by: Felix Blyakher <felixb@sgi.com>
>>> Signed-off-by: Felix Blyakher <felixb@sgi.com>
>>> ---
>>>  fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
>>>  1 files changed, 8 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>>> index 7ec89fc..aecf251 100644
>>> --- a/fs/xfs/linux-2.6/xfs_aops.c
>>> +++ b/fs/xfs/linux-2.6/xfs_aops.c
>>> @@ -1268,6 +1268,14 @@ xfs_vm_writepage(
>>>        if (!page_has_buffers(page))
>>>                create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>>>
>>> +
>>> +       /*
>>> +        *  VM calculation for nr_to_write seems off.  Bump it way
>>> +        *  up, this gets simple streaming writes zippy again.
>>> +        *  To be reviewed again after Jens' writeback changes.
>>> +        */
>>> +       wbc->nr_to_write *= 4;
>>> +
>>>        /*
>>>         * Convert delayed allocate, unwritten or unmapped space
>>>         * to real space and flush out to disk.
>>> --
>>> 1.6.4.3
>>>
>>>
>>>       


^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <66781b10911050904m407d14d6t7d3bec12578d6500@mail.gmail.com>]

* Re: MD write performance issue - found Catalyst patches
       [not found]               ` <66781b10911050904m407d14d6t7d3bec12578d6500@mail.gmail.com>
@ 2009-11-05 19:09                 ` Asdo
  2009-11-06  4:52                   ` Neil Brown
  2009-11-06 15:51                   ` mark delfman
  0 siblings, 2 replies; 18+ messages in thread
From: Asdo @ 2009-11-05 19:09 UTC (permalink / raw)
  To: mark delfman; +Cc: Neil Brown, linux-raid

Great!
So the dirty hack pumped up to x16 really does work! (while we wait for 
Jens, as written in the patch: "To be reviewed again after Jens' 
writeback changes.") Thanks for having tried up to x32.
Still, RAID-6 xfs write is not yet up to the old speed... maybe the old 
code was better at filling RAID stripes exactly, who knows.
Mark, yep, personally I would be very interested in seeing how 
2.6.31 performs on your hardware so I can e.g. see exactly how much my 
3ware 9650 controllers suck... (so please also try vanilla 2.6.31, which I 
think has an integrated x4 hack; please do not just try with x16)
We might also be interested in 2.6.32 performance if you have time, 
also because 2.6.32 includes the fixes for the CPU lockups in big arrays 
during resyncs which were reported on this list, and this is a good 
incentive for upgrading (Neil, btw, is there any chance those lockup 
fixes get backported to mainstream 2.6.31.x?).
Thank you!
Asdo


mark delfman wrote:
> Hi Gents,
>
> Attached is the result of some testing with the XFS patch... as we can
> see it does make a reasonable difference!  Changing the value from
> 4,16,32 shows 16 is a good level...
>
> Is this a 'safe' patch at 16?
>
> I think that maybe there is still some performance to be gained,
> especially in the R6 configs which is where most would be interested i
> suspect.. but its a great start!
>
>
> I think that i should jump up to maybe .31 and see how this reacts.....
>
> Neil, i applied your writepage patch and have outputs if these are of
> interest...
>
> Thank you for the help with the pacthing and linux!!!!
>
>
> mark
>
>
>
> On Wed, Nov 4, 2009 at 5:25 PM, Asdo <asdo@shiftmail.org> wrote:
>   
>> Hey great job Neil and Mark
>> Mark, your benchmarks seems to confirm Neil's analysis: ext2 and ext3 are
>> not slowed down from 2.6.28.5 and 2.6.28.6
>> Mark why don't you try to apply the patch below here by Eric Sandeen found
>> by Neil to the 2.6.28.6 to see if the xfs write performance comes back?
>> Thank you for your efforts
>> Asdo
>>
>> mark delfman wrote:
>>     
>>> Some FS comparisons attached in pdf
>>>
>>> not sure what to make of them as yet, but worth posting
>>>
>>>
>>> On Tue, Nov 3, 2009 at 12:11 PM, mark delfman
>>> <markdelfman@googlemail.com> wrote:
>>>
>>>       
>>>> Thanks Neil,
>>>>
>>>> I seem to recall that I tried this on EXT3 and saw the same results as
>>>> XFS, but with your code and suggestions I think it is well worth me
>>>> trying some more tests and reporting back....
>>>>
>>>>
>>>> Mark
>>>>
>>>> On Tue, Nov 3, 2009 at 4:58 AM, Neil Brown <neilb@suse.de> wrote:
>>>>
>>>>         
>>>>> On Saturday October 31, markdelfman@googlemail.com wrote:
>>>>>
>>>>>           
>>>>>> I am hopeful that you or another member of this group could offer some
>>>>>> advice / patch to implement the print options you suggested... if so i
>>>>>> would happily allocate resource and time to do what I can to help
>>>>>> with this.
>>>>>>
>>>>>>             
>>>>> I've spent a little while exploring this.
>>>>> It appears to very definitely be an XFS problem, interacting in
>>>>> interesting ways with the VM.
>>>>>
>>>>> I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
>>>>> 2.6.28.6 using each of xfs and ext2.
>>>>>
>>>>> ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
>>>>> xfs gives 86MB/sec on .5 and only 51MB/sec on .6
>>>>>
>>>>>
>>>>> When write_cache_pages is called it calls 'writepage' some number of
>>>>> times.  On ext2, writepage will write at most one page.
>>>>> On xfs writepage will sometimes write multiple pages.
>>>>>
>>>>> I created a patch as below that prints (in a fairly cryptic way)
>>>>> the number of 'writepage' calls and the number of pages that XFS
>>>>> actually wrote.
>>>>>
>>>>> For ext2, the number of writepage calls is at most 1536 and averages
>>>>> around 140
>>>>>
>>>>> For xfs with .5, there is usually only one call to writepage and it
>>>>> writes around 800 pages.
>>>>> For .6 there are about 200 calls to writepage, but they achieve
>>>>> an average of about 700 pages together.
>>>>>
>>>>> So as you can see, there is very different behaviour.
>>>>>
>>>>> I notice a more recent patch in XFS in mainline which looks like a
>>>>> dirty hack to try to address this problem.
>>>>>
>>>>> I suggest you try that patch and/or take this to the XFS developers.
>>>>>
>>>>> NeilBrown
>>>>>
>>>>>
>>>>>
>>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>>> index 08d2b96..aa4bccc 100644
>>>>> --- a/mm/page-writeback.c
>>>>> +++ b/mm/page-writeback.c
>>>>> @@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
>>>>>       int cycled;
>>>>>       int range_whole = 0;
>>>>>       long nr_to_write = wbc->nr_to_write;
>>>>> +       long hidden_writes = 0;
>>>>> +       long clear_writes = 0;
>>>>>
>>>>>       if (wbc->nonblocking && bdi_write_congested(bdi)) {
>>>>>               wbc->encountered_congestion = 1;
>>>>> @@ -961,7 +963,11 @@ continue_unlock:
>>>>>                       if (!clear_page_dirty_for_io(page))
>>>>>                               goto continue_unlock;
>>>>>
>>>>> +                       { int orig_nr_to_write = wbc->nr_to_write;
>>>>>                       ret = (*writepage)(page, wbc, data);
>>>>> +                       hidden_writes += orig_nr_to_write - wbc->nr_to_write;
>>>>> +                       clear_writes ++;
>>>>> +                       }
>>>>>                       if (unlikely(ret)) {
>>>>>                               if (ret == AOP_WRITEPAGE_ACTIVATE) {
>>>>>                                       unlock_page(page);
>>>>> @@ -1008,12 +1014,37 @@ continue_unlock:
>>>>>               end = writeback_index - 1;
>>>>>               goto retry;
>>>>>       }
>>>>> +
>>>>>       if (!wbc->no_nrwrite_index_update) {
>>>>>               if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
>>>>>                       mapping->writeback_index = done_index;
>>>>>               wbc->nr_to_write = nr_to_write;
>>>>>       }
>>>>>
>>>>> +       { static int sum, cnt, max;
>>>>> +       static unsigned long previous;
>>>>> +       static int sum2, max2;
>>>>> +
>>>>> +       sum += clear_writes;
>>>>> +       cnt += 1;
>>>>> +
>>>>> +       if (max < clear_writes) max = clear_writes;
>>>>> +
>>>>> +       sum2 += hidden_writes;
>>>>> +       if (max2 < hidden_writes) max2 = hidden_writes;
>>>>> +
>>>>> +       if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
>>>>> +               printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
>>>>> +                      sum, cnt, max, sum/cnt,
>>>>> +                      sum2, max2, sum2/cnt);
>>>>> +               sum = 0;
>>>>> +               cnt = 0;
>>>>> +               max = 0;
>>>>> +               max2 = 0;
>>>>> +               sum2 = 0;
>>>>> +               previous = jiffies;
>>>>> +       }
>>>>> +       }
>>>>>       return ret;
>>>>>  }
>>>>>  EXPORT_SYMBOL(write_cache_pages);
>>>>>
>>>>>
>>>>> ------------------------------------------------------
>>>>> From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
>>>>> From: Eric Sandeen <sandeen@sandeen.net>
>>>>> Date: Fri, 31 Jul 2009 00:02:17 -0500
>>>>> Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage
>>>>>
>>>>> VM calculation for nr_to_write seems off.  Bump it way
>>>>> up, this gets simple streaming writes zippy again.
>>>>> To be reviewed again after Jens' writeback changes.
>>>>>
>>>>> Signed-off-by: Christoph Hellwig <hch@infradead.org>
>>>>> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
>>>>> Cc: Chris Mason <chris.mason@oracle.com>
>>>>> Reviewed-by: Felix Blyakher <felixb@sgi.com>
>>>>> Signed-off-by: Felix Blyakher <felixb@sgi.com>
>>>>> ---
>>>>>  fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
>>>>>  1 files changed, 8 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>>>>> index 7ec89fc..aecf251 100644
>>>>> --- a/fs/xfs/linux-2.6/xfs_aops.c
>>>>> +++ b/fs/xfs/linux-2.6/xfs_aops.c
>>>>> @@ -1268,6 +1268,14 @@ xfs_vm_writepage(
>>>>>       if (!page_has_buffers(page))
>>>>>               create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>>>>>
>>>>> +
>>>>> +       /*
>>>>> +        *  VM calculation for nr_to_write seems off.  Bump it way
>>>>> +        *  up, this gets simple streaming writes zippy again.
>>>>> +        *  To be reviewed again after Jens' writeback changes.
>>>>> +        */
>>>>> +       wbc->nr_to_write *= 4;
>>>>> +
>>>>>       /*
>>>>>        * Convert delayed allocate, unwritten or unmapped space
>>>>>        * to real space and flush out to disk.
>>>>> --
>>>>> 1.6.4.3
>>>>>
>>>>>
>>>>>
>>>>>           
>>     


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-05 19:09                 ` Asdo
@ 2009-11-06  4:52                   ` Neil Brown
  2009-11-06 10:28                     ` Asdo
  2009-11-06 15:51                   ` mark delfman
  1 sibling, 1 reply; 18+ messages in thread
From: Neil Brown @ 2009-11-06  4:52 UTC (permalink / raw)
  To: Asdo; +Cc: mark delfman, linux-raid

On Thursday November 5, asdo@shiftmail.org wrote:
> incentive for upgrading (Neil, btw, is there any chance those lockups 
> fixes get backported to mainstream 2.6.31.x?).

That would be up to the XFS developers.  I suggest you consider asking
them.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-06  4:52                   ` Neil Brown
@ 2009-11-06 10:28                     ` Asdo
  2009-11-06 10:51                       ` Neil F Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Asdo @ 2009-11-06 10:28 UTC (permalink / raw)
  To: Neil Brown; +Cc: mark delfman, linux-raid

Neil Brown wrote:
> On Thursday November 5, asdo@shiftmail.org wrote:
>   
>> incentive for upgrading (Neil, btw, is there any chance those lockups 
>> fixes get backported to mainstream 2.6.31.x?).    
>
> That would be up to the XFS developers.  I suggest you consider asking
> them.
>   
Hi Neil, no sorry I meant the patches for md raid lockups like this one:
http://neil.brown.name/git?p=md;a=commitdiff;h=1d9d52416c0445019ccc1f0fddb9a227456eb61b
and those for raid 5 and 6, for which I don't know the link...
Hm, actually I don't see them applied even to mainstream 2.6.32 yet :-(
http://git.kernel.org/?p=linux/kernel/git/djbw/md.git;a=blob_plain;f=drivers/md/raid1.c;hb=2fdc246aaf9a7fa088451ad2a72e9119b5f7f029
am I correct?
The bug can be serious imho, depending on the hardware: when I saw it on 
my hardware, all disk accesses were completely starved and it was 
even impossible to log in until the resync finished. It can actually be 
worked around by reducing the maximum resync speed, but only if 
the user knows the trick...
Thank you

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-06 10:28                     ` Asdo
@ 2009-11-06 10:51                       ` Neil F Brown
  0 siblings, 0 replies; 18+ messages in thread
From: Neil F Brown @ 2009-11-06 10:51 UTC (permalink / raw)
  To: Asdo; +Cc: mark delfman, linux-raid

On Fri, November 6, 2009 9:28 pm, Asdo wrote:
> Neil Brown wrote:
>> On Thursday November 5, asdo@shiftmail.org wrote:
>>
>>> incentive for upgrading (Neil, btw, is there any chance those lockups
>>> fixes get backported to mainstream 2.6.31.x?).
>>
>> That would be up to the XFS developers.  I suggest you consider asking
>> them.
>>
> Hi Neil, no sorry I meant the patches for md raid lockups like this one:
> http://neil.brown.name/git?p=md;a=commitdiff;h=1d9d52416c0445019ccc1f0fddb9a227456eb61b
> and those for raid 5,6 for which i don't know the link...
> Hm actually I don't see them applied to even mainstream 2.6.32 yet :-(
> http://git.kernel.org/?p=linux/kernel/git/djbw/md.git;a=blob_plain;f=drivers/md/raid1.c;hb=2fdc246aaf9a7fa088451ad2a72e9119b5f7f029
> am I correct?
> The bug can be serious imho depending on the hardware: when I saw it on
> my hardware all disk accesses were completely starved forever and it was
> even impossible to log-in until the resync finished. It can actually be
> worked around by reducing the maximum resync speed, but this is only if
> the user knows the trick...
> Thank you

Those patches are in 2.6.32-rc:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1d9d52416c0445019ccc1f0fddb9a227456eb61b

however I haven't submitted them for -stable.  Maybe I should...

Thanks.
NeilBrown


>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: MD write performance issue - found Catalyst patches
  2009-11-05 19:09                 ` Asdo
  2009-11-06  4:52                   ` Neil Brown
@ 2009-11-06 15:51                   ` mark delfman
  1 sibling, 0 replies; 18+ messages in thread
From: mark delfman @ 2009-11-06 15:51 UTC (permalink / raw)
  To: Asdo; +Cc: Neil Brown, linux-raid

[-- Attachment #1: Type: text/plain, Size: 9944 bytes --]

Attached are the later kernel results. There is not an awful lot of
difference (apart from the native results, because 28.6 doesn't have the
patch included). 2.6.32-rc6 is certainly up to 10% faster on R6.

Note that we are running around 10 tests on each; this is a low number for
averages, and the results move around by 100 MB/s or more, but in this case
we did not need to be overly accurate. They show maybe a 20% reduction
when writing to the FS as opposed to directly to MD.
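
As a rough illustration of how loose a 10-run average is, here is a minimal sketch that summarises a set of runs and works out the relative drop between two averages. The function names are invented for the example, and this is not the tooling used for the attached charts.

```python
from statistics import mean


def summarise(runs_mb_s):
    """Return (mean, spread) for a list of throughput samples in MB/s.

    Spread is simply max - min, which is about all a ~10-run sample
    can reasonably support.
    """
    if not runs_mb_s:
        raise ValueError("need at least one run")
    return mean(runs_mb_s), max(runs_mb_s) - min(runs_mb_s)


def reduction_percent(baseline_mb_s, measured_mb_s):
    """Relative drop of `measured` against `baseline`, in percent."""
    if baseline_mb_s <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline_mb_s - measured_mb_s) / baseline_mb_s
```

Using the 4-drive raid6 figures Neil quotes further down this message, reduction_percent(86, 51) comes out at a little over 40% for xfs going from .5 to .6, whereas ext2 (65 to 66 MB/s) moves by under 2% in the other direction. A spread of 100 MB/s on a result near 1 GB/s is already about 10%, so differences of that size between kernels are within the noise.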

Meanwhile, reads on XFS are now 20% 'faster' than on the raw device
(reaching 2 GB/s); the 2.6.3x kernels seem better at read caching on XFS.
I have only graphed the writes.

Mark



On Thu, Nov 5, 2009 at 7:09 PM, Asdo <asdo@shiftmail.org> wrote:
> Great!
> So the dirty hack pumped at x16 does really work! (while we wait for Jens,
> as written in the patch: "To be reviewed again after Jens' writeback
> changes.") Thanks for having tried up to x32.
> Still Raid-6 xfs write is not yet up to the old speed... maybe the old code
> was better at filling RAID stripes exactly, who knows.
> Mark, yep, personally I would be very interested in seeing how 2.6.31
> performs on your hardware so I can e.g. see exactly how much my 3ware 9650
> controllers suck... (so also pls try vanilla 2.6.31 which I think has an
> integrated x4 hack, do not just try with x16 please)
> We might also be interested in 2.6.32 performances if you have time, also
> because 2.6.32 includes the fixes for the CPU lockups in big arrays during
> resyncs which was reported on this list, and this is a good incentive for
> upgrading (Neil, btw, is there any chance those lockups fixes get backported
> to mainstream 2.6.31.x?).
> Thank you!
> Asdo
>
>
> mark delfman wrote:
>>
>> Hi Gents,
>>
>> Attached is the result of some testing with the XFS patch... as we can
>> see it does make a reasonable difference!  Changing the value from
>> 4,16,32 shows 16 is a good level...
>>
>> Is this a 'safe' patch at 16?
>>
>> I think that maybe there is still some performance to be gained,
>> especially in the R6 configs, which is where most would be interested, I
>> suspect... but it's a great start!
>>
>>
>> I think that i should jump up to maybe .31 and see how this reacts.....
>>
>> Neil, i applied your writepage patch and have outputs if these are of
>> interest...
>>
>> Thank you for the help with the patching and linux!!!!
>>
>>
>> mark
>>
>>
>>
>> On Wed, Nov 4, 2009 at 5:25 PM, Asdo <asdo@shiftmail.org> wrote:
>>
>>>
>>> Hey great job Neil and Mark
>>> Mark, your benchmarks seem to confirm Neil's analysis: ext2 and ext3 are
>>> not slowed down between 2.6.28.5 and 2.6.28.6
>>> Mark why don't you try to apply the patch below here by Eric Sandeen, found
>>> by Neil, to the 2.6.28.6 to see if the xfs write performance comes back?
>>> Thank you for your efforts
>>> Asdo
>>>
>>> mark delfman wrote:
>>>
>>>>
>>>> Some FS comparisons attached in pdf
>>>>
>>>> not sure what to make of them as yet, but worth posting
>>>>
>>>>
>>>> On Tue, Nov 3, 2009 at 12:11 PM, mark delfman
>>>> <markdelfman@googlemail.com> wrote:
>>>>
>>>>
>>>>>
>>>>> Thanks Neil,
>>>>>
>>>>> I seem to recall that I tried this on EXT3 and saw the same results as
>>>>> XFS, but with your code and suggestions I think it is well worth me
>>>>> trying some more tests and reporting back....
>>>>>
>>>>>
>>>>> Mark
>>>>>
>>>>> On Tue, Nov 3, 2009 at 4:58 AM, Neil Brown <neilb@suse.de> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> On Saturday October 31, markdelfman@googlemail.com wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I am hopeful that you or another member of this group could offer some
>>>>>>> advice / patch to implement the print options you suggested... if so I
>>>>>>> would happily allocate resource and time to do what I can to help
>>>>>>> with this.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I've spent a little while exploring this.
>>>>>> It appears to very definitely be an XFS problem, interacting in
>>>>>> interesting ways with the VM.
>>>>>>
>>>>>> I built a 4-drive raid6 and did some simple testing on 2.6.28.5 and
>>>>>> 2.6.28.6 using each of xfs and ext2.
>>>>>>
>>>>>> ext2 gives write throughput of 65MB/sec on .5 and 66MB/sec on .6
>>>>>> xfs gives 86MB/sec on .5 and only 51MB/sec on .6
>>>>>>
>>>>>>
>>>>>> When write_cache_pages is called it calls 'writepage' some number of
>>>>>> times.  On ext2, writepage will write at most one page.
>>>>>> On xfs writepage will sometimes write multiple pages.
>>>>>>
>>>>>> I created a patch as below that prints (in a fairly cryptic way)
>>>>>> the number of 'writepage' calls and the number of pages that XFS
>>>>>> actually wrote.
>>>>>>
>>>>>> For ext2, the number of writepage calls is at most 1536 and averages
>>>>>> around 140
>>>>>>
>>>>>> For xfs with .5, there is usually only one call to writepage and it
>>>>>> writes around 800 pages.
>>>>>> For .6 there are about 200 calls to writepage but they achieve
>>>>>> an average of about 700 pages together.
>>>>>>
>>>>>> So as you can see, there is very different behaviour.
>>>>>>
>>>>>> I notice a more recent patch in XFS in mainline which looks like a
>>>>>> dirty hack to try to address this problem.
>>>>>>
>>>>>> I suggest you try that patch and/or take this to the XFS developers.
>>>>>>
>>>>>> NeilBrown
>>>>>>
>>>>>>
>>>>>>
>>>>>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>>>>>> index 08d2b96..aa4bccc 100644
>>>>>> --- a/mm/page-writeback.c
>>>>>> +++ b/mm/page-writeback.c
>>>>>> @@ -875,6 +875,8 @@ int write_cache_pages(struct address_space *mapping,
>>>>>>      int cycled;
>>>>>>      int range_whole = 0;
>>>>>>      long nr_to_write = wbc->nr_to_write;
>>>>>> +       long hidden_writes = 0;
>>>>>> +       long clear_writes = 0;
>>>>>>
>>>>>>      if (wbc->nonblocking && bdi_write_congested(bdi)) {
>>>>>>              wbc->encountered_congestion = 1;
>>>>>> @@ -961,7 +963,11 @@ continue_unlock:
>>>>>>                      if (!clear_page_dirty_for_io(page))
>>>>>>                              goto continue_unlock;
>>>>>>
>>>>>> +                       { int orig_nr_to_write = wbc->nr_to_write;
>>>>>>                      ret = (*writepage)(page, wbc, data);
>>>>>> +                       hidden_writes += orig_nr_to_write - wbc->nr_to_write;
>>>>>> +                       clear_writes ++;
>>>>>> +                       }
>>>>>>                      if (unlikely(ret)) {
>>>>>>                              if (ret == AOP_WRITEPAGE_ACTIVATE) {
>>>>>>                                      unlock_page(page);
>>>>>> @@ -1008,12 +1014,37 @@ continue_unlock:
>>>>>>              end = writeback_index - 1;
>>>>>>              goto retry;
>>>>>>      }
>>>>>> +
>>>>>>      if (!wbc->no_nrwrite_index_update) {
>>>>>>              if (wbc->range_cyclic || (range_whole && nr_to_write > 0))
>>>>>>                      mapping->writeback_index = done_index;
>>>>>>              wbc->nr_to_write = nr_to_write;
>>>>>>      }
>>>>>>
>>>>>> +       { static int sum, cnt, max;
>>>>>> +       static unsigned long previous;
>>>>>> +       static int sum2, max2;
>>>>>> +
>>>>>> +       sum += clear_writes;
>>>>>> +       cnt += 1;
>>>>>> +
>>>>>> +       if (max < clear_writes) max = clear_writes;
>>>>>> +
>>>>>> +       sum2 += hidden_writes;
>>>>>> +       if (max2 < hidden_writes) max2 = hidden_writes;
>>>>>> +
>>>>>> +       if (cnt > 100 && time_after(jiffies, previous + 10*HZ)) {
>>>>>> +               printk("write_page_cache: sum=%d cnt=%d max=%d mean=%d sum2=%d max2=%d mean2=%d\n",
>>>>>> +                      sum, cnt, max, sum/cnt,
>>>>>> +                      sum2, max2, sum2/cnt);
>>>>>> +               sum = 0;
>>>>>> +               cnt = 0;
>>>>>> +               max = 0;
>>>>>> +               max2 = 0;
>>>>>> +               sum2 = 0;
>>>>>> +               previous = jiffies;
>>>>>> +       }
>>>>>> +       }
>>>>>>      return ret;
>>>>>>  }
>>>>>>  EXPORT_SYMBOL(write_cache_pages);
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------
>>>>>> From c8a4051c3731b6db224482218cfd535ab9393ff8 Mon Sep 17 00:00:00 2001
>>>>>> From: Eric Sandeen <sandeen@sandeen.net>
>>>>>> Date: Fri, 31 Jul 2009 00:02:17 -0500
>>>>>> Subject: [PATCH] xfs: bump up nr_to_write in xfs_vm_writepage
>>>>>>
>>>>>> VM calculation for nr_to_write seems off.  Bump it way
>>>>>> up, this gets simple streaming writes zippy again.
>>>>>> To be reviewed again after Jens' writeback changes.
>>>>>>
>>>>>> Signed-off-by: Christoph Hellwig <hch@infradead.org>
>>>>>> Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
>>>>>> Cc: Chris Mason <chris.mason@oracle.com>
>>>>>> Reviewed-by: Felix Blyakher <felixb@sgi.com>
>>>>>> Signed-off-by: Felix Blyakher <felixb@sgi.com>
>>>>>> ---
>>>>>>  fs/xfs/linux-2.6/xfs_aops.c |    8 ++++++++
>>>>>>  1 files changed, 8 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
>>>>>> index 7ec89fc..aecf251 100644
>>>>>> --- a/fs/xfs/linux-2.6/xfs_aops.c
>>>>>> +++ b/fs/xfs/linux-2.6/xfs_aops.c
>>>>>> @@ -1268,6 +1268,14 @@ xfs_vm_writepage(
>>>>>>      if (!page_has_buffers(page))
>>>>>>              create_empty_buffers(page, 1 << inode->i_blkbits, 0);
>>>>>>
>>>>>> +
>>>>>> +       /*
>>>>>> +        *  VM calculation for nr_to_write seems off.  Bump it way
>>>>>> +        *  up, this gets simple streaming writes zippy again.
>>>>>> +        *  To be reviewed again after Jens' writeback changes.
>>>>>> +        */
>>>>>> +       wbc->nr_to_write *= 4;
>>>>>> +
>>>>>>      /*
>>>>>>       * Convert delayed allocate, unwritten or unmapped space
>>>>>>       * to real space and flush out to disk.
>>>>>> --
>>>>>> 1.6.4.3
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>
>>>
>
>

[-- Attachment #2: XFSvMD_2.pdf --]
[-- Type: application/pdf, Size: 34619 bytes --]


* Re: MD write performance issue - found Catalyst patches
  2009-11-04 17:15           ` mark delfman
  2009-11-04 17:25             ` Asdo
@ 2009-11-04 19:05             ` Steve Cousins
  2009-11-04 22:08               ` mark delfman
  1 sibling, 1 reply; 18+ messages in thread
From: Steve Cousins @ 2009-11-04 19:05 UTC (permalink / raw)
  To: mark delfman; +Cc: linux-raid

mark delfman wrote:
> Some FS comparisons attached in pdf
>
> not sure what to make of them as yet, but worth posting
>   

I'm not sure either. Two things jump out.

    1. Why is raw RAID0 read performance slower than write performance?
    2. Why is read performance with some file systems at or above raw 
read performance?

For number one, does this indicate that write caching is actually On on 
the drives?

Are all tests truly apples-apples comparisons or were there other 
factors in there that aren't listed in the charts?

I guess these issues might not have a lot to do with your main question 
but you might want to double-check the tests and numbers.

Steve


* Re: MD write performance issue - found Catalyst patches
  2009-11-04 19:05             ` Steve Cousins
@ 2009-11-04 22:08               ` mark delfman
  0 siblings, 0 replies; 18+ messages in thread
From: mark delfman @ 2009-11-04 22:08 UTC (permalink / raw)
  To: Steve Cousins; +Cc: linux-raid

Yes, write cache is on for the drives, and the comparisons are all with the
same hardware... apples for apples and no pears ;)

Write is often faster (in my mind) simply because you can use a lot of
write cache (in the system), whilst with reads you are limited to what
the drives can pull off.
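
One way to see how much of a write figure comes from the system cache rather than the disks is to time the same write with and without an fsync. The following is a minimal sketch; the function name, the 64 MB default and the scratch-file approach are arbitrary choices for the example, and it is not the benchmark behind the attached charts.

```python
import os
import tempfile
import time


def write_throughput(size_mb=64, sync=False, directory=None):
    """Write size_mb of zeros to a scratch file and return MB/s.

    With sync=False the timer stops once the data has been handed to
    the page cache; with sync=True it stops only after os.fsync()
    returns, i.e. after the kernel has pushed the data to the device.
    """
    chunk = b"\0" * (1024 * 1024)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            if sync:
                os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        # Always remove the scratch file, even if the write failed.
        os.unlink(path)
    # Guard against a zero interval on very small writes.
    return size_mb / max(elapsed, 1e-9)
```

On a machine with plenty of free RAM the sync=False figure is usually the higher of the two. Pointing `directory` at a filesystem mounted on the md array makes the comparison relevant to the array itself, and the size should be large relative to RAM before the result says much about the drives.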

I also guess that the FSs (mainly ext2, it seems) are more efficient at
implementing a read cache than a raw device, hence the slight
performance increase... but I am just guessing, to be honest.




On Wed, Nov 4, 2009 at 7:05 PM, Steve Cousins <steve.cousins@maine.edu> wrote:
> mark delfman wrote:
>>
>> Some FS comparisons attached in pdf
>>
>> not sure what to make of them as yet, but worth posting
>>
>
> I'm not sure either. Two things jump out.
>
>   1. Why is raw RAID0 read performance slower than write performance
>   2. Why is read performance with some file systems at or above raw read
> performance?
>
> For number one, does this indicate that write caching is actually On on the
> drives?
>
> Are all tests truly apples-apples comparisons or were there other factors in
> there that aren't listed in the charts?
>
> I guess these issues might not have a lot to do with your main question but
> you might want to double-check the tests and numbers.
>
> Steve
>



Thread overview: 18+ messages
2009-10-18 10:00 MD write performance issue - found Catalyst patches mark delfman
2009-10-18 22:39 ` NeilBrown
2009-10-29  6:41 ` Neil Brown
2009-10-29  6:48   ` Thomas Fjellstrom
2009-10-29  7:32     ` Thomas Fjellstrom
2009-10-29  8:08   ` Asdo
2009-10-31 10:51     ` mark delfman
2009-11-03  4:58       ` Neil Brown
2009-11-03 12:11         ` mark delfman
2009-11-04 17:15           ` mark delfman
2009-11-04 17:25             ` Asdo
     [not found]               ` <66781b10911050904m407d14d6t7d3bec12578d6500@mail.gmail.com>
2009-11-05 19:09                 ` Asdo
2009-11-06  4:52                   ` Neil Brown
2009-11-06 10:28                     ` Asdo
2009-11-06 10:51                       ` Neil F Brown
2009-11-06 15:51                   ` mark delfman
2009-11-04 19:05             ` Steve Cousins
2009-11-04 22:08               ` mark delfman
