* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JE1Uz-0002w5-6z@localhost.localdomain>
@ 2008-01-13 11:59 ` Fengguang Wu
2008-01-13 11:59 ` Fengguang Wu
1 sibling, 0 replies; 41+ messages in thread
From: Fengguang Wu @ 2008-01-13 11:59 UTC (permalink / raw)
To: Joerg Platte
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org
On Sun, Jan 13, 2008 at 10:49:31AM +0100, Joerg Platte wrote:
> register_jprobe(ext2_writepage) = 0
> register_jprobe(requeue_io) = 0
> register_kprobe(submit_bio) = 0
> requeue_io:
> inode 114019(sda7/.kde) count 2,2 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114025(sda7/cache-ibm) count 2,1 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114029(sda7/socket-ibm) count 2,3 size 0 pages 1
> 0 2 0 U____
> requeue_io:
> inode 114017(sda7/0266584877) count 3,6 size 0 pages 1
> 0 2 0 U____
It helps. Thank you, Joerg!
The .kde/cache-ibm/socket-ibm/0266584877 entries above are directories.
It's odd that directories would have their own mappings in ext2. In
particular, this bug is triggered because the directory's mapping page
has PAGECACHE_TAG_DIRTY set but PG_dirty cleared, leaving it in an
inconsistent state.
Fengguang
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JEGPH-0001uw-Df@localhost.localdomain>
@ 2008-01-14 3:54 ` Fengguang Wu
2008-01-14 3:54 ` Fengguang Wu
1 sibling, 0 replies; 41+ messages in thread
From: Fengguang Wu @ 2008-01-14 3:54 UTC (permalink / raw)
To: Joerg Platte
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org
On Sun, Jan 13, 2008 at 07:59:33PM +0800, Fengguang Wu wrote:
> On Sun, Jan 13, 2008 at 10:49:31AM +0100, Joerg Platte wrote:
> > register_jprobe(ext2_writepage) = 0
> > register_jprobe(requeue_io) = 0
> > register_kprobe(submit_bio) = 0
> > requeue_io:
> > inode 114019(sda7/.kde) count 2,2 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114025(sda7/cache-ibm) count 2,1 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114029(sda7/socket-ibm) count 2,3 size 0 pages 1
> > 0 2 0 U____
> > requeue_io:
> > inode 114017(sda7/0266584877) count 3,6 size 0 pages 1
> > 0 2 0 U____
>
> It helps. Thank you, Joerg!
>
> The .kde/cache-ibm/socket-ibm/0266584877 above are directories.
> It's weird that dirs would have their own mappings in ext2. In
Oh, ext2 directories do have their own mapping pages, unlike ext3.
> particular this bug is triggered because the dir mapping page has
> PAGECACHE_TAG_DIRTY set and PG_dirty cleared, staying in an
> inconsistent state.
Just found that a deleted directory will enter that inconsistent state when
someone still holds a reference to it...
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JEM2I-00010S-5U@localhost.localdomain>
2008-01-14 9:55 ` Fengguang Wu
@ 2008-01-14 9:55 ` Fengguang Wu
2008-01-14 11:30 ` Joerg Platte
1 sibling, 1 reply; 41+ messages in thread
From: Fengguang Wu @ 2008-01-14 9:55 UTC (permalink / raw)
To: Joerg Platte
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org
On Mon, Jan 14, 2008 at 11:54:39AM +0800, Fengguang Wu wrote:
> > particular this bug is triggered because the dir mapping page has
> > PAGECACHE_TAG_DIRTY set and PG_dirty cleared, staying in an
> > inconsistent state.
>
> Just found that a deleted dir will enter that inconsistent state when
> someone still have reference to it...
Joerg, this patch fixed the bug for me :-)
Fengguang
---
clear PAGECACHE_TAG_DIRTY for truncated page in block_write_full_page()
The `truncated' page in block_write_full_page() may stick around for a long time.
E.g. ext2_rmdir() sets i_size to 0, but the dir may still be referenced by
someone and still have dirty pages in it.
So clear PAGECACHE_TAG_DIRTY to prevent pdflush from retrying and io-waiting on
it.
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
---
fs/buffer.c | 2 ++
1 files changed, 2 insertions(+)
Index: linux/fs/buffer.c
===================================================================
--- linux.orig/fs/buffer.c
+++ linux/fs/buffer.c
@@ -2820,7 +2820,9 @@ int block_write_full_page(struct page *p
* freeable here, so the page does not leak.
*/
do_invalidatepage(page, 0);
+ set_page_writeback(page);
unlock_page(page);
+ end_page_writeback(page);
return 0; /* don't care */
}
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-14 9:55 ` Fengguang Wu
@ 2008-01-14 11:30 ` Joerg Platte
2008-01-14 11:41 ` Peter Zijlstra
0 siblings, 1 reply; 41+ messages in thread
From: Joerg Platte @ 2008-01-14 11:30 UTC (permalink / raw)
To: Fengguang Wu
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org
Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> Joerg, this patch fixed the bug for me :-)
Fengguang, congratulations, I can confirm that your patch fixed the bug! With
previous kernels the bug showed up after each reboot. Now, when booting the
patched kernel everything is fine and there is no longer any suspicious
iowait!
Do you have an idea why this problem appeared in 2.6.24? Did somebody change
the ext2 code or is it related to the changes in the scheduler?
regards,
Jörg
--
PGP Key: send mail with subject 'SEND PGP-KEY' PGP Key-ID: FD 4E 21 1D
PGP Fingerprint: 388A872AFC5649D3 BCEC65778BE0C605
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-14 11:30 ` Joerg Platte
@ 2008-01-14 11:41 ` Peter Zijlstra
[not found] ` <E1JEOmD-0001Ap-U7@localhost.localdomain>
0 siblings, 1 reply; 41+ messages in thread
From: Peter Zijlstra @ 2008-01-14 11:41 UTC (permalink / raw)
To: jplatte; +Cc: Fengguang Wu, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org
On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
>
> > Joerg, this patch fixed the bug for me :-)
>
> Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> previous kernels the bug showed up after each reboot. Now, when booting the
> patched kernel everything is fine and there is no longer any suspicious
> iowait!
>
> Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> the ext2 code or is it related to the changes in the scheduler?
It was Fengguang who changed the inode writeback code, and I guess the
new and improved code was less able to deal with these funny corner
cases. But he has been very good at tracking them down and solving them,
kudos to him for that work!
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JEOmD-0001Ap-U7@localhost.localdomain>
@ 2008-01-14 12:50 ` Fengguang Wu
2008-01-15 21:13 ` Mike Snitzer
2008-01-14 12:50 ` Fengguang Wu
2008-01-15 21:42 ` Ingo Molnar
2 siblings, 1 reply; 41+ messages in thread
From: Fengguang Wu @ 2008-01-14 12:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: jplatte, Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
Linus Torvalds, Andrew Morton
On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
>
> On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> >
> > > Joerg, this patch fixed the bug for me :-)
> >
> > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > previous kernels the bug showed up after each reboot. Now, when booting the
> > patched kernel everything is fine and there is no longer any suspicious
> > iowait!
> >
> > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > the ext2 code or is it related to the changes in the scheduler?
>
> It was Fengguang who changed the inode writeback code, and I guess the
> new and improved code was less able do deal with these funny corner
> cases. But he has been very good in tracking them down and solving them,
> kudos to him for that work!
Thank you.
In particular, the bug is triggered by the patch named:
"writeback: introduce writeback_control.more_io to indicate more io"
That patch is meant to speed up writeback, but unfortunately its
aggressiveness has exposed bugs in reiserfs, jfs and now ext2.
Linus, given the number of bugs it triggered, I'd recommend reverting
this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
push it back to the -mm tree for more testing?
Regards,
Fengguang
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-14 12:50 ` Fengguang Wu
@ 2008-01-15 21:13 ` Mike Snitzer
[not found] ` <E1JF0m1-000101-OK@localhost.localdomain>
0 siblings, 1 reply; 41+ messages in thread
From: Mike Snitzer @ 2008-01-15 21:13 UTC (permalink / raw)
To: Fengguang Wu
Cc: Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds, Andrew Morton
On Jan 14, 2008 7:50 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> >
> > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > >
> > > > Joerg, this patch fixed the bug for me :-)
> > >
> > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > patched kernel everything is fine and there is no longer any suspicious
> > > iowait!
> > >
> > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > the ext2 code or is it related to the changes in the scheduler?
> >
> > It was Fengguang who changed the inode writeback code, and I guess the
> > new and improved code was less able do deal with these funny corner
> > cases. But he has been very good in tracking them down and solving them,
> > kudos to him for that work!
>
> Thank you.
>
> In particular the bug is triggered by the patch named:
> "writeback: introduce writeback_control.more_io to indicate more io"
> That patch means to speed up writeback, but unfortunately its
> aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
>
> Linus, given the number of bugs it triggered, I'd recommend revert
> this patch(git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> push it back to -mm tree for more testings?
Fengguang,
I'd like to better understand where your writeback work stands
relative to 2.6.24-rcX and -mm. To be clear, your changes in
2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
performance improvement with ext3 (as compared to 2.6.22, CFS could be
helping, etc but...). Very impressive!
Given this improvement it is unfortunate to see your request to revert
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
you're not confident in it for 2.6.24.
That said, you recently posted an -mm patchset that first reverts
2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
the "slow writes for concurrent large and small file writes" bug:
http://lkml.org/lkml/2008/1/15/132
For those interested in using your writeback improvements in
production sooner rather than later (primarily with ext3); what
recommendations do you have? Just heavily test our own 2.6.24 + your
evolving "close, but not ready for merge" -mm writeback patchset?
regards,
Mike
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JEOmD-0001Ap-U7@localhost.localdomain>
2008-01-14 12:50 ` Fengguang Wu
2008-01-14 12:50 ` Fengguang Wu
@ 2008-01-15 21:42 ` Ingo Molnar
[not found] ` <E1JF0bJ-0000zU-FG@localhost.localdomain>
2 siblings, 1 reply; 41+ messages in thread
From: Ingo Molnar @ 2008-01-15 21:42 UTC (permalink / raw)
To: Fengguang Wu
Cc: Peter Zijlstra, jplatte, linux-kernel, linux-ext4@vger.kernel.org,
Linus Torvalds, Andrew Morton
* Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> >
> > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > >
> > > > Joerg, this patch fixed the bug for me :-)
> > >
> > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > patched kernel everything is fine and there is no longer any suspicious
> > > iowait!
> > >
> > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > the ext2 code or is it related to the changes in the scheduler?
> >
> > It was Fengguang who changed the inode writeback code, and I guess the
> > new and improved code was less able do deal with these funny corner
> > cases. But he has been very good in tracking them down and solving them,
> > kudos to him for that work!
>
> Thank you.
>
> In particular the bug is triggered by the patch named:
> "writeback: introduce writeback_control.more_io to indicate more io"
> That patch means to speed up writeback, but unfortunately its
> aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
>
> Linus, given the number of bugs it triggered, I'd recommend revert
> this patch(git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> push it back to -mm tree for more testings?
I don't think a revert at this stage is a good idea, and I'm not sure
pushing it back into -mm would really expose more of these bugs. And
these are real bugs in filesystems - bugs which we want to see fixed
anyway. You are also tracking down those bugs very fast.
[ perhaps, if it's possible technically (and if it is clean enough), you
might want to offer a runtime debug tunable that can be used to switch
off the new aspects of your code. That would speed up testing, in case
anyone suspects the new writeback code. ]
Ingo
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JF0bJ-0000zU-FG@localhost.localdomain>
@ 2008-01-16 5:14 ` Fengguang Wu
2008-01-16 5:14 ` Fengguang Wu
1 sibling, 0 replies; 41+ messages in thread
From: Fengguang Wu @ 2008-01-16 5:14 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, jplatte, linux-kernel, linux-ext4@vger.kernel.org,
Linus Torvalds, Andrew Morton, Mike Snitzer
On Tue, Jan 15, 2008 at 10:42:13PM +0100, Ingo Molnar wrote:
>
> * Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able do deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend revert
> > this patch(git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to -mm tree for more testings?
>
> i dont think a revert at this stage is a good idea and i'm not sure
> pushing it back into -mm would really expose more of these bugs. And
> these are real bugs in filesystems - bugs which we want to see fixed
> anyway. You are also tracking down those bugs very fast.
>
> [ perhaps, if it's possible technically (and if it is clean enough), you
> might want to offer a runtime debug tunable that can be used to switch
> off the new aspects of your code. That would speed up testing, in case
> anyone suspects the new writeback code. ]
The patch is too aggressive in itself; we'd better not risk it.
The iowait is only unpleasant, not destructive, but it will hurt if
many users complain. A comment says that "nfs_writepages() sometimes
bales out without doing anything."
However, I have an improved and safer patch now. It won't iowait
when nfs_writepages() bails out without increasing pages_skipped, or
even when some buggy filesystem forgets to clear PAGECACHE_TAG_DIRTY.
(The magic lies in the first chunk below.)
Mike, you can use this one on 2.6.24.
---
fs/fs-writeback.c | 17 +++++++++++++++--
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 22 insertions(+), 5 deletions(-)
--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
+ if (wbc->nr_to_write <= 0)
+ /*
+ * slice used up: queue for next turn
+ */
+ requeue_io(inode);
+ else
+ /*
+ * somehow blocked: retry later
+ */
+ redirty_tail(inode);
} else {
/*
* Otherwise fully redirty the inode so that
@@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ wbc->more_io = 1;
break;
+ }
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
}
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
};
/*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -565,8 +566,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JF0m1-000101-OK@localhost.localdomain>
@ 2008-01-16 5:25 ` Fengguang Wu
2008-01-16 5:25 ` Fengguang Wu
1 sibling, 0 replies; 41+ messages in thread
From: Fengguang Wu @ 2008-01-16 5:25 UTC (permalink / raw)
To: Mike Snitzer
Cc: Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds, Andrew Morton
On Tue, Jan 15, 2008 at 04:13:22PM -0500, Mike Snitzer wrote:
> On Jan 14, 2008 7:50 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
>
> Fengguang,
>
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm. To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...). Very impressive!
Wow, glad to hear that.
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
>
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
>
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have? Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
It's not ready mainly because it is freshly made and needs more
feedback. It's doing OK on my desktop :-)
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-16 9:26 Martin Knoblauch
[not found] ` <E1JF6w8-0000vs-HM@localhost.localdomain>
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-16 9:26 UTC (permalink / raw)
To: Mike Snitzer, Fengguang Wu
Cc: Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Fengguang Wu <wfg@mail.ustc.edu.cn>
> Cc: Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; Andrew Morton <akpm@li>
> Sent: Tuesday, January 15, 2008 10:13:22 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Jan 14, 2008 7:50 AM, Fengguang Wu wrote:
> > On Mon, Jan 14, 2008 at 12:41:26PM +0100, Peter Zijlstra wrote:
> > >
> > > On Mon, 2008-01-14 at 12:30 +0100, Joerg Platte wrote:
> > > > Am Montag, 14. Januar 2008 schrieb Fengguang Wu:
> > > >
> > > > > Joerg, this patch fixed the bug for me :-)
> > > >
> > > > Fengguang, congratulations, I can confirm that your patch fixed the bug! With
> > > > previous kernels the bug showed up after each reboot. Now, when booting the
> > > > patched kernel everything is fine and there is no longer any suspicious
> > > > iowait!
> > > >
> > > > Do you have an idea why this problem appeared in 2.6.24? Did somebody change
> > > > the ext2 code or is it related to the changes in the scheduler?
> > >
> > > It was Fengguang who changed the inode writeback code, and I guess the
> > > new and improved code was less able to deal with these funny corner
> > > cases. But he has been very good in tracking them down and solving them,
> > > kudos to him for that work!
> >
> > Thank you.
> >
> > In particular the bug is triggered by the patch named:
> > "writeback: introduce writeback_control.more_io to indicate more io"
> > That patch means to speed up writeback, but unfortunately its
> > aggressiveness has disclosed bugs in reiserfs, jfs and now ext2.
> >
> > Linus, given the number of bugs it triggered, I'd recommend reverting
> > this patch (git commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b). Let's
> > push it back to the -mm tree for more testing?
>
> Fengguang,
>
> I'd like to better understand where your writeback work stands
> relative to 2.6.24-rcX and -mm. To be clear, your changes in
> 2.6.24-rc7 have been benchmarked to provide a ~33% sequential write
> performance improvement with ext3 (as compared to 2.6.22, CFS could be
> helping, etc but...). Very impressive!
>
> Given this improvement it is unfortunate to see your request to revert
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b but it is understandable if
> you're not confident in it for 2.6.24.
>
> That said, you recently posted an -mm patchset that first reverts
> 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b and then goes on to address
> the "slow writes for concurrent large and small file writes" bug:
> http://lkml.org/lkml/2008/1/15/132
>
> For those interested in using your writeback improvements in
> production sooner rather than later (primarily with ext3); what
> recommendations do you have? Just heavily test our own 2.6.24 + your
> evolving "close, but not ready for merge" -mm writeback patchset?
>
Hi Fengguang, Mike,
I can add myself to Mike's question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted "...2250b". I will definitely repeat my tests with -rc8 and report.
Cheers
Martin
* Re: regression: 100% io-wait with 2.6.24-rcX
[not found] ` <E1JF6w8-0000vs-HM@localhost.localdomain>
@ 2008-01-16 12:00 ` Fengguang Wu
2008-01-16 12:00 ` Fengguang Wu
1 sibling, 0 replies; 41+ messages in thread
From: Fengguang Wu @ 2008-01-16 12:00 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > For those interested in using your writeback improvements in
> > production sooner rather than later (primarily with ext3); what
> > recommendations do you have? Just heavily test our own 2.6.24 + your
> > evolving "close, but not ready for merge" -mm writeback patchset?
> >
> Hi Fengguang, Mike,
>
> I can add myself to Mike's question. It would be good to know a "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been showing quite nice improvement of the overall writeback situation and it would be sad to see this [partially] gone in 2.6.24-final. Linus apparently already has reverted "...2250b". I will definitely repeat my tests with -rc8 and report.
Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
Maybe we can push it to 2.6.24 after your testing.
Fengguang
---
fs/fs-writeback.c | 17 +++++++++++++++--
include/linux/writeback.h | 1 +
mm/page-writeback.c | 9 ++++++---
3 files changed, 22 insertions(+), 5 deletions(-)
--- linux.orig/fs/fs-writeback.c
+++ linux/fs/fs-writeback.c
@@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
* soon as the queue becomes uncongested.
*/
inode->i_state |= I_DIRTY_PAGES;
- requeue_io(inode);
+ if (wbc->nr_to_write <= 0)
+ /*
+ * slice used up: queue for next turn
+ */
+ requeue_io(inode);
+ else
+ /*
+ * somehow blocked: retry later
+ */
+ redirty_tail(inode);
} else {
/*
* Otherwise fully redirty the inode so that
@@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
iput(inode);
cond_resched();
spin_lock(&inode_lock);
- if (wbc->nr_to_write <= 0)
+ if (wbc->nr_to_write <= 0) {
+ wbc->more_io = 1;
break;
+ }
+ if (!list_empty(&sb->s_more_io))
+ wbc->more_io = 1;
}
return; /* Leave any unwritten inodes on s_io */
}
--- linux.orig/include/linux/writeback.h
+++ linux/include/linux/writeback.h
@@ -62,6 +62,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned for_writepages:1; /* This is a writepages() call */
unsigned range_cyclic:1; /* range_start is cyclic */
+ unsigned more_io:1; /* more io to be dispatched */
};
/*
--- linux.orig/mm/page-writeback.c
+++ linux/mm/page-writeback.c
@@ -558,6 +558,7 @@ static void background_writeout(unsigned
global_page_state(NR_UNSTABLE_NFS) < background_thresh
&& min_pages <= 0)
break;
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
wbc.pages_skipped = 0;
@@ -565,8 +566,9 @@ static void background_writeout(unsigned
min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
- congestion_wait(WRITE, HZ/10);
- if (!wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
+ congestion_wait(WRITE, HZ/10);
+ else
break;
}
}
@@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
global_page_state(NR_UNSTABLE_NFS) +
(inodes_stat.nr_inodes - inodes_stat.nr_unused);
while (nr_to_write > 0) {
+ wbc.more_io = 0;
wbc.encountered_congestion = 0;
wbc.nr_to_write = MAX_WRITEBACK_PAGES;
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
- if (wbc.encountered_congestion)
+ if (wbc.encountered_congestion || wbc.more_io)
congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
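The core of the fs-writeback.c hunk above is the choice between re-queueing an inode for the very next writeback turn and redirtying it for a later retry. A minimal Python model of that decision (hypothetical names, greatly simplified; the real code manipulates the s_more_io/s_dirty lists under inode_lock):

```python
def requeue_partially_written_inode(nr_to_write, s_more_io, s_dirty, inode):
    """Model of the patched __sync_single_inode() tail: decide where an
    inode that still has dirty pages should go.

    nr_to_write <= 0 means this pass used up its write slice, so the
    inode is queued for the next turn (requeue_io); otherwise writeback
    was somehow blocked, so the inode is redirtied to retry later
    (redirty_tail) instead of being requeued forever.
    """
    if nr_to_write <= 0:
        s_more_io.append(inode)   # requeue_io(): next turn
        return "s_more_io"
    s_dirty.append(inode)         # redirty_tail(): retry later
    return "s_dirty"
```

The "retry later" branch is what breaks the busy loop seen in the io-wait regression: a blocked inode no longer bounces straight back onto the queue being drained.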
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-16 14:15 Martin Knoblauch
2008-01-16 16:27 ` Mike Snitzer
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-16 14:15 UTC (permalink / raw)
To: Fengguang Wu
Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
----- Original Message ----
> From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > >
> > Hi Fengguang, Mike,
> >
> > I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
>
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
>
Will do tomorrow or Friday. Actually, a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it.
Cheers
Martin
> Fengguang
> ---
> fs/fs-writeback.c | 17 +++++++++++++++--
> include/linux/writeback.h | 1 +
> mm/page-writeback.c | 9 ++++++---
> 3 files changed, 22 insertions(+), 5 deletions(-)
>
> --- linux.orig/fs/fs-writeback.c
> +++ linux/fs/fs-writeback.c
> @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
> * soon as the queue becomes uncongested.
> */
> inode->i_state |= I_DIRTY_PAGES;
> - requeue_io(inode);
> + if (wbc->nr_to_write <= 0)
> + /*
> + * slice used up: queue for next turn
> + */
> + requeue_io(inode);
> + else
> + /*
> + * somehow blocked: retry later
> + */
> + redirty_tail(inode);
> } else {
> /*
> * Otherwise fully redirty the inode so that
> @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> iput(inode);
> cond_resched();
> spin_lock(&inode_lock);
> - if (wbc->nr_to_write <= 0)
> + if (wbc->nr_to_write <= 0) {
> + wbc->more_io = 1;
> break;
> + }
> + if (!list_empty(&sb->s_more_io))
> + wbc->more_io = 1;
> }
> return; /* Leave any unwritten inodes on s_io */
> }
> --- linux.orig/include/linux/writeback.h
> +++ linux/include/linux/writeback.h
> @@ -62,6 +62,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page
> allocator
>
*/
> unsigned for_writepages:1; /* This is a writepages() call */
> unsigned range_cyclic:1; /* range_start is cyclic */
> + unsigned more_io:1; /* more io to be dispatched */
> };
>
> /*
> --- linux.orig/mm/page-writeback.c
> +++ linux/mm/page-writeback.c
> @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> global_page_state(NR_UNSTABLE_NFS) < background_thresh
> && min_pages <= 0)
> break;
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> wbc.pages_skipped = 0;
> @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> /* Wrote less than expected */
> - congestion_wait(WRITE, HZ/10);
> - if (!wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> + congestion_wait(WRITE, HZ/10);
> + else
> break;
> }
> }
> @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> global_page_state(NR_UNSTABLE_NFS) +
> (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> while (nr_to_write > 0) {
> + wbc.more_io = 0;
> wbc.encountered_congestion = 0;
> wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> - if (wbc.encountered_congestion)
> + if (wbc.encountered_congestion || wbc.more_io)
> congestion_wait(WRITE, HZ/10);
> else
> break; /* All the old data is written */
>
>
>
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-16 14:15 Martin Knoblauch
@ 2008-01-16 16:27 ` Mike Snitzer
0 siblings, 0 replies; 41+ messages in thread
From: Mike Snitzer @ 2008-01-16 16:27 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
On Jan 16, 2008 9:15 AM, Martin Knoblauch <spamtrap@knobisoft.de> wrote:
> ----- Original Message ----
> > From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > To: Martin Knoblauch <knobi@knobisoft.de>
> > Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
>
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
>
> Will do tomorrow or friday. Actually a patch against -rc8 would be nicer for me, as I have not looked at -rc7 due to holidays and some of the reported problems with it.
Fengguang's latest writeback patch applies cleanly, builds, boots on 2.6.24-rc8.
I'll be able to share ext3 performance results (relative to 2.6.24-rc7) shortly.
Mike
>
>
> > Fengguang
> > ---
> > fs/fs-writeback.c | 17 +++++++++++++++--
> > include/linux/writeback.h | 1 +
> > mm/page-writeback.c | 9 ++++++---
> > 3 files changed, 22 insertions(+), 5 deletions(-)
> >
> > --- linux.orig/fs/fs-writeback.c
> > +++ linux/fs/fs-writeback.c
> > @@ -284,7 +284,16 @@ __sync_single_inode(struct inode *inode,
> > * soon as the queue becomes uncongested.
> > */
> > inode->i_state |= I_DIRTY_PAGES;
> > - requeue_io(inode);
> > + if (wbc->nr_to_write <= 0)
> > + /*
> > + * slice used up: queue for next turn
> > + */
> > + requeue_io(inode);
> > + else
> > + /*
> > + * somehow blocked: retry later
> > + */
> > + redirty_tail(inode);
> > } else {
> > /*
> > * Otherwise fully redirty the inode so that
> > @@ -479,8 +488,12 @@ sync_sb_inodes(struct super_block *sb, s
> > iput(inode);
> > cond_resched();
> > spin_lock(&inode_lock);
> > - if (wbc->nr_to_write <= 0)
> > + if (wbc->nr_to_write <= 0) {
> > + wbc->more_io = 1;
> > break;
> > + }
> > + if (!list_empty(&sb->s_more_io))
> > + wbc->more_io = 1;
> > }
> > return; /* Leave any unwritten inodes on s_io */
> > }
> > --- linux.orig/include/linux/writeback.h
> > +++ linux/include/linux/writeback.h
> > @@ -62,6 +62,7 @@ struct writeback_control {
> > unsigned for_reclaim:1; /* Invoked from the page
> > allocator
> >
> */
> > unsigned for_writepages:1; /* This is a writepages() call */
> > unsigned range_cyclic:1; /* range_start is cyclic */
> > + unsigned more_io:1; /* more io to be dispatched */
> > };
> >
> > /*
> > --- linux.orig/mm/page-writeback.c
> > +++ linux/mm/page-writeback.c
> > @@ -558,6 +558,7 @@ static void background_writeout(unsigned
> > global_page_state(NR_UNSTABLE_NFS) < background_thresh
> > && min_pages <= 0)
> > break;
> > + wbc.more_io = 0;
> > wbc.encountered_congestion = 0;
> > wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> > wbc.pages_skipped = 0;
> > @@ -565,8 +566,9 @@ static void background_writeout(unsigned
> > min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > /* Wrote less than expected */
> > - congestion_wait(WRITE, HZ/10);
> > - if (!wbc.encountered_congestion)
> > + if (wbc.encountered_congestion || wbc.more_io)
> > + congestion_wait(WRITE, HZ/10);
> > + else
> > break;
> > }
> > }
> > @@ -631,11 +633,12 @@ static void wb_kupdate(unsigned long arg
> > global_page_state(NR_UNSTABLE_NFS) +
> > (inodes_stat.nr_inodes - inodes_stat.nr_unused);
> > while (nr_to_write > 0) {
> > + wbc.more_io = 0;
> > wbc.encountered_congestion = 0;
> > wbc.nr_to_write = MAX_WRITEBACK_PAGES;
> > writeback_inodes(&wbc);
> > if (wbc.nr_to_write > 0) {
> > - if (wbc.encountered_congestion)
> > + if (wbc.encountered_congestion || wbc.more_io)
> > congestion_wait(WRITE, HZ/10);
> > else
> > break; /* All the old data is written */
> >
> >
> >
>
>
>
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 13:52 Martin Knoblauch
2008-01-17 16:11 ` Mike Snitzer
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-17 13:52 UTC (permalink / raw)
To: Fengguang Wu
Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
----- Original Message ----
> From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Wednesday, January 16, 2008 1:00:04 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > For those interested in using your writeback improvements in
> > > production sooner rather than later (primarily with ext3); what
> > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > evolving "close, but not ready for merge" -mm writeback patchset?
> > >
> > Hi Fengguang, Mike,
> >
> > I can add myself to Mike's question. It would be good to know a
> > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > showing quite nice improvement of the overall writeback situation and it
> > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > apparently already has reverted "...2250b". I will definitely repeat my
> > tests with -rc8 and report.
>
> Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> Maybe we can push it to 2.6.24 after your testing.
>
Hi Fengguang,
something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-(
Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208, dd1 is now at 60 MB/sec (a slight plus), while dd2/dd3 suck the same way as in pre-2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff.
At the moment I am frantically trying to find when things went down. I did run -rc8 and -rc8 plus your patch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch.
Depressed
Martin
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-17 13:52 Martin Knoblauch
@ 2008-01-17 16:11 ` Mike Snitzer
0 siblings, 0 replies; 41+ messages in thread
From: Mike Snitzer @ 2008-01-17 16:11 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
On Jan 17, 2008 8:52 AM, Martin Knoblauch <spamtrap@knobisoft.de> wrote:
>
> ----- Original Message ----
> > From: Fengguang Wu <wfg@mail.ustc.edu.cn>
> > To: Martin Knoblauch <knobi@knobisoft.de>
> > Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 + your
> > > > evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mike's question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has been
> > > showing quite nice improvement of the overall writeback situation and it
> > > would be sad to see this [partially] gone in 2.6.24-final. Linus
> > > apparently already has reverted "...2250b". I will definitely repeat my
> > > tests with -rc8 and report.
> >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
> something really bad has happened between -rc3 and -rc6. Embarrassingly I did not catch that earlier :-(
>
> Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24. The only test that is still good is mix3, which I attribute to the per-BDI stuff.
>
> At the moment I am frantically trying to find when things went down. I did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6. Sorry that I cannot provide any input to your patch.
>
> Depressed
> Martin
Martin,
I've backported Peter's perbdi patchset to 2.6.22.x. I can share it
with anyone who might be interested.
As expected, it has yielded 2.6.24-rcX level scaling. Given the test
result matrix you previously posted, 2.6.22.x+perbdi might give you
what you're looking for (sans improved writeback that 2.6.24 was
thought to be providing). That is, much improved scaling with better
O_DIRECT and network throughput. Just a thought...
Unfortunately, my priorities (and computing resources) have shifted
and I won't be able to thoroughly test Fengguang's new writeback patch
on 2.6.24-rc8... whereby missing out on providing
justification/testing to others on _some_ improved writeback being
included in 2.6.24 final.
Not to mention the window for writeback improvement is all but closed
considering the 2.6.24-rc8 announcement's 2.6.24 final release
timetable.
regards,
Mike
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 17:44 Martin Knoblauch
2008-01-17 20:23 ` Mel Gorman
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-17 17:44 UTC (permalink / raw)
To: Fengguang Wu
Cc: Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
----- Original Message ----
> From: Martin Knoblauch <spamtrap@knobisoft.de>
> To: Fengguang Wu <wfg@mail.ustc.edu.cn>
> Cc: Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Thursday, January 17, 2008 2:52:58 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> ----- Original Message ----
> > From: Fengguang Wu
> > To: Martin Knoblauch
> > Cc: Mike Snitzer; Peter Zijlstra; jplatte@naasa.net; Ingo Molnar;
> >     linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org"; Linus Torvalds
> > Sent: Wednesday, January 16, 2008 1:00:04 PM
> > Subject: Re: regression: 100% io-wait with 2.6.24-rcX
> >
> > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > For those interested in using your writeback improvements in
> > > > production sooner rather than later (primarily with ext3); what
> > > > recommendations do you have? Just heavily test our own 2.6.24 +
> > > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > >
> > > Hi Fengguang, Mike,
> > >
> > > I can add myself to Mikes question. It would be good to know a
> > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > been showing quite nice improvement of the overall writeback situation
> > > and it would be sad to see this [partially] gone in 2.6.24-final.
> > > Linus apparently already has reverted "...2250b". I will definitely
> > > repeat my tests with -rc8. and report.
> > >
> > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > Maybe we can push it to 2.6.24 after your testing.
> >
> Hi Fengguang,
>
> something really bad has happened between -rc3 and -rc6.
> Embarrassingly I did not catch that earlier :-(
>
> Compared to the numbers I posted in http://lkml.org/lkml/2007/10/26/208 ,
> dd1 is now at 60 MB/sec (slight plus), while dd2/dd3 suck the same way
> as in pre 2.6.24. The only test that is still good is mix3, which I
> attribute to the per-BDI stuff.
>
> At the moment I am frantically trying to find when things went down. I
> did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> Sorry that I cannot provide any input to your patch.
>
OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted
#commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
#Author: Mel Gorman <mel@csn.ul.ie>
#Date: Mon Dec 17 16:20:05 2007 -0800
#
# mm: fix page allocation for larger I/O segments
#
# In some cases the IO subsystem is able to merge requests if the pages are
# adjacent in physical memory. This was achieved in the allocator by having
# expand() return pages in physically contiguous order in situations were a
# large buddy was split. However, list-based anti-fragmentation changed the
# order pages were returned in to avoid searching in buffered_rmqueue() for a
# page of the appropriate migrate type.
#
# This patch restores behaviour of rmqueue_bulk() preserving the physical
# order of pages returned by the allocator without incurring increased search
# costs for anti-fragmentation.
#
# Signed-off-by: Mel Gorman <mel@csn.ul.ie>
# Cc: James Bottomley <James.Bottomley@steeleye.com>
# Cc: Jens Axboe <jens.axboe@oracle.com>
# Cc: Mark Lord <mlord@pobox.com
# Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
# Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
--- linux-2.6.24-rc5/mm/page_alloc.c 2007-12-21 04:14:11.305633890 +0000
+++ linux-2.6.24-rc6/mm/page_alloc.c 2007-12-21 04:14:17.746985697 +0000
@@ -847,8 +847,19 @@
struct page *page = __rmqueue(zone, order, migratetype);
if (unlikely(page == NULL))
break;
+
+ /*
+ * Split buddy pages returned by expand() are received here
+ * in physical page order. The page is added to the callers and
+ * list and the list head then moves forward. From the callers
+ * perspective, the linked list is ordered by page number in
+ * some conditions. This is useful for IO devices that can
+ * merge IO requests if the physical pages are ordered
+ * properly.
+ */
list_add(&page->lru, list);
set_page_private(page, migratetype);
+ list = &page->lru;
}
spin_unlock(&zone->lock);
return i;
This has brought back the good results I observed and reported.
I do not know what to make out of this. At least on the systems I care
about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, SmartArray 6i
controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
cache enabled) and gigabit networking (tg3)) this optimisation is a disaster.
On the other hand, it is not a regression against 2.6.22/23. Those had
bad IO scaling too. It would just be a shame to lose an apparently great
performance win.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 17:51 Martin Knoblauch
0 siblings, 0 replies; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-17 17:51 UTC (permalink / raw)
To: Mike Snitzer
Cc: Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, Linus Torvalds
----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>
> Sent: Thursday, January 17, 2008 5:11:50 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
>
> I've backported Peter's perbdi patchset to 2.6.22.x. I can share it
> with anyone who might be interested.
>
> As expected, it has yielded 2.6.24-rcX level scaling. Given the test
> result matrix you previously posted, 2.6.22.x+perbdi might give you
> what you're looking for (sans improved writeback that 2.6.24 was
> thought to be providing). That is, much improved scaling with better
> O_DIRECT and network throughput. Just a thought...
>
> Unfortunately, my priorities (and computing resources) have shifted
> and I won't be able to thoroughly test Fengguang's new writeback patch
> on 2.6.24-rc8... whereby missing out on providing
> justification/testing to others on _some_ improved writeback being
> included in 2.6.24 final.
>
> Not to mention the window for writeback improvement is all but closed
> considering the 2.6.24-rc8 announcement's 2.6.24 final release
> timetable.
>
Mike,
thanks for the offer, but the improved throughput is my #1 priority nowadays.
And while the better scaling for different targets is nothing to frown upon, the
much better scaling when writing to the same target would have been the big
winner for me.
Anyway, I located the "offending" commit. Lets see what the experts say.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-17 17:44 regression: 100% io-wait with 2.6.24-rcX Martin Knoblauch
@ 2008-01-17 20:23 ` Mel Gorman
0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2008-01-17 20:23 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, Linus Torvalds,
James.Bottomley
On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > For those interested in using your writeback improvements in
> > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > recommendations do you have? Just heavily test our own 2.6.24 +
> > > > > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > >
> > > > >
> > > > > I can add myself to Mikes question. It would be good to know a
> > > >
> > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far has
> > > > been showing quite nice improvement of the overall writeback situation and
> > > > it would be sad to see this [partially] gone in 2.6.24-final.
> > > > Linus apparently already has reverted "...2250b". I will definitely
> > > > repeat my tests with -rc8. and report.
> > > >
> > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > Maybe we can push it to 2.6.24 after your testing.
> > >
> > Hi Fengguang,
> >
> > something really bad has happened between -rc3 and -rc6.
> > Embarrassingly I did not catch that earlier :-(
> > Compared to the numbers I posted in
> > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > The only test that is still good is mix3, which I attribute to
> > the per-BDI stuff.
I suspect that the IO hardware you have is very sensitive to the color of the
physical page. I wonder, do you boot the system cleanly and then run these
tests? If so, it would be interesting to know what happens if you stress
the system first (many kernel compiles for example, basically anything that
would use a lot of memory in different ways for some time) to randomise the
free lists a bit and then run your test. You'd need to run the test
three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the patch you
identified reverted.
> > At the moment I am frantically trying to find when things went down. I
> > did run -rc8 and rc8+yourpatch. No difference to what I see with -rc6.
> >
> > Sorry that I cannot provide any input to your patch.
>
> OK, the change happened between rc5 and rc6. Just following a gut feeling, I reverted
>
> #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> #Author: Mel Gorman <mel@csn.ul.ie>
> #Date: Mon Dec 17 16:20:05 2007 -0800
> #
> # mm: fix page allocation for larger I/O segments
> #
> # In some cases the IO subsystem is able to merge requests if the pages are
> # adjacent in physical memory. This was achieved in the allocator by having
> # expand() return pages in physically contiguous order in situations were a
> # large buddy was split. However, list-based anti-fragmentation changed the
> # order pages were returned in to avoid searching in buffered_rmqueue() for a
> # page of the appropriate migrate type.
> #
> # This patch restores behaviour of rmqueue_bulk() preserving the physical
> # order of pages returned by the allocator without incurring increased search
> # costs for anti-fragmentation.
> #
> # Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> # Cc: James Bottomley <James.Bottomley@steeleye.com>
> # Cc: Jens Axboe <jens.axboe@oracle.com>
> # Cc: Mark Lord <mlord@pobox.com
> # Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> # Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> diff -urN linux-2.6.24-rc5/mm/page_alloc.c linux-2.6.24-rc6/mm/page_alloc.c
> --- linux-2.6.24-rc5/mm/page_alloc.c 2007-12-21 04:14:11.305633890 +0000
> +++ linux-2.6.24-rc6/mm/page_alloc.c 2007-12-21 04:14:17.746985697 +0000
> @@ -847,8 +847,19 @@
> struct page *page = __rmqueue(zone, order, migratetype);
> if (unlikely(page == NULL))
> break;
> +
> + /*
> + * Split buddy pages returned by expand() are received here
> + * in physical page order. The page is added to the callers and
> + * list and the list head then moves forward. From the callers
> + * perspective, the linked list is ordered by page number in
> + * some conditions. This is useful for IO devices that can
> + * merge IO requests if the physical pages are ordered
> + * properly.
> + */
> list_add(&page->lru, list);
> set_page_private(page, migratetype);
> + list = &page->lru;
> }
> spin_unlock(&zone->lock);
> return i;
>
> This has brought back the good results I observed and reported.
> I do not know what to make out of this. At least on the systems I care
> about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory, SmartaArray6i
> controller with 4x72GB SCSI disks as RAID5 (battery protected writeback
> cache enabled) and gigabit networking (tg3)) this optimisation is a dissaster.
>
That patch was not an optimisation, it was a regression fix against 2.6.23
and I don't believe reverting it is an option. Other IO hardware benefits
from having the allocator supply pages in PFN order. Your controller would
seem to suffer when presented with the same situation but I don't know why
that is. I've added James to the cc in case he has seen this sort of
situation before.
> On the other hand, it is not a regression against 2.6.22/23. Those had
> bad IO scaling too. It would just be a shame to lose an apparently great
> performance win.
Could you try running your tests again when the system has been stressed
with some other workload first?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-17 21:50 Martin Knoblauch
2008-01-17 22:12 ` Mel Gorman
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-17 21:50 UTC (permalink / raw)
To: Mel Gorman
Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, Linus Torvalds,
James.Bottomley
----- Original Message ----
> From: Mel Gorman <mel@csn.ul.ie>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; James.Bottomley@steeleye.com
> Sent: Thursday, January 17, 2008 9:23:57 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 09:44), Martin Knoblauch didst pronounce:
> > > > > > > > On Wed, Jan 16, 2008 at 01:26:41AM -0800, Martin Knoblauch wrote:
> > > > > > > For those interested in using your writeback improvements in
> > > > > > > production sooner rather than later (primarily with ext3); what
> > > > > > > recommendations do you have? Just heavily test our own 2.6.24 +
> > > > > > > your evolving "close, but not ready for merge" -mm writeback patchset?
> > > > > > >
> > > > > > I can add myself to Mikes question. It would be good to know a
> > > > > > "roadmap" for the writeback changes. Testing 2.6.24-rcX so far
> > > > > > has been showing quite nice improvement of the overall writeback
> > > > > > situation and it would be sad to see this [partially] gone in
> > > > > > 2.6.24-final. Linus apparently already has reverted "...2250b".
> > > > > > I will definitely repeat my tests with -rc8. and report.
> > > > > >
> > > > Thank you, Martin. Can you help test this patch on 2.6.24-rc7?
> > > > Maybe we can push it to 2.6.24 after your testing.
> > > >
> > > Hi Fengguang,
> > >
> > > something really bad has happened between -rc3 and -rc6.
> > > Embarrassingly I did not catch that earlier :-(
> > > Compared to the numbers I posted in
> > > http://lkml.org/lkml/2007/10/26/208 , dd1 is now at 60 MB/sec
> > > (slight plus), while dd2/dd3 suck the same way as in pre 2.6.24.
> > > The only test that is still good is mix3, which I attribute to
> > > the per-BDI stuff.
>
> I suspect that the IO hardware you have is very sensitive to the color
> of the physical page. I wonder, do you boot the system cleanly and then
> run these tests? If so, it would be interesting to know what happens if
> you stress the system first (many kernel compiles for example, basically
> anything that would use a lot of memory in different ways for some time)
> to randomise the free lists a bit and then run your test. You'd need to
> run the test three times for 2.6.23, 2.6.24-rc8 and 2.6.24-rc8 with the
> patch you identified reverted.
>
The effect is definitely depending on the IO hardware. I performed the same
tests on a different box with an AACRAID controller and there things look
different. Basically the "offending" commit helps single stream performance
on that box, while dual/triple stream are not affected. So I suspect that
the CCISS is just not behaving well.
And yes, the tests are usually done on a freshly booted box. Of course, I
repeat them a few times. On the CCISS box the numbers are very constant.
On the AACRAID box they vary quite a bit.
I can certainly stress the box before doing the tests. Please define "many"
for the kernel compiles :-)
> >
> > OK, the change happened between rc5 and rc6. Just following a
> > gut feeling, I reverted
> >
> > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > #Author: Mel Gorman
> > #Date: Mon Dec 17 16:20:05 2007 -0800
> > #
> >
> > This has brought back the good results I observed and reported.
> > I do not know what to make out of this. At least on the systems
> > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > protected writeback cache enabled) and gigabit networking (tg3)) this
> > optimisation is a dissaster.
> >
>
> That patch was not an optimisation, it was a regression fix
> against 2.6.23 and I don't believe reverting it is an option. Other IO
> hardware benefits from having the allocator supply pages in PFN order.
I think this late in the 2.6.24 game we just should leave things as they are. But
we should try to find a way to make CCISS faster, as it apparently can be faster.
> Your controller would seem to suffer when presented with the same situation
> but I don't know why that is. I've added James to the cc in case he has seen this
> sort of situation before.
>
> > On the other hand, it is not a regression against 2.6.22/23. Those
> > had bad IO scaling too. It would just be a shame to lose an apparently
> > great performance win.
>
> Could you try running your tests again when the system has been
> stressed with some other workload first?
>
Will do.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-17 21:50 Martin Knoblauch
@ 2008-01-17 22:12 ` Mel Gorman
0 siblings, 0 replies; 41+ messages in thread
From: Mel Gorman @ 2008-01-17 22:12 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, Linus Torvalds,
James.Bottomley
On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > <mail manglement snipped>
>
> The effect is definitely depending on the IO hardware. I performed the same tests
> on a different box with an AACRAID controller and there things look different.
I take it different also means it does not show this odd performance
behaviour and is similar whether the patch is applied or not?
> Basically the "offending" commit helps single stream performance on that
> box, while dual/triple stream are not affected. So I suspect that the
> CCISS is just not behaving well.
>
> And yes, the tests are usually done on a freshly booted box. Of course, I repeat them
> a few times. On the CCISS box the numbers are very constant. On the AACRAID box
> they vary quite a bit.
>
> I can certainly stress the box before doing the tests. Please define "many" for the kernel
> compiles :-)
>
With 8GiB of RAM, try making 24 copies of the kernel and compiling them
all simultaneously. Running that for 20-30 minutes should be enough to
randomise the freelists affecting what color of page is used for the dd test.
> > > OK, the change happened between rc5 and rc6. Just following a
> > > gut feeling, I reverted
> > >
> > > #commit 81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> > > #Author: Mel Gorman
> > > #Date: Mon Dec 17 16:20:05 2007 -0800
> > > #
>
> > >
> > > This has brought back the good results I observed and reported.
> > > I do not know what to make out of this. At least on the systems
> > > I care about (HP/DL380g4, dual CPUs, HT-enabled, 8 GB Memory,
> > > SmartaArray6i controller with 4x72GB SCSI disks as RAID5 (battery
> > > protected writeback cache enabled) and gigabit networking (tg3)) this
> > > optimisation is a dissaster.
> > >
> >
> > That patch was not an optimisation, it was a regression fix
> > against 2.6.23 and I don't believe reverting it is an option. Other IO
> > hardware benefits from having the allocator supply pages in PFN order.
>
> I think this late in the 2.6.24 game we just should leave things as they are. But
> we should try to find a way to make CCISS faster, as it apparently can be faster.
>
> > Your controller would seem to suffer when presented with the same situation
> > but I don't know why that is. I've added James to the cc in case he has seen this
> > sort of situation before.
> >
> > > On the other hand, it is not a regression against 2.6.22/23. Those
> > > had bad IO scaling too. It would just be a shame to lose an apparently
> > > great performance win.
> >
> > Could you try running your tests again when the system has been
> > stressed with some other workload first?
> >
>
> Will do.
>
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-18 8:19 Martin Knoblauch
2008-01-18 16:01 ` Mel Gorman
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-18 8:19 UTC (permalink / raw)
To: Mel Gorman
Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, Linus Torvalds,
James.Bottomley
----- Original Message ----
> From: Mel Gorman <mel@csn.ul.ie>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; Linus Torvalds <torvalds@linux-foundation.org>; James.Bottomley@steeleye.com
> Sent: Thursday, January 17, 2008 11:12:21 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On (17/01/08 13:50), Martin Knoblauch didst pronounce:
> > The effect is definitely depending on the IO hardware. I performed
> > the same tests on a different box with an AACRAID controller and
> > there things look different.
>
> I take it different also means it does not show this odd performance
> behaviour and is similar whether the patch is applied or not?
>
Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:
Test       2.6.19.2   2.6.24-rc6   2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
dd1          325         350          290
dd1-dir      180         160          160
dd2         2x90        2x113        2x110
dd2-dir    2x120        2x92         2x93
dd3         3x54        3x70         3x70
dd3-dir     3x83        3x64         3x64
mix3     55,2x30     400,2x25     310,2x25
What we are seeing here is that:
a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
b) Reverting your patch hurts single stream
c) dual/triple stream are not affected by your patch and are improved over 2.6.19
d) the mix3 performance is improved compared to 2.6.19.
d1) reverting your patch hurts the local-disk part of mix3
e) the AACRAID setup is definitely faster than the CCISS.
So, on this box your patch is definitely needed to get the pre-2.6.24 performance
when writing a single big file.
Actually things on the CCISS box might be even more complicated. I forgot the fact
that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?
Anyway: does your patch only address this performance issue, or are there also
data integrity concerns without it? I may consider reverting the patch for my
production environment. It really helps two thirds of my boxes big time, while it does
not hurt the other third that much :-)
> >
> > I can certainly stress the box before doing the tests. Please
> > define "many" for the kernel compiles :-)
> >
>
> With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> all simultaneously. Running that for 20-30 minutes should be enough to
> randomise the freelists affecting what color of page is used for the
> dd test.
>
ouch :-) OK, I will try that.
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 8:19 Martin Knoblauch
@ 2008-01-18 16:01 ` Mel Gorman
2008-01-18 17:46 ` Linus Torvalds
0 siblings, 1 reply; 41+ messages in thread
From: Mel Gorman @ 2008-01-18 16:01 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, Linus Torvalds,
James.Bottomley
On (18/01/08 00:19), Martin Knoblauch didst pronounce:
> > > The effect is definitely depending on the IO hardware. I performed
> > > the same tests on a different box with an AACRAID controller and
> > > there things look different.
> >
> > I take it different also means it does not show this odd performance
> > behaviour and is similar whether the patch is applied or not?
> >
>
> Here are the numbers (MB/s) from the AACRAID box, after a fresh boot:
>
> Test 2.6.19.2 2.6.24-rc6 2.6.24-rc6-81eabcbe0b991ddef5216f30ae91c4b226d54b6d
> dd1 325 350 290
> dd1-dir 180 160 160
> dd2 2x 90 2x113 2x110
> dd2-dir 2x120 2x 92 2x 93
> dd3 3x 54 3x 70 3x 70
> dd3-dir 3x 83 3x 64 3x 64
> mix3 55,2x 30 400,2x 25 310,2x 25
>
> What we are seeing here is that:
>
> a) DIRECT IO takes a much bigger hit (2.6.19 vs. 2.6.24) on this IO system compared to the CCISS box
> b) Reverting your patch hurts single stream
Right, and this is consistent with other complaints about the PFN of the
page mattering to some hardware.
> c) dual/triple stream are not affected by your patch and are improved over 2.6.19
I am not very surprised. The callers to the page allocator are probably
making no special effort to get a batch of pages in PFN-order. They are just
assuming that subsequent calls give contiguous pages. With two or more
threads involved, there will not be a correlation between physical pages
and what is on disk any more.
> d) the mix3 performance is improved compared to 2.6.19.
> d1) reverting your patch hurts the local-disk part of mix3
> e) the AACRAID setup is definitely faster than the CCISS.
>
> So, on this box your patch is definitely needed to get the pre-2.6.24 performance
> when writing a single big file.
>
> Actually things on the CCISS box might be even more complicated. I forgot the fact
> that on that box we have ext2/LVM/DM/Hardware, while on the AACRAID box we have
> ext2/Hardware. Do you think that the LVM/MD are sensitive to the page order/coloring?
>
I don't have enough experience with LVM setups to make an intelligent
guess.
> Anyway: does your patch only address this performance issue, or are there also
> data integrity concerns without it?
Performance issue only. There are no data integrity concerns with that
patch.
> I may consider reverting the patch for my
> production environment. It really helps two thirds of my boxes big time, while it does
> not hurt the other third that much :-)
>
That is certainly an option.
> > >
> > > I can certainly stress the box before doing the tests. Please
> > > define "many" for the kernel compiles :-)
> > >
> >
> > With 8GiB of RAM, try making 24 copies of the kernel and compiling them
> > all simultaneously. Running that for for 20-30 minutes should be enough
> >
> to randomise the freelists affecting what color of page is used for the
> > dd test.
> >
>
> ouch :-) OK, I will try that.
>
Thanks.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 16:01 ` Mel Gorman
@ 2008-01-18 17:46 ` Linus Torvalds
2008-01-18 19:01 ` Martin Knoblauch
2008-01-18 20:00 ` Mike Snitzer
0 siblings, 2 replies; 41+ messages in thread
From: Linus Torvalds @ 2008-01-18 17:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Martin Knoblauch, Fengguang Wu, Mike Snitzer, Peter Zijlstra,
jplatte, Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
James.Bottomley
On Fri, 18 Jan 2008, Mel Gorman wrote:
>
> Right, and this is consistent with other complaints about the PFN of the
> page mattering to some hardware.
I don't think it's actually the PFN per se.
I think it's simply that some controllers (quite probably affected by both
driver and hardware limits) have some subtle interactions with the size of
the IO commands.
For example, let's say that you have a controller that has some limit X on
the size of IO in flight (whether due to hardware or driver issues doesn't
really matter) in addition to a limit on the size of the scatter-gather
size. They all tend to have limits, and they differ.
Now, the PFN doesn't matter per se, but the allocation pattern definitely
matters for whether the IO's are physically contiguous, and thus matters
for the size of the scatter-gather thing.
Now, generally the rule-of-thumb is that you want big commands, so
physical merging is good for you, but I could well imagine that the IO
limits interact, and end up hurting each other. Let's say that a better
allocation order allows for bigger contiguous physical areas, and thus
fewer scatter-gather entries.
What does that result in? The obvious answer is
"Better performance obviously, because the controller needs to do fewer
scatter-gather lookups, and the requests are bigger, because there are
fewer IO's that hit scatter-gather limits!"
Agreed?
Except maybe the *real* answer for some controllers end up being
"Worse performance, because individual commands grow because they don't
hit the per-command limits, but now we hit the global size-in-flight
limits and have many fewer of these good commands in flight. And while
the commands are larger, it means that there are fewer outstanding
commands, which can mean that the disk cannot schedule things
as well, or makes high latency of command generation by the controller
much more visible because there aren't enough concurrent requests
queued up to hide it"
Is this the reason? I have no idea. But somebody who knows the AACRAID
hardware and driver limits might think about interactions like that.
Sometimes you actually might want to have smaller individual commands if
there is some other limit that means that it can be more advantageous to
have many small requests over a few big ones.
RAID might well make it worse. Maybe small requests work better because
they are simpler to schedule because they only hit one disk (eg if you
have simple striping)! So that's another reason why one *large* request
may actually be slower than two requests half the size, even if it's
against the "normal rule".
And it may be that that AACRAID box takes a big hit on DIO exactly because
DIO has been optimized almost purely for making one command as big as
possible.
Just a theory.
Linus
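Linus's size-in-flight point can be made concrete with back-of-the-envelope arithmetic. A quick sketch (the 4 MiB cap is a made-up number for illustration, not a real aacraid or cciss limit):

```shell
#!/bin/sh
# Illustrate the trade-off: under a fixed size-in-flight cap, bigger
# commands mean fewer of them outstanding. The 4 MiB cap is hypothetical.
limit=$((4 * 1024 * 1024))        # assumed global in-flight byte limit

for cmd_kb in 64 256 1024; do
    cmd=$((cmd_kb * 1024))
    echo "command size ${cmd_kb} KiB -> at most $((limit / cmd)) in flight"
done
```

So merging that grows commands from 64 KiB to 1 MiB drops the possible queue depth from 64 to 4 under this (assumed) cap, which is exactly the "fewer good commands in flight" scenario above.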
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 17:46 ` Linus Torvalds
@ 2008-01-18 19:01 ` Martin Knoblauch
2008-01-18 19:23 ` Linus Torvalds
2008-01-22 14:39 ` Alasdair G Kergon
2008-01-18 20:00 ` Mike Snitzer
1 sibling, 2 replies; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-18 19:01 UTC (permalink / raw)
To: Linus Torvalds, Mel Gorman
Cc: Martin Knoblauch, Fengguang Wu, Mike Snitzer, Peter Zijlstra,
jplatte, Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
James.Bottomley
--- Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN
> > of the page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by
> both driver and hardware limits) have some subtle interactions with
> the size of the IO commands.
>
> [...]
just to make one thing clear - I am not so much concerned about the
performance of AACRAID. It is OK with or without Mel's patch. It is
better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
completely independent of Mel's stuff.
What interests me much more is the behaviour of the CCISS+LVM based
system. Here I see a huge benefit of reverting Mel's patch.
I dirtied the system after reboot as Mel suggested (24 parallel kernel
build) and repeated the tests. The dirtying did not make any
difference. Here are the results:
Test      -rc8      -rc8 without Mel's patch
dd1       57        94
dd1-dir   87        86
dd2       2x8.5     2x45
dd2-dir   2x43      2x43
dd3       3x7       3x30
dd3-dir   3x28.5    3x28.5
mix3      59,2x25   98,2x24
The big IO size with Mel's patch really has a devastating effect on
the parallel writes: nowhere near the values one would expect, while the
numbers are perfect without Mel's patch, as in rc1-rc5. Too bad I did not
see this earlier. Maybe we could have found a solution for .24.
At least, rc1-rc5 have shown that the CCISS system can do well. Now
the question is which part of the system does not cope well with the
larger IO sizes? Is it the CCISS controller, LVM, or both? I am open to
suggestions on how to debug that.
Cheers
Martin
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 19:01 ` Martin Knoblauch
@ 2008-01-18 19:23 ` Linus Torvalds
2008-01-22 14:39 ` Alasdair G Kergon
1 sibling, 0 replies; 41+ messages in thread
From: Linus Torvalds @ 2008-01-18 19:23 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Mel Gorman, Fengguang Wu, Mike Snitzer, Peter Zijlstra, jplatte,
Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
James.Bottomley
On Fri, 18 Jan 2008, Martin Knoblauch wrote:
>
> just to make one thing clear - I am not so much concerned about the
> performance of AACRAID. It is OK with or without Mel's patch. It is
> better with Mel's patch. The regression in DIO compared to 2.6.19.2 is
> completely independent of Mel's stuff.
>
> What interests me much more is the behaviour of the CCISS+LVM based
> system. Here I see a huge benefit of reverting Mel's patch.
Ok, I just got your usage cases confused.
The argument stays the same: some controllers/drivers may have subtle
behavioural differences that come from the IO limits themselves.
So it wasn't AACRAID, it was CCISS+LVM. The argument is the same: it may
well be that the *bigger* IO sizes are actually what hurts, even if the
conventional wisdom is traditionally that bigger submissions are better.
> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> suggestions on how to debug that.
I think you need to ask the MD/DM people for suggestions... who aren't
cc'd here.
Linus
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 17:46 ` Linus Torvalds
2008-01-18 19:01 ` Martin Knoblauch
@ 2008-01-18 20:00 ` Mike Snitzer
2008-01-18 22:47 ` Mike Snitzer
1 sibling, 1 reply; 41+ messages in thread
From: Mike Snitzer @ 2008-01-18 20:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Martin Knoblauch, Fengguang Wu, Peter Zijlstra,
jplatte, Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
James.Bottomley
On Jan 18, 2008 12:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>
> On Fri, 18 Jan 2008, Mel Gorman wrote:
> >
> > Right, and this is consistent with other complaints about the PFN of the
> > page mattering to some hardware.
>
> I don't think it's actually the PFN per se.
>
> I think it's simply that some controllers (quite probably affected by both
> driver and hardware limits) have some subtle interactions with the size of
> the IO commands.
>
> [...]
>
> Just a theory.
Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
configured with 5 LUNS (each 2disk HW RAID0, 1024k stripesz). That
is, with dd the avgrqsiz (from iostat) shows DIO to be ~130k whereas
non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)
DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k oflag=direct
non-DIO cmdline: dd if=/dev/zero of=/dev/sdX bs=8192k count=1k
DIO is ~80MB/s on all 5 LUNs for a total of ~400MB/s
non-DIO is only ~12MB/s on all 5 LUNs for a mere ~70MB/s aggregate
(deadline w/ nr_requests=32)
Calls into question the theory of small requests being beneficial for
AACRAID. Martin, what are you seeing for the avg request size when
you're conducting your AACRAID tests?
I can fire up 2.6.24-rc8 in short order to see if things are vastly
improved (as Martin seems to indicate that he is happy with AACRAID on
2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
still quite good (relative to mine). Martin can you share any tuning
you may have done to get AACRAID to where it is for you right now?
regards,
Mike
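For reference, iostat's avgrq-sz column is in 512-byte sectors, so Mike's ~130k figure corresponds to avgrq-sz of about 260. A conversion sketch (the sample line is made up; treating field 8 as avgrq-sz assumes the 2008-era `iostat -x` column layout):

```shell
#!/bin/sh
# Convert iostat -x avgrq-sz (512-byte sectors) into KB. The sample line
# is illustrative; field 8 as avgrq-sz assumes the old sysstat layout:
# Device rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
line="sda 0.00 120.00 0.00 500.00 0.00 130000.00 260.00 4.00 8.00 2.00 99.00"
echo "$line" | awk '{ printf "avgrq-sz = %s sectors = %.0f KB\n", $8, $8 / 2 }'
```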
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 20:00 ` Mike Snitzer
@ 2008-01-18 22:47 ` Mike Snitzer
0 siblings, 0 replies; 41+ messages in thread
From: Mike Snitzer @ 2008-01-18 22:47 UTC (permalink / raw)
To: Linus Torvalds
Cc: Mel Gorman, Martin Knoblauch, Fengguang Wu, Peter Zijlstra,
jplatte, Ingo Molnar, linux-kernel, linux-ext4@vger.kernel.org,
James.Bottomley
On Jan 18, 2008 3:00 PM, Mike Snitzer <snitzer@gmail.com> wrote:
>
> On Jan 18, 2008 12:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > [...]
>
> Oddly enough, I'm seeing the opposite here with 2.6.22.16 w/ AACRAID
> configured with 5 LUNS (each 2disk HW RAID0, 1024k stripesz). That
> is, with dd the avgrqsiz (from iostat) shows DIO to be ~130k whereas
> non-DIO is a mere ~13k! (NOTE: with aacraid, max_hw_sectors_kb=192)
...
> I can fire up 2.6.24-rc8 in short order to see if things are vastly
> improved (as Martin seems to indicate that he is happy with AACRAID on
> 2.6.24-rc8). Although even Martin's AACRAID numbers from 2.6.19.2 are
> still quite good (relative to mine). Martin can you share any tuning
> you may have done to get AACRAID to where it is for you right now?
I can confirm 2.6.24-rc8 behaves like Martin has posted for the
AACRAID. Slower DIO with smaller avgreqsiz. Much faster buffered IO
(for my config anyway) with a much larger avgreqsiz (180K).
I have no idea why 2.6.22.16's request size on non-DIO is _so_ small...
Mike
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-19 10:24 Martin Knoblauch
0 siblings, 0 replies; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-19 10:24 UTC (permalink / raw)
To: Mike Snitzer, Linus Torvalds
Cc: Mel Gorman, Fengguang Wu, Peter Zijlstra, jplatte, Ingo Molnar,
linux-kernel, linux-ext4@vger.kernel.org, James.Bottomley
----- Original Message ----
> From: Mike Snitzer <snitzer@gmail.com>
> To: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Mel Gorman <mel@csn.ul.ie>; Martin Knoblauch <spamtrap@knobisoft.de>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com
> Sent: Friday, January 18, 2008 11:47:02 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> > I can fire up 2.6.24-rc8 in short order to see if things are vastly
> > improved (as Martin seems to indicate that he is happy with
> > AACRAID on 2.6.24-rc8). Although even Martin's AACRAID
> > numbers from 2.6.19.2 are still quite good (relative to mine).
> > Martin can you share any tuning you may have done to get AACRAID
> > to where it is for you right now?
Mike,
I have always been happy with the AACRAID box compared to the CCISS system. Even with the "regression" in 2.6.24-rc1..rc5 it was more than acceptable to me. For me the differences between 2.6.19 and 2.6.24-rc8 on the AACRAID setup are:
- 11% (single stream) to 25% (dual/triple stream) regression in DIO. Something I do not care much about. I just measure it for reference.
+ the very nice behaviour when writing to different targets (mix3), which I attribute to Peter's per-bdi stuff.
And until -rc6 I was extremely pleased with the cool speedup I saw on my CCISS boxes. This would have been the next "production" kernel for me. But let's discuss this under a separate topic. It has nothing to do with the original wait-io issue.
Oh, before I forget: there has been no tuning for the AACRAID. The system is an IBM x3650 with built-in AACRAID and battery-backed write cache. The disks are 6x142GB/15krpm in a RAID5 setup. I see one big difference between your and my tests: I do 1MB writes to simulate the behaviour of the real applications, while yours seem to be much smaller.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-18 19:01 ` Martin Knoblauch
2008-01-18 19:23 ` Linus Torvalds
@ 2008-01-22 14:39 ` Alasdair G Kergon
1 sibling, 0 replies; 41+ messages in thread
From: Alasdair G Kergon @ 2008-01-22 14:39 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, James.Bottomley, Jens Axboe,
Milan Broz, Neil Brown
On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> At least, rc1-rc5 have shown that the CCISS system can do well. Now
> the question is which part of the system does not cope well with the
> larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> suggestions on how to debug that.
What is your LVM device configuration?
E.g. 'dmsetup table' and 'dmsetup info -c' output.
Some configurations lead to large IOs getting split up on the way through
device-mapper.
See if these patches make any difference:
http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
dm-introduce-merge_bvec_fn.patch
dm-linear-add-merge.patch
dm-table-remove-merge_bvec-sector-restriction.patch
Alasdair
--
agk@redhat.com
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-22 15:25 Martin Knoblauch
2008-01-22 23:40 ` Alasdair G Kergon
0 siblings, 1 reply; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-22 15:25 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, James.Bottomley, Jens Axboe,
Milan Broz, Neil Brown
----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Fri, Jan 18, 2008 at 11:01:11AM -0800, Martin Knoblauch wrote:
> > At least, rc1-rc5 have shown that the CCISS system can do well. Now
> > the question is which part of the system does not cope well with the
> > larger IO sizes? Is it the CCISS controller, LVM or both. I am open to
> > suggestions on how to debug that.
>
> What is your LVM device configuration?
> E.g. 'dmsetup table' and 'dmsetup info -c' output.
> Some configurations lead to large IOs getting split up on the way
> through device-mapper.
>
Hi Alasdair,
here is the output, the filesystem in question is on LogVol02:
[root@lpsdm52 ~]# dmsetup table
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
[root@lpsdm52 ~]# dmsetup info -c
Name Maj Min Stat Open Targ Event UUID
VolGroup00-LogVol02 253 1 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmOZ4OzOgGQIdF3qDx6fJmlZukXXLIy39R
VolGroup00-LogVol01 253 2 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4Ogmfn2CcAd2Fh7i48twe8PZc2XK5bSOe1Fq
VolGroup00-LogVol00 253 0 L--w 1 1 0 LVM-IV4PeE8cdxA3piC1qk79GY9PE9OC4OgmfYjxQKFP3zw2fGsezJN7ypSrfmP7oSvE
> See if these patches make any difference:
>
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>
thanks for the suggestion. Are they supposed to apply to mainline?
Cheers
Martin
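For reference, each `linear` line in the `dmsetup table` output above reads `<name>: <start> <length> linear <major:minor> <offset>`, all in 512-byte sectors, so the volumes decode like this:

```shell
#!/bin/sh
# Decode the 'dmsetup table' output quoted above. Every field is in
# 512-byte sectors; 2097152 sectors = 1 GiB.
while read -r name start len type dev offset; do
    gib=$(awk -v s="$len" 'BEGIN { printf "%.1f", s / 2097152 }')
    echo "$name $len sectors ($gib GiB) -> $dev at sector offset $offset"
done <<'EOF'
VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
VolGroup00-LogVol00: 0 67108864 linear 104:2 384
EOF
```

That is a ~167 GiB, a 4 GiB, and a 32 GiB volume, each a single contiguous linear target on 104:2 — the simple case where device-mapper only offsets the sector number and should not split requests.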
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-22 18:51 Martin Knoblauch
0 siblings, 0 replies; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-22 18:51 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, James.Bottomley, Jens Axboe,
Milan Broz, Neil Brown
----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Tuesday, January 22, 2008 3:39:33 PM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
>
> See if these patches make any difference:
>
> http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/
>
> dm-md-merge_bvec_fn-with-separate-bdev-and-sector.patch
> dm-introduce-merge_bvec_fn.patch
> dm-linear-add-merge.patch
> dm-table-remove-merge_bvec-sector-restriction.patch
>
Nope. Exactly the same poor results. To rule out LVM/DM I really have to see what happens if I set up a system with filesystems directly on partitions. Might take some time though.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
2008-01-22 15:25 Martin Knoblauch
@ 2008-01-22 23:40 ` Alasdair G Kergon
0 siblings, 0 replies; 41+ messages in thread
From: Alasdair G Kergon @ 2008-01-22 23:40 UTC (permalink / raw)
To: Martin Knoblauch
Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, James.Bottomley, Jens Axboe,
Milan Broz, Neil Brown
On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> [root@lpsdm52 ~]# dmsetup table
> VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> VolGroup00-LogVol00: 0 67108864 linear 104:2 384
The IO should pass straight through simple linear targets like that without
needing to get broken up, so I wouldn't expect those patches to make any
difference in this particular case.
Alasdair
--
agk@redhat.com
^ permalink raw reply [flat|nested] 41+ messages in thread
* Re: regression: 100% io-wait with 2.6.24-rcX
@ 2008-01-23 11:12 Martin Knoblauch
0 siblings, 0 replies; 41+ messages in thread
From: Martin Knoblauch @ 2008-01-23 11:12 UTC (permalink / raw)
To: Alasdair G Kergon
Cc: Linus Torvalds, Mel Gorman, Fengguang Wu, Mike Snitzer,
Peter Zijlstra, jplatte, Ingo Molnar, linux-kernel,
linux-ext4@vger.kernel.org, James.Bottomley, Jens Axboe,
Milan Broz, Neil Brown
----- Original Message ----
> From: Alasdair G Kergon <agk@redhat.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>; Mel Gorman <mel@csn.ul.ie>; Fengguang Wu <wfg@mail.ustc.edu.cn>; Mike Snitzer <snitzer@gmail.com>; Peter Zijlstra <peterz@infradead.org>; jplatte@naasa.net; Ingo Molnar <mingo@elte.hu>; linux-kernel@vger.kernel.org; "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>; James.Bottomley@steeleye.com; Jens Axboe <jens.axboe@oracle.com>; Milan Broz <mbroz@redhat.com>; Neil Brown <neilb@suse.de>
> Sent: Wednesday, January 23, 2008 12:40:52 AM
> Subject: Re: regression: 100% io-wait with 2.6.24-rcX
>
> On Tue, Jan 22, 2008 at 07:25:15AM -0800, Martin Knoblauch wrote:
> > [root@lpsdm52 ~]# dmsetup table
> > VolGroup00-LogVol02: 0 350945280 linear 104:2 67109248
> > VolGroup00-LogVol01: 0 8388608 linear 104:2 418054528
> > VolGroup00-LogVol00: 0 67108864 linear 104:2 384
>
> The IO should pass straight through simple linear targets like
> that without needing to get broken up, so I wouldn't expect those patches to
> make any difference in this particular case.
>
Alasdair,
LVM/DM are off the hook :-) I converted one box to use partitions directly and the performance is the same disappointment as with LVM/DM. Thanks anyway for looking at my problem.
I will move the discussion now to a new thread, targeting CCISS directly.
Cheers
Martin
^ permalink raw reply [flat|nested] 41+ messages in thread
end of thread, other threads:[~2008-01-23 11:12 UTC | newest]
Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-17 17:44 regression: 100% io-wait with 2.6.24-rcX Martin Knoblauch
2008-01-17 20:23 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2008-01-23 11:12 Martin Knoblauch
2008-01-22 18:51 Martin Knoblauch
2008-01-22 15:25 Martin Knoblauch
2008-01-22 23:40 ` Alasdair G Kergon
2008-01-19 10:24 Martin Knoblauch
2008-01-18 8:19 Martin Knoblauch
2008-01-18 16:01 ` Mel Gorman
2008-01-18 17:46 ` Linus Torvalds
2008-01-18 19:01 ` Martin Knoblauch
2008-01-18 19:23 ` Linus Torvalds
2008-01-22 14:39 ` Alasdair G Kergon
2008-01-18 20:00 ` Mike Snitzer
2008-01-18 22:47 ` Mike Snitzer
2008-01-17 21:50 Martin Knoblauch
2008-01-17 22:12 ` Mel Gorman
2008-01-17 17:51 Martin Knoblauch
2008-01-17 13:52 Martin Knoblauch
2008-01-17 16:11 ` Mike Snitzer
2008-01-16 14:15 Martin Knoblauch
2008-01-16 16:27 ` Mike Snitzer
2008-01-16 9:26 Martin Knoblauch
[not found] ` <E1JF6w8-0000vs-HM@localhost.localdomain>
2008-01-16 12:00 ` Fengguang Wu
2008-01-16 12:00 ` Fengguang Wu
[not found] <200801071151.11200.lists@naasa.net>
[not found] ` <200801130905.44855.jplatte@naasa.net>
[not found] ` <400212488.11031@ustc.edu.cn>
[not found] ` <200801131049.33111.jplatte@naasa.net>
[not found] ` <E1JE1Uz-0002w5-6z@localhost.localdomain>
2008-01-13 11:59 ` Fengguang Wu
2008-01-13 11:59 ` Fengguang Wu
[not found] ` <20080113115933.GA11045@mail.ustc.edu.cn>
[not found] ` <E1JEGPH-0001uw-Df@localhost.localdomain>
2008-01-14 3:54 ` Fengguang Wu
2008-01-14 3:54 ` Fengguang Wu
[not found] ` <20080114035439.GA7330@mail.ustc.edu.cn>
[not found] ` <E1JEM2I-00010S-5U@localhost.localdomain>
2008-01-14 9:55 ` Fengguang Wu
2008-01-14 9:55 ` Fengguang Wu
2008-01-14 11:30 ` Joerg Platte
2008-01-14 11:41 ` Peter Zijlstra
[not found] ` <E1JEOmD-0001Ap-U7@localhost.localdomain>
2008-01-14 12:50 ` Fengguang Wu
2008-01-15 21:13 ` Mike Snitzer
[not found] ` <E1JF0m1-000101-OK@localhost.localdomain>
2008-01-16 5:25 ` Fengguang Wu
2008-01-16 5:25 ` Fengguang Wu
2008-01-14 12:50 ` Fengguang Wu
2008-01-15 21:42 ` Ingo Molnar
[not found] ` <E1JF0bJ-0000zU-FG@localhost.localdomain>
2008-01-16 5:14 ` Fengguang Wu
2008-01-16 5:14 ` Fengguang Wu
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox