public inbox for linux-kernel@vger.kernel.org
* iozone write 50% regression in kernel 2.6.24-rc1
@ 2007-11-09  9:47 Zhang, Yanmin
  2007-11-09  9:54 ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-09  9:47 UTC (permalink / raw)
  To: a.p.zijlstra; +Cc: LKML

Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.

My machine has 8 processor cores and 8GB memory.

By bisect, I located patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.


Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
following runs got about 4x the first result.

What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.

I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.

-yanmin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-09  9:47 iozone write 50% regression in kernel 2.6.24-rc1 Zhang, Yanmin
@ 2007-11-09  9:54 ` Peter Zijlstra
  2007-11-12  2:14   ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2007-11-09  9:54 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML


On Fri, 2007-11-09 at 17:47 +0800, Zhang, Yanmin wrote:
> Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> 
> My machine has 8 processor cores and 8GB memory.
> 
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> 
> 
> Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> following runs got about 4x the first result.

So the second run is 4x as fast as the first run?

> What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.

So the 2nd and 3rd run are stable at 50% slower than .23?

> I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.

Could you try:

---
Subject: mm: speed up writeback ramp-up on clean systems

We allow violation of bdi limits if there is a lot of room on the
system. Once we hit half the total limit we start enforcing bdi limits
and bdi ramp-up should happen. Doing it this way avoids many small
writeouts on an otherwise idle system and should also speed up the
ramp-up.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Fengguang Wu <wfg@mail.ustc.edu.cn> 
---
 mm/page-writeback.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-09-28 10:08:33.937415368 +0200
+++ linux-2.6/mm/page-writeback.c	2007-09-28 10:54:26.018247516 +0200
@@ -355,8 +355,8 @@ get_dirty_limits(long *pbackground, long
  */
 static void balance_dirty_pages(struct address_space *mapping)
 {
-	long bdi_nr_reclaimable;
-	long bdi_nr_writeback;
+	long nr_reclaimable, bdi_nr_reclaimable;
+	long nr_writeback, bdi_nr_writeback;
 	long background_thresh;
 	long dirty_thresh;
 	long bdi_thresh;
@@ -376,11 +376,26 @@ static void balance_dirty_pages(struct a
 
 		get_dirty_limits(&background_thresh, &dirty_thresh,
 				&bdi_thresh, bdi);
+
+		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
+					global_page_state(NR_UNSTABLE_NFS);
+		nr_writeback = global_page_state(NR_WRITEBACK);
+
 		bdi_nr_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
 		bdi_nr_writeback = bdi_stat(bdi, BDI_WRITEBACK);
+
 		if (bdi_nr_reclaimable + bdi_nr_writeback <= bdi_thresh)
 			break;
 
+		/*
+		 * Throttle it only when the background writeback cannot
+		 * catch-up. This avoids (excessively) small writeouts
+		 * when the bdi limits are ramping up.
+		 */
+		if (nr_reclaimable + nr_writeback <
+				(background_thresh + dirty_thresh) / 2)
+			break;
+
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 




* Re: iozone write 50% regression in kernel 2.6.24-rc1
@ 2007-11-09 12:36 Martin Knoblauch
  2007-11-12  0:45 ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Knoblauch @ 2007-11-09 12:36 UTC (permalink / raw)
  To: Zhang, Yanmin, a.p.zijlstra; +Cc: LKML

----- Original Message ----
> From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> To: a.p.zijlstra@chello.nl
> Cc: LKML <linux-kernel@vger.kernel.org>
> Sent: Friday, November 9, 2007 10:47:52 AM
> Subject: iozone write 50% regression in kernel 2.6.24-rc1
> 
> Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> 
> My machine has 8 processor cores and 8GB memory.
> 
> By bisect, I located patch
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> 
> Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> following runs got about 4x the first result.
> 
> What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.
> 
> I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.
> 
> -yanmin
Hi Yanmin,

Could you tell us the exact iozone command you are using? I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2: the speed there is much better than in 2.6.22 and before (I skipped 2.6.23 because I was waiting for the per-bdi changes). I definitely do not see a difference between the 1st and subsequent runs. But then, I do my tests with 5GB file sizes, like:

iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1

Kind regards
Martin





* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-09 12:36 Martin Knoblauch
@ 2007-11-12  0:45 ` Zhang, Yanmin
  0 siblings, 0 replies; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-12  0:45 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: a.p.zijlstra, LKML

On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> ----- Original Message ----
> > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > To: a.p.zijlstra@chello.nl
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Sent: Friday, November 9, 2007 10:47:52 AM
> > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > 
> > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > 
> > My machine has 8 processor cores and 8GB memory.
> > 
> > By bisect, I located patch
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > 
> > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> > following runs got about 4x the first result.
> > 
> > What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.
> > 
> > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.
>  could you tell us the exact iozone command you are using?
iozone -i 0 -r 4k -s 512m


> I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2: the speed there is much better than in 2.6.22 and before (I skipped 2.6.23 because I was waiting for the per-bdi changes). I definitely do not see a difference between the 1st and subsequent runs. But then, I do my tests with 5GB file sizes, like:
> 
> iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
My machine uses a SATA (AHCI) disk.

-yanmin


* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-09  9:54 ` Peter Zijlstra
@ 2007-11-12  2:14   ` Zhang, Yanmin
  2007-11-12  9:45     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-12  2:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML

On Fri, 2007-11-09 at 10:54 +0100, Peter Zijlstra wrote:
> On Fri, 2007-11-09 at 17:47 +0800, Zhang, Yanmin wrote:
> > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > 
> > My machine has 8 processor cores and 8GB memory.
> > 
> > By bisect, I located patch
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > 
> > 
> > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> > following runs got about 4x the first result.
> 
> So the second run is 4x as fast as the first run?
Please see the comments below.

> 
> > What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.
> 
> So the 2nd and 3rd run are stable at 50% slower than .23?
Almost. I did more testing today; please see the result list below.

> 
> > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.
> 
> Could you try:
> 
> ---
> Subject: mm: speed up writeback ramp-up on clean systems
I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).

1) Comparison of first/second/subsequent runs:
2.6.23: the second run of iozone gets about a 28% improvement over the first run.
	Subsequent runs are very stable, like the 2nd run.
2.6.24-rc2: the second run of iozone gets about a 170% improvement over the first run. The 3rd run
	gets about an 80% improvement over the 2nd. Subsequent runs are very stable, like the 3rd run.
2.6.24-rc2_peter: the second run of iozone gets about a 14% improvement over the first run. Subsequent
	runs are mostly stable, like the 2nd run.
So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
has a 330% improvement on the first run.

2) Comparison among kernels (based on the stable highest result):
2.6.24-rc2 has about a 50% regression versus 2.6.23.
2.6.24-rc2_peter has the same result as 2.6.24-rc2.
From this point of view, the above patch has no improvement. :)

-yanmin


* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12  2:14   ` Zhang, Yanmin
@ 2007-11-12  9:45     ` Peter Zijlstra
  2007-11-12  9:51       ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2007-11-12  9:45 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML



On Mon, 2007-11-12 at 10:14 +0800, Zhang, Yanmin wrote:

> > Subject: mm: speed up writeback ramp-up on clean systems
>
> I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).
> 
> 1) Comparison of first/second/subsequent runs:
> 2.6.23: the second run of iozone gets about a 28% improvement over the first run.
> 	Subsequent runs are very stable, like the 2nd run.
> 2.6.24-rc2: the second run of iozone gets about a 170% improvement over the first run. The 3rd run
> 	gets about an 80% improvement over the 2nd. Subsequent runs are very stable, like the 3rd run.
> 2.6.24-rc2_peter: the second run of iozone gets about a 14% improvement over the first run. Subsequent
> 	runs are mostly stable, like the 2nd run.
> So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
> has a 330% improvement on the first run.
> 
> 2) Comparison among kernels (based on the stable highest result):
> 2.6.24-rc2 has about a 50% regression versus 2.6.23.
> 2.6.24-rc2_peter has the same result as 2.6.24-rc2.
> 
> From this point of view, the above patch has no improvement. :)

Drat. Still, good test results.

Could you describe your system in detail? That is, you have 8GB of memory
and 8 CPUs (2x quad-core?). How many disks does it have, and are they
aggregated using md or dm? What filesystem do you use?





* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12  9:45     ` Peter Zijlstra
@ 2007-11-12  9:51       ` Zhang, Yanmin
  2007-11-12 13:26         ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-12  9:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML

On Mon, 2007-11-12 at 10:45 +0100, Peter Zijlstra wrote:
> On Mon, 2007-11-12 at 10:14 +0800, Zhang, Yanmin wrote:
> 
> > > Subject: mm: speed up writeback ramp-up on clean systems
> >
> > I tested kernels 2.6.23, 2.6.24-rc2, and 2.6.24-rc2_peter (2.6.24-rc2 plus this patch).
> > 
> > 1) Comparison of first/second/subsequent runs:
> > 2.6.23: the second run of iozone gets about a 28% improvement over the first run.
> > 	Subsequent runs are very stable, like the 2nd run.
> > 2.6.24-rc2: the second run of iozone gets about a 170% improvement over the first run. The 3rd run
> > 	gets about an 80% improvement over the 2nd. Subsequent runs are very stable, like the 3rd run.
> > 2.6.24-rc2_peter: the second run of iozone gets about a 14% improvement over the first run. Subsequent
> > 	runs are mostly stable, like the 2nd run.
> > So the new patch really improves the first-run result. Compared with 2.6.24-rc2, 2.6.24-rc2_peter
> > has a 330% improvement on the first run.
> > 
> > 2) Comparison among kernels (based on the stable highest result):
> > 2.6.24-rc2 has about a 50% regression versus 2.6.23.
> > 2.6.24-rc2_peter has the same result as 2.6.24-rc2.
> >
> > From this point of view, the above patch has no improvement. :)
> 
> Drat. Still, good test results.
> 
> Could you describe your system in detail? That is, you have 8GB of memory
> and 8 CPUs (2x quad-core?).
Yes.

>  How many disks does it have
One machine uses a single AHCI SATA disk; the other machines use hardware RAID0.

>  and are those
> aggregated using md or dm?
No.

>  What filesystem do you use?
Ext3.

I got the regression on a couple of my machines. Please try the command:
#iozone -i 0 -r 4k -s 512m

-yanmin


* Re: iozone write 50% regression in kernel 2.6.24-rc1
@ 2007-11-12 12:58 Martin Knoblauch
  2007-11-13  2:04 ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Knoblauch @ 2007-11-12 12:58 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: a.p.zijlstra, LKML

----- Original Message ----
> From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> To: Martin Knoblauch <knobi@knobisoft.de>
> Cc: a.p.zijlstra@chello.nl; LKML <linux-kernel@vger.kernel.org>
> Sent: Monday, November 12, 2007 1:45:57 AM
> Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
> 
> On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > ----- Original Message ----
> > > From: "Zhang, Yanmin" 
> > > To: a.p.zijlstra@chello.nl
> > > Cc: LKML 
> > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > > 
> > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > > 
> > > My machine has 8 processor cores and 8GB memory.
> > > 
> > > By bisect, I located patch
> > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > > 
> > > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > > the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> > > following runs got about 4x the first result.
> > > 
> > > What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.
> > > 
> > > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.
> > could you tell us the exact iozone command you are using?
> iozone -i 0 -r 4k -s 512m
> 

OK, I definitely do not see the reported effect. On an HP ProLiant with a RAID5 on CCISS I get:

2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite

The first run is always slowest; all subsequent runs are faster and equally fast.

> 
> > I would like to repeat it on my setup, because I definitely see the opposite behaviour in 2.6.24-rc1/rc2: the speed there is much better than in 2.6.22 and before (I skipped 2.6.23 because I was waiting for the per-bdi changes). I definitely do not see a difference between the 1st and subsequent runs. But then, I do my tests with 5GB file sizes, like:
> > 
> > iozone3_283/src/current/iozone -t 5 -F /scratch/X1 /scratch/X2 /scratch/X3 /scratch/X4 /scratch/X5 -s 5000M -r 1024 -c -e -i 0 -i 1
> My machine uses a SATA (AHCI) disk.
> 

4x 72GB SCSI disks forming a RAID5 on a CCISS controller with battery-backed write cache. The systems have 2 CPUs (64-bit) and 8GB memory. I could test on some IBM boxes (2x dual-core, 8GB) with RAID5 on "aacraid", but I need some time to free up one of the boxes.

Cheers
Martin





* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12  9:51       ` Zhang, Yanmin
@ 2007-11-12 13:26         ` Peter Zijlstra
       [not found]           ` <47386BC4.3050403@panasas.com>
  2007-11-12 17:25           ` Mark Lord
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2007-11-12 13:26 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: LKML


Single-socket, dual-core Opteron, 2GB memory
Single SATA disk, ext3

x86_64 kernel and userland

(dirty_background_ratio, dirty_ratio) tunables; the iozone columns below are file size (KB), record size (KB), and write/rewrite throughput (KB/s)

---- (5,10) - default

2.6.23.1-42.fc8 #1 SMP

          524288       4   59580   60356
          524288       4   59247   61101
          524288       4   61030   62831

2.6.24-rc2 #28 SMP PREEMPT

          524288       4   49277   56582
          524288       4   50728   61056
          524288       4   52027   59758
          524288       4   51520   62426


---- (20,40) - similar to your 8GB

2.6.23.1-42.fc8 #1 SMP

          524288       4  225977  447461
          524288       4  232595  496848
          524288       4  220608  478076
          524288       4  203080  445230

2.6.24-rc2 #28 SMP PREEMPT

          524288       4   54043   83585
          524288       4   69949  516253
          524288       4   72343  491416
          524288       4   71775  492653

---- (60,80) - overkill

2.6.23.1-42.fc8 #1 SMP

          524288       4  208450  491892
          524288       4  216262  481135
          524288       4  221892  543608
          524288       4  202209  574725
          524288       4  231730  452482

2.6.24-rc2 #28 SMP PREEMPT

          524288       4   49091   86471
          524288       4   65071  217566
          524288       4   72238  492172
          524288       4   71818  492433
          524288       4   71327  493954


While I see that the write speed reported under .24 (~70MB/s) is much
lower than the one reported under .23 (~200MB/s), I find it very hard to
believe my poor single SATA disk could actually sustain 200MB/s for
longer than its on-disk cache (8 or 16MB, I'm not sure).

vmstat shows that actual IO is done, even though the whole 512MB could
fit in cache, hence my suspicion that the ~70MB/s is the most realistic
of the two.

I'll have to look into what iozone actually does though and why this
patch makes the output different.

FWIW, because it's a single backing dev, it does get to 100% of the dirty
limit after a few runs, so I'm not sure what makes the difference.



* Re: iozone write 50% regression in kernel 2.6.24-rc1
       [not found]           ` <47386BC4.3050403@panasas.com>
@ 2007-11-12 16:48             ` Peter Zijlstra
  2007-11-13  2:19               ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2007-11-12 16:48 UTC (permalink / raw)
  To: Benny Halevy; +Cc: Zhang, Yanmin, LKML, Linus Torvalds, aneesh.kumar



On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Single socket, dual core opteron, 2GB memory
> > Single SATA disk, ext3
> > 
> > x86_64 kernel and userland
> > 
> > (dirty_background_ratio, dirty_ratio) tunables
> > 
> > ---- (5,10) - default
> > 
> > 2.6.23.1-42.fc8 #1 SMP
> > 
> >           524288       4   59580   60356
> >           524288       4   59247   61101
> >           524288       4   61030   62831
> > 
> > 2.6.24-rc2 #28 SMP PREEMPT
> > 
> >           524288       4   49277   56582
> >           524288       4   50728   61056
> >           524288       4   52027   59758
> >           524288       4   51520   62426
> > 
> > 
> > ---- (20,40) - similar to your 8GB
> > 
> > 2.6.23.1-42.fc8 #1 SMP
> > 
> >           524288       4  225977  447461
> >           524288       4  232595  496848
> >           524288       4  220608  478076
> >           524288       4  203080  445230
> > 
> > 2.6.24-rc2 #28 SMP PREEMPT
> > 
> >           524288       4   54043   83585
> >           524288       4   69949  516253
> >           524288       4   72343  491416
> >           524288       4   71775  492653
> > 
> > ---- (60,80) - overkill
> > 
> > 2.6.23.1-42.fc8 #1 SMP
> > 
> >           524288       4  208450  491892
> >           524288       4  216262  481135
> >           524288       4  221892  543608
> >           524288       4  202209  574725
> >           524288       4  231730  452482
> > 
> > 2.6.24-rc2 #28 SMP PREEMPT
> > 
> >           524288       4   49091   86471
> >           524288       4   65071  217566
> >           524288       4   72238  492172
> >           524288       4   71818  492433
> >           524288       4   71327  493954
> > 
> > 
> > While I see that the write speed as reported under .24 ~70MB/s is much
> > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > believe my poor single SATA disk could actually do the 200MB/s for
> > longer than its cache 8/16 MB (not sure).
> > 
> > vmstat shows that actual IO is done, even though the whole 512MB could
> > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > of the two.
> 
> Even 70 MB/s seems too high.  What throughput do you see for the
> raw disk partition?
> 
> Also, are the numbers above for successive runs?
> It seems like you're seeing some caching effects so
> I'd recommend using a file larger than your cache size and
> the -e and -c options (to include fsync and close in timings)
> to try to eliminate them.

------ iozone -i 0 -r 4k -s 512m -e -c

.23 (20,40)

          524288       4   31750   33560
          524288       4   29786   32114
          524288       4   29115   31476

.24 (20,40)

          524288       4   25022   32411
          524288       4   25375   31662
          524288       4   26407   33871


------ iozone -i 0 -r 4k -s 4g -e -c

.23 (20,40)

         4194304       4   39699   35550
         4194304       4   40225   36099


.24 (20,40)

         4194304       4   39961   41656
         4194304       4   39244   39673


Yanmin, for that benchmark you ran, what was it meant to measure?
From what I can make of it, it's just write-cache benchmarking.

One thing I don't understand is how the write numbers are so much lower
than the rewrite numbers. The iozone code (which gives me headaches,
damn what a mess) seems to suggest that the only thing that is different
is the lack of block allocation.

Linus posted a patch yesterday fixing a regression in the ext3 bitmap
block allocator; /me goes to apply that patch and rerun the tests.

> > ---- (20,40) - similar to your 8GB
> > 
> > 2.6.23.1-42.fc8 #1 SMP
> > 
> >           524288       4  225977  447461
> >           524288       4  232595  496848
> >           524288       4  220608  478076
> >           524288       4  203080  445230
> > 
> > 2.6.24-rc2 #28 SMP PREEMPT
> > 
> >           524288       4   54043   83585
> >           524288       4   69949  516253
> >           524288       4   72343  491416
> >           524288       4   71775  492653

2.6.24-rc2 +
        patches/wu-reiser.patch
        patches/writeback-early.patch
        patches/bdi-task-dirty.patch
        patches/bdi-sysfs.patch
        patches/sched-hrtick.patch
        patches/sched-rt-entity.patch
        patches/sched-watchdog.patch
        patches/linus-ext3-blockalloc.patch

          524288       4  179657  487676
          524288       4  173989  465682
          524288       4  175842  489800


Linus' patch is the one that makes the difference here. So I'm unsure
how you bisected it down to:

  04fbfdc14e5f48463820d6b9807daa5e9c92c51f

These results seem to point to

  7c9e69faa28027913ee059c285a5ea8382e24b5d

as being the offending patch.




* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12 13:26         ` Peter Zijlstra
       [not found]           ` <47386BC4.3050403@panasas.com>
@ 2007-11-12 17:25           ` Mark Lord
  2007-11-13  1:49             ` Zhang, Yanmin
  1 sibling, 1 reply; 16+ messages in thread
From: Mark Lord @ 2007-11-12 17:25 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Zhang, Yanmin, LKML

Peter Zijlstra wrote:
..
> While I see that the write speed as reported under .24 ~70MB/s is much
> lower than the one reported under .23 ~200MB/s, I find it very hard to
> believe my poor single SATA disk could actually do the 200MB/s for
> longer than its cache 8/16 MB (not sure).
> 
> vmstat shows that actual IO is done, even though the whole 512MB could
> fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> of the two.
..

Yeah, sequential 70MB/sec is quite realistic for a modern SATA drive.

But significantly faster than that (say, 100MB/sec +) is unlikely at present.


* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12 17:25           ` Mark Lord
@ 2007-11-13  1:49             ` Zhang, Yanmin
  0 siblings, 0 replies; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-13  1:49 UTC (permalink / raw)
  To: Mark Lord; +Cc: Peter Zijlstra, LKML

On Mon, 2007-11-12 at 12:25 -0500, Mark Lord wrote:
> Peter Zijlstra wrote:
> ..
> > While I see that the write speed as reported under .24 ~70MB/s is much
> > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > believe my poor single SATA disk could actually do the 200MB/s for
> > longer than its cache 8/16 MB (not sure).
> > 
> > vmstat shows that actual IO is done, even though the whole 512MB could
> > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > of the two.
> ..
> 
> Yeah, sequential 70MB/sec is quite realistic for a modern SATA drive.
> 
> But significantly faster than that (say, 100MB/sec +) is unlikely at present.
I just use the command '#iozone -i 0 -r 4k -s 512m', without '-e -c'. So with
caching in play the speed is very high: on my machine with 2.6.23, the write speed is
631MB/s, quite fast. :)


* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12 12:58 Martin Knoblauch
@ 2007-11-13  2:04 ` Zhang, Yanmin
  0 siblings, 0 replies; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-13  2:04 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: a.p.zijlstra, LKML

On Mon, 2007-11-12 at 04:58 -0800, Martin Knoblauch wrote:
> ----- Original Message ----
> > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > To: Martin Knoblauch <knobi@knobisoft.de>
> > Cc: a.p.zijlstra@chello.nl; LKML <linux-kernel@vger.kernel.org>
> > Sent: Monday, November 12, 2007 1:45:57 AM
> > Subject: Re: iozone write 50% regression in kernel 2.6.24-rc1
> > 
> > On Fri, 2007-11-09 at 04:36 -0800, Martin Knoblauch wrote:
> > > ----- Original Message ----
> > > > From: "Zhang, Yanmin" 
> > > > To: a.p.zijlstra@chello.nl
> > > > Cc: LKML 
> > > > Sent: Friday, November 9, 2007 10:47:52 AM
> > > > Subject: iozone write 50% regression in kernel 2.6.24-rc1
> > > > 
> > > > Comparing with 2.6.23, iozone sequential write/rewrite (512M) has 50% regression
> > > > in kernel 2.6.24-rc1. 2.6.24-rc2 has the same regression.
> > > > 
> > > > My machine has 8 processor cores and 8GB memory.
> > > > 
> > > > By bisect, I located patch
> > > > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=04fbfdc14e5f48463820d6b9807daa5e9c92c51f.
> > > > 
> > > > Another behavior: with kernel 2.6.23, if I run iozone many times after rebooting the machine,
> > > > the results look stable. But with 2.6.24-rc1, the first run of iozone got a very small result and
> > > > following runs got about 4x the first result.
> > > > 
> > > > What I reported is the regression of the 2nd/3rd runs, because the first run has a bigger regression.
> > > > 
> > > > I also tried to change /proc/sys/vm/dirty_ratio and dirty_background_ratio and didn't get improvement.
> > > could you tell us the exact iozone command you are using?
> > iozone -i 0 -r 4k -s 512m
> > 
> 
> OK, I definitely do not see the reported effect. On an HP ProLiant with a RAID5 on CCISS I get:
> 
> 2.6.19.2: 654-738 MB/sec write, 1126-1154 MB/sec rewrite
> 2.6.24-rc2: 772-820 MB/sec write, 1495-1539 MB/sec rewrite
> 
> The first run is always slowest; all subsequent runs are faster and equally fast.
Although the first run is always slowest, if we compare 2.6.23 and 2.6.24-rc,
the first-run result of 2.6.23 is 7 times that of 2.6.24-rc.

Originally, my test suite just picked up the result of the first run. I might
change it to run many times.

Now I run the test manually many times after the machine reboots. Comparing 2.6.24-rc
with 2.6.23, the 3rd and following runs of 2.6.24-rc have about a 50% regression.


* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-12 16:48             ` Peter Zijlstra
@ 2007-11-13  2:19               ` Zhang, Yanmin
  2007-11-13  8:34                 ` Zhang, Yanmin
  0 siblings, 1 reply; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-13  2:19 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Benny Halevy, LKML, Linus Torvalds, aneesh.kumar

On Mon, 2007-11-12 at 17:48 +0100, Peter Zijlstra wrote:
> On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> > On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > Single socket, dual core opteron, 2GB memory
> > > Single SATA disk, ext3
> > > 
> > > x86_64 kernel and userland
> > > 
> > > (dirty_background_ratio, dirty_ratio) tunables
> > > 
> > > ---- (5,10) - default
> > > 
> > > 2.6.23.1-42.fc8 #1 SMP
> > > 
> > >           524288       4   59580   60356
> > >           524288       4   59247   61101
> > >           524288       4   61030   62831
> > > 
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > > 
> > >           524288       4   49277   56582
> > >           524288       4   50728   61056
> > >           524288       4   52027   59758
> > >           524288       4   51520   62426
> > > 
> > > 
> > > ---- (20,40) - similar to your 8GB
> > > 
> > > 2.6.23.1-42.fc8 #1 SMP
> > > 
> > >           524288       4  225977  447461
> > >           524288       4  232595  496848
> > >           524288       4  220608  478076
> > >           524288       4  203080  445230
> > > 
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > > 
> > >           524288       4   54043   83585
> > >           524288       4   69949  516253
> > >           524288       4   72343  491416
> > >           524288       4   71775  492653
> > > 
> > > ---- (60,80) - overkill
> > > 
> > > 2.6.23.1-42.fc8 #1 SMP
> > > 
> > >           524288       4  208450  491892
> > >           524288       4  216262  481135
> > >           524288       4  221892  543608
> > >           524288       4  202209  574725
> > >           524288       4  231730  452482
> > > 
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > > 
> > >           524288       4   49091   86471
> > >           524288       4   65071  217566
> > >           524288       4   72238  492172
> > >           524288       4   71818  492433
> > >           524288       4   71327  493954
> > > 
> > > 
> > > While I see that the write speed as reported under .24 ~70MB/s is much
> > > lower than the one reported under .23 ~200MB/s, I find it very hard to
> > > believe my poor single SATA disk could actually do the 200MB/s for
> > > longer than its cache 8/16 MB (not sure).
> > > 
> > > vmstat shows that actual IO is done, even though the whole 512MB could
> > > fit in cache, hence my suspicion that the ~70MB/s is the most realistic
> > > of the two.
> > 
> > Even 70 MB/s seems too high.  What throughput do you see for the
> > raw disk partition/
> > 
> > Also, are the numbers above for successive runs?
> > It seems like you're seeing some caching effects so
> > I'd recommend using a file larger than your cache size and
> > the -e and -c options (to include fsync and close in timings)
> > to try to eliminate them.
> 
> ------ iozone -i 0 -r 4k -s 512m -e -c
> 
> .23 (20,40)
> 
>           524288       4   31750   33560
>           524288       4   29786   32114
>           524288       4   29115   31476
> 
> .24 (20,40)
> 
>           524288       4   25022   32411
>           524288       4   25375   31662
>           524288       4   26407   33871
> 
> 
> ------ iozone -i 0 -r 4k -s 4g -e -c
> 
> .23 (20,40)
> 
>          4194304       4   39699   35550
>          4194304       4   40225   36099
> 
> 
> .24 (20,40)
> 
>          4194304       4   39961   41656
>          4194304       4   39244   39673
> 
> 
> Yanmin, for that benchmark you ran, what was it meant to measure?
> From what I can make of it its just write cache benching.
Yeah. It's quite related to cache. I did more testing on my Stoakley machine (8 cores,
8GB memory). If I reduce the memory to 4GB, the speed is far slower.
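For completeness, the (dirty_background_ratio, dirty_ratio) pairs swept in the
quoted tables are set through /proc; a hypothetical sketch (requires root, so
the helper is defined but not invoked here):

```shell
# Sketch: apply one of the (background, dirty) ratio pairs from the
# quoted tables, e.g. the "(20,40) - similar to your 8GB" case.
set_dirty_ratios() {
    echo "$1" > /proc/sys/vm/dirty_background_ratio
    echo "$2" > /proc/sys/vm/dirty_ratio
}
# set_dirty_ratios 20 40
```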

> 
> One thing I don't understand is how the write numbers are so much lower
> than the rewrite numbers. The iozone code (which gives me headaches,
> damn what a mess) seems to suggest that the only thing that is different
> is the lack of block allocation.
That might be a good direction to investigate.

> 
> Linus posted a patch yesterday fixing up a regression in the ext3 bitmap
> block allocator, /me goes apply that patch and rerun the tests.
> 
> > > ---- (20,40) - similar to your 8GB
> > > 
> > > 2.6.23.1-42.fc8 #1 SMP
> > > 
> > >           524288       4  225977  447461
> > >           524288       4  232595  496848
> > >           524288       4  220608  478076
> > >           524288       4  203080  445230
> > > 
> > > 2.6.24-rc2 #28 SMP PREEMPT
> > > 
> > >           524288       4   54043   83585
> > >           524288       4   69949  516253
> > >           524288       4   72343  491416
> > >           524288       4   71775  492653
> 
> 2.6.24-rc2 +
>         patches/wu-reiser.patch
>         patches/writeback-early.patch
>         patches/bdi-task-dirty.patch
>         patches/bdi-sysfs.patch
>         patches/sched-hrtick.patch
>         patches/sched-rt-entity.patch
>         patches/sched-watchdog.patch
>         patches/linus-ext3-blockalloc.patch
> 
>           524288       4  179657  487676
>           524288       4  173989  465682
>           524288       4  175842  489800
> 
> 
> Linus' patch is the one that makes the difference here. So I'm unsure
> how you bisected it down to:
> 
>   04fbfdc14e5f48463820d6b9807daa5e9c92c51f
Originally, my test suite only picked up the result of the first run. Your prior
patch (speed up writeback ramp-up on clean systems) fixed an issue with the
first-run regression, so my bisect captured it.

However, later on, I found the following runs have different results. A moment ago,
I retested 04fbfdc14e5f48463820d6b9807daa5e9c92c51f by:
#git checkout 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
#make

Then I reversed your patch. It looks like 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
is not the root cause of the following-run regression. I will change my test suite to
run iozone many times and do a new bisect.
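The multi-run re-bisect described above could be driven by `git bisect run`
with a small judge helper; a hypothetical sketch (the baseline and the
half-throughput threshold are assumptions taken from the numbers quoted
earlier in this thread):

```shell
#!/bin/sh
# Hypothetical bisect helper: judge a kernel "bad" when the 3rd-run
# write throughput (KB/s) drops below half of the 2.6.23 baseline.
GOOD_THIRD_RUN=220608   # 3rd-run write figure from the 2.6.23 (20,40) table

judge() {
    if [ "$1" -lt $((GOOD_THIRD_RUN / 2)) ]; then
        echo bad        # git bisect run would take exit status 1 here
    else
        echo good       # ... and exit status 0 here
    fi
}

judge 72343    # 2.6.24-rc2 3rd run from the quoted table
judge 221892   # a healthy 2.6.23-level figure
```

In an actual bisect, each step would build and boot the kernel, run iozone a
few times, and pass the 3rd-run figure to the helper, exiting 0/1 for
`git bisect run`.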

> These results seem to point to
> 
>   7c9e69faa28027913ee059c285a5ea8382e24b5d
I tested 2.6.24-rc2, which already includes the above patch. 2.6.24-rc2 has the same
regression as 2.6.24-rc1.

-yanmin

* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-13  2:19               ` Zhang, Yanmin
@ 2007-11-13  8:34                 ` Zhang, Yanmin
  2007-11-13 18:32                   ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Zhang, Yanmin @ 2007-11-13  8:34 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Benny Halevy, LKML, Linus Torvalds, aneesh.kumar

On Tue, 2007-11-13 at 10:19 +0800, Zhang, Yanmin wrote:
> On Mon, 2007-11-12 at 17:48 +0100, Peter Zijlstra wrote:
> > On Mon, 2007-11-12 at 17:05 +0200, Benny Halevy wrote:
> > > On Nov. 12, 2007, 15:26 +0200, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > Single socket, dual core opteron, 2GB memory
> > > > Single SATA disk, ext3
> > > > 
> > > > 2.6.23.1-42.fc8 #1 SMP
> > > > 
> > > >           524288       4  225977  447461
> > > >           524288       4  232595  496848
> > > >           524288       4  220608  478076
> > > >           524288       4  203080  445230
> > > > 
> > > > 2.6.24-rc2 #28 SMP PREEMPT
> > > > 
> > > >           524288       4   54043   83585
> > > >           524288       4   69949  516253
> > > >           524288       4   72343  491416
> > > >           524288       4   71775  492653
> > 
> > 2.6.24-rc2 +
> >         patches/wu-reiser.patch
> >         patches/writeback-early.patch
> >         patches/bdi-task-dirty.patch
> >         patches/bdi-sysfs.patch
> >         patches/sched-hrtick.patch
> >         patches/sched-rt-entity.patch
> >         patches/sched-watchdog.patch
> >         patches/linus-ext3-blockalloc.patch
> > 
> >           524288       4  179657  487676
> >           524288       4  173989  465682
> >           524288       4  175842  489800
> > 
> > 
> > Linus' patch is the one that makes the difference here. So I'm unsure
> > how you bisected it down to:
> > 
> >   04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> Originally, my test suite is just to pick up the result of first run. Your prior
> patch(speed up writeback ramp-up on clean systems) fixed an issue about first
> run result regression. So my bisect captured it.
> 
> However, late on, I found following run have different results. A moment ago,
> I retested 04fbfdc14e5f48463820d6b9807daa5e9c92c51f by:
> #git checkout 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> #make
> 
> Then, reverse your patch. It looks like 04fbfdc14e5f48463820d6b9807daa5e9c92c51f
> is not the root cause of following run regression. I will change my test suite to
> make it run for many times and do a new bisect.
> 
> > These results seem to point to
> > 
> >   7c9e69faa28027913ee059c285a5ea8382e24b5d
My new bisect captured 7c9e69faa28027913ee059c285a5ea8382e24b5d,
which caused the regression in the following iozone runs (the 3rd/4th/... runs after
mounting the ext3 partition).

Peter,

Where can I download Linus' new patches, especially patches/linus-ext3-blockalloc.patch?
I couldn't find it in my archives of LKML mails.

yanmin

* Re: iozone write 50% regression in kernel 2.6.24-rc1
  2007-11-13  8:34                 ` Zhang, Yanmin
@ 2007-11-13 18:32                   ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2007-11-13 18:32 UTC (permalink / raw)
  To: Zhang, Yanmin; +Cc: Benny Halevy, LKML, Linus Torvalds, aneesh.kumar


On Tue, 2007-11-13 at 16:34 +0800, Zhang, Yanmin wrote:

> My new bisect captured 7c9e69faa28027913ee059c285a5ea8382e24b5d
> which caused the regression of iozone following run (3rd/4th... run after mounting
> the ext3 partition).

Linus just reverted that commit with commit:

commit 0b832a4b93932103d73c0c3f35ef1153e288327b
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date:   Tue Nov 13 08:07:31 2007 -0800

    Revert "ext2/ext3/ext4: add block bitmap validation"

    This reverts commit 7c9e69faa28027913ee059c285a5ea8382e24b5d, fixing up
    conflicts in fs/ext4/balloc.c manually.

    The cost of doing the bitmap validation on each lookup - even when the
    bitmap is cached - is absolutely prohibitive.  We could, and probably
    should, do it only when adding the bitmap to the buffer cache.  However,
    right now we are better off just reverting it.

    Peter Zijlstra measured the cost of this extra validation as a 85%
    decrease in cached iozone, and while I had a patch that took it down to
    just 17% by not being _quite_ so stupid in the validation, it was still
    a big slowdown that could have been avoided by just doing it right.




end of thread, other threads:[~2007-11-13 18:32 UTC | newest]

Thread overview: 16+ messages
2007-11-09  9:47 iozone write 50% regression in kernel 2.6.24-rc1 Zhang, Yanmin
2007-11-09  9:54 ` Peter Zijlstra
2007-11-12  2:14   ` Zhang, Yanmin
2007-11-12  9:45     ` Peter Zijlstra
2007-11-12  9:51       ` Zhang, Yanmin
2007-11-12 13:26         ` Peter Zijlstra
     [not found]           ` <47386BC4.3050403@panasas.com>
2007-11-12 16:48             ` Peter Zijlstra
2007-11-13  2:19               ` Zhang, Yanmin
2007-11-13  8:34                 ` Zhang, Yanmin
2007-11-13 18:32                   ` Peter Zijlstra
2007-11-12 17:25           ` Mark Lord
2007-11-13  1:49             ` Zhang, Yanmin
  -- strict thread matches above, loose matches on Subject: below --
2007-11-09 12:36 Martin Knoblauch
2007-11-12  0:45 ` Zhang, Yanmin
2007-11-12 12:58 Martin Knoblauch
2007-11-13  2:04 ` Zhang, Yanmin
