Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3

public inbox for linux-kernel@vger.kernel.org

* Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
@ 2004-08-05 17:02 Mr. Berkley Shands
  2004-08-05 17:25 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Mr. Berkley Shands @ 2004-08-05 17:02 UTC
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

Two severe disk read bugs:

In a nutshell (see the attachment for the gory details): moving from
2.6.6 to 2.6.7 dropped multi-threaded RAID0 read performance from
429MB/Sec to 81MB/Sec. Single-threaded reads improved from 368MB/Sec
to 418MB/Sec. The code in drivers/md has no effect on this problem.
Clearly this is a thread access issue. Seen with Red Hat ES 3.0 on both
x86_64 and i686. The underlying hardware is capable of 955MB/Sec disk
reads off 28 drives, and 541MB/Sec off 14 drives. Tuning the I/O block
size (11KB to 239KB) and the BLKRASET size (448 to 1024 or more) helps
a little. System idle goes from 0% to 50% (2.6.6 to 2.6.8-rc3).

File reads (ext3/RAID0) exceeding the physical RAM size cause kswapd to
go out to lunch. The I/O rate drops to 10MB/Sec. Under 2.6.6 there was
NO effect for large files. Using fadvise64() helps a little on i686,
but hurts on x86_64. fadvise64_64() is just plain broken.

This was discovered while testing I/O throughput for a paper being
submitted to the ASPLOS BEACON workshop (October 2004).

I'll run most any experiment on either architecture to help diagnose
this, and will fiddle with kernel code and debug options as requested.
Source code for the test suite is available on request, to developers
only.

Mr. Berkley Shands
berkley<at>dssimail.com
berkley<at>cse.wustl.edu

[-- Attachment #2: Raid0.bug --]
[-- Type: text/plain, Size: 3714 bytes --]

I/O throughput regression bug going from 2.6.6 to 2.6.7 or 2.6.8-rc3

There are several I/O throughput bugs that were introduced in 2.6.7,
not related to any updates in drivers/md. The first reduces multi-threaded
reads of a file on an ext3/RAID0 filesystem from ~600MB/Sec to ~160MB/Sec
on my Opteron. The same result is seen on an i686-based system. Doing
single-threaded reads of the same ext3/RAID0 file shows a ~60MB/Sec reduction.
The hardware is two Adaptec 39320A-R HBAs driving four 7-drive strings of
15K RPM Seagate U320 drives. The AIC79XX driver is V2.0.12 (the stock driver
shows lower performance).

The second throughput bug happens when the file being read is larger than
physical memory, in this case a 16GB file with 8GB of RAM. Reading the first
7GB of the file runs at ~420MB/Sec (one 39320A and 14 drives). The next 9GB
runs at 60MB/Sec or less. If I use fadvise64_64() to try to manage the file
cache, the rate drops to under 10MB/Sec :-)

Observations - kswapd goes nuts under 2.6.7 and 2.6.8-rc3 when the file being
read exceeds the physical memory size. System idle time (from top) is near zero
for 2 threads reading under 2.6.6, and is 50% or better for 2.6.7 or 2.6.8-rc3.
Otherwise I/O wait is the dominant state, at 89% to 95%. The Opteron is capable
of 955MB/Sec raw I/O off the 28-drive array using O_DIRECT on /dev/sda,
/dev/sdb..., and 541MB/Sec raw I/O off the 14-drive array.

The values of 2 threads, 11KB reads, and 448 RASize were close to optimal
for 2.6.2 through 2.6.6 on the 14-drive system. fadvise64_64() is broken on
i686 and x86_64: the 3rd parameter is being passed garbage off the stack.
Patching the ioctl fixed that one. fadvise64_64() helps on the i686, but is
very harmful on the x86_64.

The values passed via ioctl(BLKRASET | BLKFRASET) to get peak performance
vary radically between 2.6.6, 2.6.7, and 2.6.8-rc3 for /dev/md0. The ioctl does
a right shift by 3 bits before using the passed in value, so my RASize value is
left shifted by 3 bits (X * 8) before being passed in.

4-controller, 28 drives, raid0, 2.6.6, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11,  448,   552.617 
 2,  11, 2736,   595.695 
 2,  11, 2275,   596.911 
 2,  11, 2253,   597.956 
 2,  11, 2321,   600.234 
 2,  11, 2164,   601.115 

2-controller, 14 drives, raid0, 2.6.8-rc3, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11, 448,    81.543 
 2, 239, 448,   154.706 
 2, 239, 673,   161.070 
 2, 239, 124,   161.158  
 2, 239, 149,   161.209 
 2, 239, 128,   161.298 
 2, 239, 229,   161.400 
 2, 239, 548,   161.897 

2-controller, 14 drives, raid0, 2.6.8-rc3, single thread
<threads, Read KBytes, RASize, MB/Sec> 
 1,  11, 448,   329.419
 1,  11, 935,   373.382 
 1,  11, 894,   373.518 
 1,  11, 1021,  377.442 
 1,  11, 1023,  387.952 
 1,  11, 1024,  418.985 

2-controller, 14 drives, raid0, 2.6.6, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11, 471,   429.170 
 2,  11, 470,   430.252 
 2,  11, 493,   430.523 
 2,  11, 514,   430.795 
 2,  11, 448,   431.612 

2-controller, 14 drives, raid0, 2.6.6, single thread
<threads, Read KBytes, RASize, MB/Sec> 
 1,   7, 448,   328.047 
 1,   7, 681,   365.714 
 1,   7, 675,   366.107 
 1,   7, 186,   366.237 
 1,   7, 668,   367.882 
 1,   7, 662,   368.213 

Hardware setup:
dual cpu 2.0GHz Opteron, Tyan S2885, 8GB RAM, dual 39320A-R on
different PCI-X buses. RedHat ES3.0-update 1.

dual cpu 2.66GHz Xeon w/hyperthreading, SuperMicro X5DA8, 2GB RAM,
dual 39320A-R (or AIC7902 on mobo), RedHat ES3.0-update 2.

14 Seagate 36GB 15K RPM U320 drives in one partition,
14 Fujitsu 73GB 15K RPM U320 drives in two 36GB partitions.
(better have the right ucode for those Fujitsu drives!)
In two StorCase 14-bay Infostations.


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 17:02 Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3 Mr. Berkley Shands
@ 2004-08-05 17:25 ` William Lee Irwin III
  2004-08-05 19:58   ` Mr. Berkley Shands
  2004-08-06 18:02   ` Fast patch for " Mr. Berkley Shands
  0 siblings, 2 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-05 17:25 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel

On Thu, Aug 05, 2004 at 12:02:09PM -0500, Mr. Berkley Shands wrote:
> Two severe disk read bugs:
> In a nutshell (see attached for gory details). Moving from 2.6.6 to 
> 2.6.7 dropped multi-threaded RAID0
> read performance from 429MB/Sec to 81MB/Sec. Single threaded reads 
> improved  368MB/Sec to 418MB/Sec.
> The code in drivers/md has no effect on this problem. Clearly this is a 
> thread access issue. Redhat ES3.0
> on x86_64 or i686.  The underlying hardware is capable of 955MB/Sec disk 
> reads off 28 drives,
> 541MB/Sec off 14 drives. Tuning I/O block size (11KB to 239KB) and 
> BLKRASET size (448 to 1024 or more)
> helps a little. System idle goes from 0% to 50% (2.6.6 to 2.6.8-rc3).

By any chance could you do binary search on the bk snapshots between
2.6.6 and 2.6.7?


-- wli


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 17:25 ` William Lee Irwin III
@ 2004-08-05 19:58   ` Mr. Berkley Shands
  2004-08-05 20:46     ` William Lee Irwin III
  2004-08-06 18:02   ` Fast patch for " Mr. Berkley Shands
  1 sibling, 1 reply; 16+ messages in thread
From: Mr. Berkley Shands @ 2004-08-05 19:58 UTC
  To: linux-kernel; +Cc: William Lee Irwin III

William Lee Irwin III wrote:

>By any chance could you do binary search on the bk snapshots between
>2.6.6 and 2.6.7?
>
>-- wli

The problem does not exist using 2.6.6-bk6, but exists on 2.6.6-bk7.
-bk8 and -bk9 fail to build. These are from patches-2.6.6-bk6 off
snapshots/old, applied to a vanilla 2.6.6 kernel.

berkley



* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 19:58   ` Mr. Berkley Shands
@ 2004-08-05 20:46     ` William Lee Irwin III
  2004-08-05 22:33       ` Marcelo Tosatti
  0 siblings, 1 reply; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-05 20:46 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel

William Lee Irwin III wrote:
>> By any chance could you do binary search on the bk snapshots between
>> 2.6.6 and 2.6.7?

On Thu, Aug 05, 2004 at 02:58:50PM -0500, Mr. Berkley Shands wrote:
> the problem does not exist using 2.6.6-bk6, but exists on 2.6.6-bk7. 
> -bk8 and -bk9 fail to build.
> these are from patches-2.6.6-bk6 off snapshots/old and applied to a 
> vanilla 2.6.6 kernel.

This is the closest it appears to be possible to narrow down where the
regression happened.

Some form of changelogging to enumerate what the contents of the
2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
between 2.6.6-bk6 and 2.6.6-bk7 is needed.

I have already tried to carry out various procedures to accomplish this
for several other problem reports and/or issues, and have come away
from the effort highly discouraged (having made zero progress) each time.

-- wli


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 20:46     ` William Lee Irwin III
@ 2004-08-05 22:33       ` Marcelo Tosatti
  2004-08-06  0:21         ` William Lee Irwin III
  2004-08-06  2:09         ` Andy Isaacson
  0 siblings, 2 replies; 16+ messages in thread
From: Marcelo Tosatti @ 2004-08-05 22:33 UTC
  To: William Lee Irwin III, Mr. Berkley Shands, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1490 bytes --]

On Thu, Aug 05, 2004 at 01:46:15PM -0700, William Lee Irwin III wrote:
> William Lee Irwin III wrote:
> >> By any chance could you do binary search on the bk snapshots between
> >> 2.6.6 and 2.6.7?
> 
> On Thu, Aug 05, 2004 at 02:58:50PM -0500, Mr. Berkley Shands wrote:
> > the problem does not exist using 2.6.6-bk6, but exists on 2.6.6-bk7. 
> > -bk8 and -bk9 fail to build.
> > these are from patches-2.6.6-bk6 off snapshots/old and applied to a 
> > vanilla 2.6.6 kernel.
> 
> This is the closest it appears to be possible to narrow down where the
> regression happened.
> 
> Some form of changelogging to enumerate what the contents of the
> 2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
> between 2.6.6-bk6 and 2.6.6-bk7 is needed.

Indeed it's nasty; the problem is that there is no tagging in the main BK
repository representing the -bk trees. It shouldn't be too hard to do
something about this? I can't think of anything which could help...

> I have already tried to carry out various procedures to accomplish this
> for several other problem reports and/or issues and have come away
> from the effort highly discouraged (having made zero progress) each time.

A quick look in bk6-bk7 reveals the regression is probably related 
to either the readahead changes or, less likely, the mm/vmscan.c 
changes.

I'm attaching a tarball with "readahead" and "vmscan" patches. Mr. Berkley,
can you try reverting these one at a time and repeating your tests?

Thanks!

[-- Attachment #2: patches-bk6-bk7.tar.bz2 --]
[-- Type: application/x-bzip2, Size: 2084 bytes --]


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 22:33       ` Marcelo Tosatti
@ 2004-08-06  0:21         ` William Lee Irwin III
  2004-08-06  2:09         ` Andy Isaacson
  1 sibling, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-06  0:21 UTC
  To: Marcelo Tosatti; +Cc: Mr. Berkley Shands, linux-kernel

On Thu, Aug 05, 2004 at 01:46:15PM -0700, William Lee Irwin III wrote:
>> Some form of changelogging to enumerate what the contents of the
>> 2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
>> between 2.6.6-bk6 and 2.6.6-bk7 is needed.

On Thu, Aug 05, 2004 at 07:33:19PM -0300, Marcelo Tosatti wrote:
> Indeed it's nasty; the problem is that there is no tagging in the main BK
> repository representing the -bk trees. It shouldn't be too hard to do
> something about this? I can't think of anything which could help...

It would help me a lot if someone would do something about this.
Distributed development does not require that mainline be
nonreconstructible. I've had serious needs to do this kind of searching
going back to early 2.4.x on several occasions, and nothing is capable
of producing correct results (referring to attempts at cvsps + bkcvs
and other alternatives to this).


-- wli


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 22:33       ` Marcelo Tosatti
  2004-08-06  0:21         ` William Lee Irwin III
@ 2004-08-06  2:09         ` Andy Isaacson
  2004-08-06  2:27           ` William Lee Irwin III
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Isaacson @ 2004-08-06  2:09 UTC
  To: Marcelo Tosatti; +Cc: William Lee Irwin III, Mr. Berkley Shands, linux-kernel

On Thu, Aug 05, 2004 at 07:33:19PM -0300, Marcelo Tosatti wrote:
> On Thu, Aug 05, 2004 at 01:46:15PM -0700, William Lee Irwin III wrote:
> > > the problem does not exist using 2.6.6-bk6, but exists on 2.6.6-bk7. 
> > > -bk8 and -bk9 fail to build.
> > > these are from patches-2.6.6-bk6 off snapshots/old and applied to a 
> > > vanilla 2.6.6 kernel.
> > 
> > This is the closest it appears to be possible to narrow down where the
> > regression happened.
> > 
> > Some form of changelogging to enumerate what the contents of the
> > 2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
> > between 2.6.6-bk6 and 2.6.6-bk7 is needed.

If you're willing to use bk, it's trivial.  Each changeset refers to a
particular state of the tree.  If "bk -r check -acv" reports no errors,
and "bk changes -r+ -d:KEY:" reports a particular key, you are
guaranteed that your tree state matches exactly the state of anyone else
who has that key at any point in the past. [1]

So if the -bkX creation script doesn't already, it should "bk changes
-r+ -d:KEY: > key-bk$X" when it creates the tarball.  Then anyone can
"bk clone -r`cat key-bk7` linux-2.5 linux-2.6-bk7" and duplicate the
-bk7 state of the tree, and then "bk changes -L ../linux-2.6-bk6" to
find the list of changesets differing.

> Indeed it's nasty; the problem is that there is no tagging in the main BK
> repository representing the -bk trees. It shouldn't be too hard to do
> something about this? I can't think of anything which could help...

Tagging isn't the answer for snapshots.  Rather, the snapshot metadata
needs to include the cset key at the snapshot instant.

[1] well, caveat -- bk isn't cryptographically secure, so probably a
    motivated attacker could construct a tree which would pass this test
    but have different contents.  This wouldn't allow the attacker to
    push invalid contents to other trees, just to have different
    contents in their tree.

-andy


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06  2:09         ` Andy Isaacson
@ 2004-08-06  2:27           ` William Lee Irwin III
  2004-08-06  2:42             ` Andy Isaacson
  2004-08-06  8:33             ` Helge Hafting
  0 siblings, 2 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-06  2:27 UTC
  To: Andy Isaacson; +Cc: Marcelo Tosatti, Mr. Berkley Shands, linux-kernel

At some point in the past, I wrote:
>>> Some form of changelogging to enumerate what the contents of the
>>> 2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
>>> between 2.6.6-bk6 and 2.6.6-bk7 is needed.

On Thu, Aug 05, 2004 at 09:09:30PM -0500, Andy Isaacson wrote:
> If you're willing to use bk, it's trivial.  Each changeset refers to a
> particular state of the tree.  If "bk -r check -acv" reports no errors,
> and "bk changes -r+ -d:KEY:" reports a particular key, you are
> guaranteed that your tree state matches exactly the state of anyone else
> who has that key at any point in the past. [1]
> So if the -bkX creation script doesn't already, it should "bk changes
> -r+ -d:KEY: > key-bk$X" when it creates the tarball.  Then anyone can
> "bk clone -r`cat key-bk7` linux-2.5 linux-2.6-bk7" and duplicate the
> -bk7 state of the tree, and then "bk changes -L ../linux-2.6-bk6" to
> find the list of changesets differing.

Once we get there, there must be some way to construct intermediate
points between those two faithful at the very least to the snapshot
ordering if not true chronological ordering.


On Thu, Aug 05, 2004 at 07:33:19PM -0300, Marcelo Tosatti wrote:
>> Indeed it's nasty; the problem is that there is no tagging in the main BK
>> repository representing the -bk trees. It shouldn't be too hard to
>> do something about this? I can't think of anything which could help...

On Thu, Aug 05, 2004 at 09:09:30PM -0500, Andy Isaacson wrote:
> Tagging isn't the answer for snapshots.  Rather, the snapshot metadata
> needs to include the cset key at the snapshot instant.
> [1] well, caveat -- bk isn't cryptographically secure, so probably a
>     motivated attacker could construct a tree which would pass this test
>     but have different contents.  This wouldn't allow the attacker to
>     push invalid contents to other trees, just to have different
>     contents in their tree.

Yes, this would be very helpful.


-- wli


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06  2:27           ` William Lee Irwin III
@ 2004-08-06  2:42             ` Andy Isaacson
  2004-08-06  3:11               ` William Lee Irwin III
  2004-08-06  8:33             ` Helge Hafting
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Isaacson @ 2004-08-06  2:42 UTC
  To: William Lee Irwin III, Marcelo Tosatti, Mr. Berkley Shands,
	linux-kernel

On Thu, Aug 05, 2004 at 07:27:34PM -0700, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >>> Some form of changelogging to enumerate what the contents of the
> >>> 2.6.6-bk6 -> 2.6.6-bk7 delta are and to reconstruct intermediate points
> >>> between 2.6.6-bk6 and 2.6.6-bk7 is needed.
> 
> On Thu, Aug 05, 2004 at 09:09:30PM -0500, Andy Isaacson wrote:
> > So if the -bkX creation script doesn't already, it should "bk changes
> > -r+ -d:KEY: > key-bk$X" when it creates the tarball.  Then anyone can
> > "bk clone -r`cat key-bk7` linux-2.5 linux-2.6-bk7" and duplicate the
> > -bk7 state of the tree, and then "bk changes -L ../linux-2.6-bk6" to
> > find the list of changesets differing.
> 
> Once we get there, there must be some way to construct intermediate
> points between those two faithful at the very least to the snapshot
> ordering if not true chronological ordering.

Well, the state of the "central tree" is represented by a cset key at
each point.  So the answer to your question is a list of keys.  But the
keys in question aren't "special" in any bk sense; they're just some
keys.  You can keep track of keys outside of BK if you want, to keep a
history of "state of this tree at time X", but BK can't keep track of
that info.

Anyways, maybe an example is in order.

% bk prs -hnd:KEY: -rv2.5.4-pre6..v2.5.4 ChangeSet
torvalds@home.transmeta.com|ChangeSet|20020211032403|18448
torvalds@home.transmeta.com|ChangeSet|20020211014924|18455
torvalds@home.transmeta.com|ChangeSet|20020211013331|26396
paulus@quango.(none)|ChangeSet|20020211005601|64956
davej@suse.de|ChangeSet|20020211004458|26395
rml@tech9.net|ChangeSet|20020210234603|34727
kai@vaio.(none)|ChangeSet|20020210234057|58664
gkernel.adm@hostme.bitkeeper.com|ChangeSet|20020210215119|34443
rml@tech9.net|ChangeSet|20020210205932|34726

Those are the changesets that are present in 2.5.4 that aren't present
in 2.5.4-pre6 (note how I used tag..tag in the -r option.)  Similarly,
you can do
-r'rml@tech9.net|ChangeSet|20020210205932|34726..davej@suse.de|ChangeSet|20020211004458|26395'
(that is, "key1..key2") to find the keys implied by key2 and not implied
by key1.  (Read "bk help set" for even more sophisticated options.)

Pick the interesting csets and do a binary search...  (Or better, write
a script to do it!)

-andy


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06  2:42             ` Andy Isaacson
@ 2004-08-06  3:11               ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-06  3:11 UTC
  To: Andy Isaacson; +Cc: Marcelo Tosatti, Mr. Berkley Shands, linux-kernel

On Thu, Aug 05, 2004 at 07:27:34PM -0700, William Lee Irwin III wrote:
>> Once we get there, there must be some way to construct intermediate
>> points between those two faithful at the very least to the snapshot
>> ordering if not true chronological ordering.

On Thu, Aug 05, 2004 at 09:42:21PM -0500, Andy Isaacson wrote:
> Well, the state of the "central tree" is represented by a cset key at
> each point.  So the answer to your question is a list of keys.  But the
> keys in question aren't "special" in any bk sense; they're just some
> keys.  You can keep track of keys outside of BK if you want, to keep a
> history of "state of this tree at time X", but BK can't keep track of
> that info.
> Anyways, maybe an example is in order.
[...]

Sounds like time to put this into Documentation/


-- wli


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06  2:27           ` William Lee Irwin III
  2004-08-06  2:42             ` Andy Isaacson
@ 2004-08-06  8:33             ` Helge Hafting
  2004-08-06  8:51               ` William Lee Irwin III
  1 sibling, 1 reply; 16+ messages in thread
From: Helge Hafting @ 2004-08-06  8:33 UTC
  To: William Lee Irwin III
  Cc: Andy Isaacson, Marcelo Tosatti, Mr. Berkley Shands, linux-kernel

William Lee Irwin III wrote:

>At some point in the past, I wrote:
>
>>-r+ -d:KEY: > key-bk$X" when it creates the tarball.  Then anyone can
>>"bk clone -r`cat key-bk7` linux-2.5 linux-2.6-bk7" and duplicate the
>>-bk7 state of the tree, and then "bk changes -L ../linux-2.6-bk6" to
>>find the list of changesets differing.
>>    
>>
>
>Once we get there, there must be some way to construct intermediate
>points between those two faithful at the very least to the snapshot
>ordering if not true chronological ordering.
>  
>
You don't really need chronology for a binary search.  With a
list of changesets, just apply/back out half of them.  Divide the lot
any way you like, perhaps starting with only the "suspected" ones.

Helge Hafting


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06  8:33             ` Helge Hafting
@ 2004-08-06  8:51               ` William Lee Irwin III
  0 siblings, 0 replies; 16+ messages in thread
From: William Lee Irwin III @ 2004-08-06  8:51 UTC
  To: Helge Hafting
  Cc: Andy Isaacson, Marcelo Tosatti, Mr. Berkley Shands, linux-kernel

William Lee Irwin III wrote:
>> Once we get there, there must be some way to construct intermediate
>> points between those two faithful at the very least to the snapshot
>> ordering if not true chronological ordering.

On Fri, Aug 06, 2004 at 10:33:54AM +0200, Helge Hafting wrote:
> You don't really need chronology for a binary search.  With a
> list of changesets, just apply/back out half of them.  Divide the lot
> any way you like, perhaps starting with only the "suspected" ones.

Wrong. Without chronology, one first of all gets an uncompilable tree
half the time, and second, more importantly, one has no method of
correlating the reconstructed source with user observations. Those
often come in the form of "version $FOO worked for me, but then I
upgraded to version $BAR, and the world exploded."

Between user-observable release points, one could say anything goes
modulo the first point, which is that this artificially constructed
state may be complete gibberish from the standpoint of patches mixing.
But there is no way around the fact that user-observable release points
must be reconstructible and the ordering must be faithful to them. In
fact, this is so fine-grained as to include nightly snapshots.

-- wli


* Fast patch for Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-05 17:25 ` William Lee Irwin III
  2004-08-05 19:58   ` Mr. Berkley Shands
@ 2004-08-06 18:02   ` Mr. Berkley Shands
  2004-08-08  8:22     ` Ram Pai
  1 sibling, 1 reply; 16+ messages in thread
From: Mr. Berkley Shands @ 2004-08-06 18:02 UTC
  To: linux-kernel

In 2.6.8-rc3/mm/readahead.c, line 475 (near the label do_io:):
#if 0
            ra->next_size = min(average , (unsigned long)max);
#endif

The comment above this line says we get here after an lseek. The lseek
IS inside the window, but the code will always destroy the window and
start again. Disabling the line as shown corrects the performance
problem, but it would be better to do nothing if the lseek is still
within the readahead window.

berkley


* Re: Fast patch for Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-06 18:02   ` Fast patch for " Mr. Berkley Shands
@ 2004-08-08  8:22     ` Ram Pai
  2004-08-16 20:30       ` [PATCH] " Ram Pai
  0 siblings, 1 reply; 16+ messages in thread
From: Ram Pai @ 2004-08-08  8:22 UTC
  To: Mr. Berkley Shands; +Cc: linux-kernel, akpm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1524 bytes --]

On Fri, 6 Aug 2004, Mr. Berkley Shands wrote:

> in 2.6.8-rc3/mm/readahead.c line 475 (about label do_io:)
> #if 0
>             ra->next_size = min(average , (unsigned long)max);
> #endif
> 
> the comment for the above is here after an lseek. The lseek IS inside 
> the window, but the code will always
> destroy the window and start again. The above patch corrects the 
> performance problem,
> but it would be better to do nothing if the lseek is still within the 
> read ahead window.

Ok. I can see your point. I did introduce a subtle change in behavior
in 2.6.8-rc3. The change in behavior is: the current window got populated
based on the average number of contiguous pages accessed in the past.
Prior to that patch, the current window got populated based on the
amount of locality in the current window, seen in the past.

Try this patch and see if things get back to normal. This patch
populates the current window based on the average amount of locality
noticed in the current window. It should help your case, but who knows
which other benchmark will scream :( . It's hard to keep every workload happy.

In any case, try it and see if this helps your case at least. Meanwhile I will
run my set of benchmarks and see what gets affected.

RP


> 
> berkley

[-- Attachment #2: Type: TEXT/PLAIN, Size: 905 bytes --]

--- linux-2.6.8-rc3/mm/readahead.c	2004-08-03 14:26:46.000000000 -0700
+++ ram/linux-2.6.8-rc3/mm/readahead.c	2004-08-08 07:45:40.559431280 -0700
@@ -388,10 +388,7 @@ page_cache_readahead(struct address_spac
 		goto do_io;
 	}
 
-	if (offset == ra->prev_page + 1) {
-		if (ra->serial_cnt <= (max * 2))
-			ra->serial_cnt++;
-	} else {
+	if (offset < ra->start || offset > (ra->start + ra->size)) {
 		/*
 		 * to avoid rounding errors, ensure that 'average'
 		 * tends towards the value of ra->serial_cnt.
@@ -402,9 +399,13 @@ page_cache_readahead(struct address_spac
 		}
 		ra->average = (average + ra->serial_cnt) / 2;
 		ra->serial_cnt = 1;
+	} else {
+		if (ra->serial_cnt <= (max * 2))
+			ra->serial_cnt++;
 	}
 	ra->prev_page = offset;
 
+
 	if (offset >= ra->start && offset <= (ra->start + ra->size)) {
 		/*
 		 * A readahead hit.  Either inside the window, or one


* [PATCH] Re: Fast patch for Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
  2004-08-08  8:22     ` Ram Pai
@ 2004-08-16 20:30       ` Ram Pai
  0 siblings, 0 replies; 16+ messages in thread
From: Ram Pai @ 2004-08-16 20:30 UTC
  To: akpm; +Cc: Mr. Berkley Shands, linux-kernel, miklos, shrybman, badari

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2532 bytes --]

On Sun, 8 Aug 2004, Ram Pai wrote:

> On Fri, 6 Aug 2004, Mr. Berkley Shands wrote:
> 
> > in 2.6.8-rc3/mm/readahead.c line 475 (about label do_io:)
> > #if 0
> >             ra->next_size = min(average , (unsigned long)max);
> > #endif
> > 
> > the comment for the above is here after an lseek. The lseek IS inside 
> > the window, but the code will always
> > destroy the window and start again. The above patch corrects the 
> > performance problem,
> > but it would be better to do nothing if the lseek is still within the 
> > read ahead window.
> 
> Ok. I can see your point. I did introduce a subtle change in behavior
> in 2.6.8-rc3. The change in behavior is: the current window got populated
> based on the average number of contiguous pages accessed in the past.
> Prior to that patch, the current window got populated based on the
> amount of locality in the current window, seen in the past.
> 
> Try this patch and see if things get back to normal. This patch
> populates the current window based on the average amount of locality
> noticed in the current window. It should help your case, but who knows
> which other benchmark will scream :( . It's hard to keep every workload happy.
> 
> In any case, try it and see if this helps your case at least. Meanwhile I will
> run my set of benchmarks and see what gets affected.

Andrew,

	Here is a consolidated readahead patch against the mm tree that takes
care of the performance regression seen with multi-threaded reads from the
same file descriptor.

	The patch does the following:

	1. Instead of calculating the average count of sequential
		access in the read patterns, it calculates the
		average amount of hits in the current window.
	2. This average is used to guide the size of the next current
		window.
	3. Since the field serial_cnt in the ra structure does not
	 	make sense with the introduction of the new logic,
		I have renamed that field as currnt_wnd_hit.


	This patch will help read patterns that are not necessarily
sequential but have sufficient locality. However, it may regress random
workloads.

	Results:
		
	1. Berkley Shands has reported great performance with this
		patch.
	2. iozone showed negligible effect on various read patterns.
	3. DSS workload saw negligible change.
	4. Sysbench saw a small improvement.

	I am not sure what good/bad effects this patch will cause
	on other workloads. So can we keep this patch in your tree
	for a while and see which benchmarks scream?
	I have generated this patch w.r.t. 2.6.8-rc4-mm1.

RP

[-- Attachment #2: Type: TEXT/PLAIN, Size: 3183 bytes --]

diff -pruN ram/linux-2.6.8-rc4/include/linux/fs.h linux-2.6.8-rc4/include/linux/fs.h
--- ram/linux-2.6.8-rc4/include/linux/fs.h	2004-08-16 19:55:53.753441464 -0700
+++ linux-2.6.8-rc4/include/linux/fs.h	2004-08-16 20:01:45.996892320 -0700
@@ -556,8 +556,8 @@ struct file_ra_state {
 	unsigned long prev_page;	/* Cache last read() position */
 	unsigned long ahead_start;	/* Ahead window */
 	unsigned long ahead_size;
-	unsigned long serial_cnt;	/* measure of sequentiality */
-	unsigned long average;		/* another measure of sequentiality */
+	unsigned long currnt_wnd_hit;	/* locality in the current window */
+	unsigned long average;		/* size of next current window */
 	unsigned long ra_pages;		/* Maximum readahead window */
 	unsigned long mmap_hit;		/* Cache hit stat for mmap accesses */
 	unsigned long mmap_miss;	/* Cache miss stat for mmap accesses */
diff -pruN ram/linux-2.6.8-rc4/mm/readahead.c linux-2.6.8-rc4/mm/readahead.c
--- ram/linux-2.6.8-rc4/mm/readahead.c	2004-08-16 19:55:54.243366984 -0700
+++ linux-2.6.8-rc4/mm/readahead.c	2004-08-16 20:01:46.554807504 -0700
@@ -384,25 +384,10 @@ page_cache_readahead(struct address_spac
 		first_access=1;
 		ra->next_size = max / 2;
 		ra->prev_page = offset;
-		ra->serial_cnt++;
+		ra->currnt_wnd_hit++;
 		goto do_io;
 	}
 
-	if (offset == ra->prev_page + 1) {
-		if (ra->serial_cnt <= (max * 2))
-			ra->serial_cnt++;
-	} else {
-		/*
-		 * to avoid rounding errors, ensure that 'average'
-		 * tends towards the value of ra->serial_cnt.
-		 */
-		average = ra->average;
-		if (average < ra->serial_cnt) {
-			average++;
-		}
-		ra->average = (average + ra->serial_cnt) / 2;
-		ra->serial_cnt = 1;
-	}
 	ra->prev_page = offset;
 
 	if (offset >= ra->start && offset <= (ra->start + ra->size)) {
@@ -411,12 +396,22 @@ page_cache_readahead(struct address_spac
 		 * page beyond the end.  Expand the next readahead size.
 		 */
 		ra->next_size += 2;
+
+		if (ra->currnt_wnd_hit <= (max * 2))
+			ra->currnt_wnd_hit++;
 	} else {
 		/*
 		 * A miss - lseek, pagefault, pread, etc.  Shrink the readahead
 		 * window.
 		 */
 		ra->next_size -= 2;
+
+		average = ra->average;
+		if (average < ra->currnt_wnd_hit) {
+			average++;
+		}
+		ra->average = (average + ra->currnt_wnd_hit) / 2;
+		ra->currnt_wnd_hit = 1;
 	}
 
 	if ((long)ra->next_size > (long)max)
@@ -468,7 +463,11 @@ do_io:
 			  * pages shall be accessed in the next
 			  * current window.
 			  */
-			ra->next_size = min(ra->average , (unsigned long)max);
+			average = ra->average;
+			if (ra->currnt_wnd_hit > average)
+				average = (ra->currnt_wnd_hit + ra->average + 1) / 2;
+
+			ra->next_size = min(average , (unsigned long)max);
 		}
 		ra->start = offset;
 		ra->size = ra->next_size;
@@ -501,8 +500,8 @@ do_io:
 			 * random. Hence don't bother to readahead.
 			 */
 			average = ra->average;
-			if (ra->serial_cnt > average)
-				average = (ra->serial_cnt + ra->average + 1) / 2;
+			if (ra->currnt_wnd_hit > average)
+				average = (ra->currnt_wnd_hit + ra->average + 1) / 2;
 
 			if (average > max) {
 				ra->ahead_start = ra->start + ra->size;


* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
@ 2004-08-06  0:41 Berkley Shands
  0 siblings, 0 replies; 16+ messages in thread
From: Berkley Shands @ 2004-08-06  0:41 UTC (permalink / raw)
  To: marcelo.tosatti, wli; +Cc: berkley, linux-kernel

	I took the 2.6.6-bk7 image and replaced mm/readahead.c and mm/vmscan.c
with the versions from the 2.6.6-bk6 image (just those two files), and the
readahead error has vanished. However, the kernel panicked when reading a
16GB file. That may be related to an ongoing issue with PCI-X and SCSI error
recovery on x86_64, so until I get into the office I will not be able
to see what's on the console.
	So clearly the code in readahead.c and vmscan.c in -bk7 is
the source of one regression. I'll keep looking at the second bug
in the morning.
	Thanks to all for the pointers on where to look.

berkley


Thread overview: 16+ messages
2004-08-05 17:02 Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3 Mr. Berkley Shands
2004-08-05 17:25 ` William Lee Irwin III
2004-08-05 19:58   ` Mr. Berkley Shands
2004-08-05 20:46     ` William Lee Irwin III
2004-08-05 22:33       ` Marcelo Tosatti
2004-08-06  0:21         ` William Lee Irwin III
2004-08-06  2:09         ` Andy Isaacson
2004-08-06  2:27           ` William Lee Irwin III
2004-08-06  2:42             ` Andy Isaacson
2004-08-06  3:11               ` William Lee Irwin III
2004-08-06  8:33             ` Helge Hafting
2004-08-06  8:51               ` William Lee Irwin III
2004-08-06 18:02   ` Fast patch for " Mr. Berkley Shands
2004-08-08  8:22     ` Ram Pai
2004-08-16 20:30       ` [PATCH] " Ram Pai
  -- strict thread matches above, loose matches on Subject: below --
2004-08-06  0:41 Berkley Shands
