public inbox for linux-kernel@vger.kernel.org
* Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
@ 2004-08-05 17:02 Mr. Berkley Shands
  2004-08-05 17:25 ` William Lee Irwin III
  0 siblings, 1 reply; 16+ messages in thread
From: Mr. Berkley Shands @ 2004-08-05 17:02 UTC
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

Two severe disk read bugs:

In a nutshell (see the attachment for the gory details): moving from
2.6.6 to 2.6.7 dropped multi-threaded RAID0 read performance from
429MB/Sec to 81MB/Sec. Single-threaded reads improved from 368MB/Sec
to 418MB/Sec. The code in drivers/md has no effect on this problem.
Clearly this is a thread access issue. This is Red Hat ES3.0 on x86_64
or i686. The underlying hardware is capable of 955MB/Sec disk reads
off 28 drives, and 541MB/Sec off 14 drives. Tuning the I/O block size
(11KB to 239KB) and the BLKRASET size (448 to 1024 or more) helps a
little. System idle goes from 0% to 50% (2.6.6 to 2.6.8-rc3).

File reads (ext3/raid0) exceeding physical RAM size cause kswapd to go
out to lunch. The I/O rate drops to 10MB/Sec. Under 2.6.6 there was NO
effect for large files. Using fadvise64() helps a little on i686, but
hurts on x86_64. fadvise64_64() is just plain broken.

This was discovered while testing I/O throughput for a paper being
submitted to the ASPLOS BEACON workshop - October 2004.

I'll run most any experiment on either architecture to help diagnose
this, and will fiddle with kernel code and debug options as requested.
Source code for the test suite is available on request, to developers
only.

Mr. Berkley Shands
berkley<at>dssimail.com
berkley<at>cse.wustl.edu

[-- Attachment #2: Raid0.bug --]
[-- Type: text/plain, Size: 3714 bytes --]

I/O throughput regression bug going from 2.6.6 to 2.6.7 or 2.6.8-rc3
There are several I/O throughput bugs that were introduced in 2.6.7,
not related to any updates in drivers/md. The first reduces multi-threaded
reads of a file on an ext3/RAID0 filesystem from ~600MB/Sec to ~160MB/Sec
on my Opteron. The same result is seen on an i686-based system. Single-threaded
reads of the same ext3/RAID0 file show a ~60MB/Sec reduction.
The hardware is two Adaptec 39320A-R HBAs driving four 7-drive strings of
15K RPM Seagate U320 drives. The AIC79XX driver is V2.0.12 (the stock driver
shows lower performance).

The second throughput bug happens when the file being read is larger than
physical memory, in this case a 16GB file and 8GB of RAM. Reading the first
7GB of the file runs at ~420MB/Sec (one 39320A and 14 drives). The next 9GB
runs at 60MB/Sec or less. If I use fadvise64_64() to try to manage the file
cache, the rate drops to under 10MB/Sec :-)
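
For readers who want to reproduce the access pattern, here is a minimal
sketch of a sequential read that periodically tells the kernel it no longer
needs the pages already consumed. It is not the test suite mentioned in this
report. The function name stream_read, the drop_every interval, and the
choice of POSIX_FADV_DONTNEED are all assumptions (the report does not say
which advice value was used), and Python's os.posix_fadvise() goes through
the C library wrapper rather than calling fadvise64_64() directly.

```python
import os


def stream_read(path, read_size=11 * 1024, drop_every=64 * 1024 * 1024):
    """Read a file front to back, dropping cached pages behind the reader.

    Hypothetical helper: read_size mirrors the 11KB reads in the report,
    drop_every is an arbitrary interval chosen for illustration.
    Returns the number of bytes read.
    """
    total = 0          # bytes read so far
    dropped_upto = 0   # offset up to which DONTNEED advice has been given
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            buf = os.read(fd, read_size)
            if not buf:
                break
            total += len(buf)
            if total - dropped_upto >= drop_every:
                # Advise the kernel that the range just consumed is no
                # longer needed, so it can be evicted from the page cache.
                os.posix_fadvise(fd, dropped_upto, total - dropped_upto,
                                 os.POSIX_FADV_DONTNEED)
                dropped_upto = total
    finally:
        os.close(fd)
    return total
```

Timing this loop with and without the posix_fadvise() call on a file larger
than RAM is one way to compare against the rates quoted above, with the
caveat that Python overhead will cap throughput well below a C harness.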

Observations - kswapd goes nuts under 2.6.7 and 2.6.8-rc3 when the file being
read exceeds the physical memory size. System idle time (from top) is near zero
for 2 threads reading under 2.6.6, and is 50% or better for 2.6.7 or 2.6.8-rc3.
Otherwise I/O wait is the dominant state, at 89% to 95%. The Opteron is capable
of 955MB/Sec raw I/O off the 28 drive array using O_DIRECT on /dev/sda,
/dev/sdb..., and 541MB/Sec raw I/O off the 14 drive array.
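
To put the raw rates in perspective, the short calculation below expresses a
measured filesystem rate as a fraction of the raw O_DIRECT rate for the same
array. It is only a sketch: the helper names are made up for illustration,
and the 429MB/Sec and 81MB/Sec inputs are the 14 drive, 2-thread figures
quoted in the covering note.

```python
# Raw O_DIRECT rates quoted in this report, keyed by drive count (MB/Sec).
RAW_MB_S = {28: 955.0, 14: 541.0}


def per_drive(drives):
    """Raw MB/Sec contributed by each drive in the array."""
    return RAW_MB_S[drives] / drives


def fraction_of_raw(measured_mb_s, drives):
    """Measured filesystem rate as a fraction of the raw array rate."""
    return measured_mb_s / RAW_MB_S[drives]


if __name__ == "__main__":
    print(f"raw per drive, 14 drives: {per_drive(14):.1f} MB/Sec")
    print(f"2.6.6, 2 threads:     {fraction_of_raw(429.0, 14):.1%} of raw")
    print(f"2.6.8-rc3, 2 threads: {fraction_of_raw(81.0, 14):.1%} of raw")
```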

The values of 2 threads, 11KB reads, and an RASize of 448 were close to optimal
for 2.6.2 through 2.6.6 on the 14 drive system. fadvise64_64() is broken on
i686 and x86_64. The 3rd parameter is being passed garbage off the stack.
Patching the ioctl fixed that one. fadvise64_64() helps on the i686, but is
very harmful on the x86_64.
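
The "2 threads, 11KB reads" configuration can be approximated with the
sketch below. It is not the test suite from this report: how that suite
divides the file between threads is not stated, so giving each thread its
own contiguous slice (read with os.pread) is an assumption, as are the
function and parameter names. It illustrates the access pattern only and
Python will not reach the rates in the tables.

```python
import os
import threading
import time


def threaded_read(path, threads=2, read_size=11 * 1024):
    """Read a file with several threads, each taking one contiguous slice.

    Hypothetical helper. Returns (total_bytes_read, MB_per_second).
    """
    size = os.path.getsize(path)
    chunk = (size + threads - 1) // threads   # slice length, rounded up
    counts = [0] * threads                    # bytes read per thread

    def worker(index):
        fd = os.open(path, os.O_RDONLY)
        try:
            offset = index * chunk
            end = min(offset + chunk, size)
            while offset < end:
                # pread takes an explicit offset, so threads never race
                # over a shared file position.
                buf = os.pread(fd, min(read_size, end - offset), offset)
                if not buf:
                    break
                offset += len(buf)
                counts[index] += len(buf)
        finally:
            os.close(fd)

    start = time.perf_counter()
    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    elapsed = time.perf_counter() - start

    total = sum(counts)
    rate = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Running it with threads=1 and threads=2 against the same file gives the
single-threaded versus multi-threaded comparison that the report is about.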

The values passed via ioctl(BLKRASET | BLKFRASET) to get peak performance
vary radically between 2.6.6, 2.6.7, and 2.6.8-rc3 for /dev/md0. The ioctl does
a right shift by 3 bits before using the passed in value, so my RASize value is
left shifted by 3 bits (X * 8) before being passed in.
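
The scaling described above can be sketched as follows. The request numbers
are the ones defined in <linux/fs.h> (_IO(0x12, 98) and _IO(0x12, 99)); they
are hard-coded here because Python's fcntl module does not export them, so
check them against your headers. The helper names are made up for
illustration, and set_readahead() needs root and a real block device.

```python
import fcntl
import os

# Assumed request numbers, taken from <linux/fs.h>.
BLKRASET = 0x1262  # _IO(0x12, 98): set read-ahead
BLKRAGET = 0x1263  # _IO(0x12, 99): get read-ahead

RA_SHIFT = 3  # the ioctl shifts its argument right by 3 bits


def rasize_to_ioctl_arg(rasize):
    """Scale an RASize from the tables into the value passed to the ioctl."""
    if rasize < 0:
        raise ValueError("RASize must be non-negative")
    return rasize << RA_SHIFT  # same as rasize * 8


def set_readahead(device, rasize):
    """Apply an RASize to a block device such as /dev/md0 (needs root)."""
    fd = os.open(device, os.O_RDONLY)
    try:
        # BLKRASET takes its argument by value, not through a pointer.
        fcntl.ioctl(fd, BLKRASET, rasize_to_ioctl_arg(rasize))
    finally:
        os.close(fd)
```

For example, the RASize of 448 used in the tables becomes 448 * 8 = 3584
when passed to the ioctl.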

4-controller, 28 drives, raid0, 2.6.6, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11,  448,   552.617 
 2,  11, 2736,   595.695 
 2,  11, 2275,   596.911 
 2,  11, 2253,   597.956 
 2,  11, 2321,   600.234 
 2,  11, 2164,   601.115 

2-controller, 14 drives, raid0, 2.6.8-rc3, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11, 448,    81.543 
 2, 239, 448,   154.706 
 2, 239, 673,   161.070 
 2, 239, 124,   161.158  
 2, 239, 149,   161.209 
 2, 239, 128,   161.298 
 2, 239, 229,   161.400 
 2, 239, 548,   161.897 

2-controller, 14 drives, raid0, 2.6.8-rc3, single thread
<threads, Read KBytes, RASize, MB/Sec> 
 1,  11,  448,   329.419
 1,  11,  935,   373.382
 1,  11,  894,   373.518
 1,  11, 1021,   377.442
 1,  11, 1023,   387.952
 1,  11, 1024,   418.985

2-controller, 14 drives, raid0, 2.6.6, 2 threads
<threads, Read KBytes, RASize, MB/Sec> 
 2,  11, 471,   429.170 
 2,  11, 470,   430.252 
 2,  11, 493,   430.523 
 2,  11, 514,   430.795 
 2,  11, 448,   431.612 

2-controller, 14 drives, raid0, 2.6.6, single thread
<threads, Read KBytes, RASize, MB/Sec> 
 1,   7, 448,   328.047 
 1,   7, 681,   365.714 
 1,   7, 675,   366.107 
 1,   7, 186,   366.237 
 1,   7, 668,   367.882 
 1,   7, 662,   368.213 
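
The result tables above are easy to post-process. The sketch below parses
rows in the "<threads, Read KBytes, RASize, MB/Sec>" layout and picks the
fastest one. The function names are hypothetical, and the sample is copied
from the 14 drive, 2.6.6, 2-thread table.

```python
def parse_rows(text):
    """Parse 'threads, KBytes, RASize, MB/Sec' lines, skipping anything else."""
    rows = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 4:
            continue
        try:
            rows.append((int(fields[0]), int(fields[1]),
                         int(fields[2]), float(fields[3])))
        except ValueError:
            continue  # header line or prose, not a data row
    return rows


def best_row(rows):
    """Return the row with the highest MB/Sec figure."""
    return max(rows, key=lambda row: row[3])


SAMPLE = """\
<threads, Read KBytes, RASize, MB/Sec>
 2,  11, 471,   429.170
 2,  11, 470,   430.252
 2,  11, 493,   430.523
 2,  11, 514,   430.795
 2,  11, 448,   431.612
"""

if __name__ == "__main__":
    threads, kbytes, rasize, rate = best_row(parse_rows(SAMPLE))
    print(f"best: {threads} threads, {kbytes}KB reads, "
          f"RASize {rasize}, {rate} MB/Sec")
```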

Hardware setup:
dual CPU 2.0GHz Opteron, Tyan S2885, 8GB RAM, dual 39320A-R on
different PCI-X buses. RedHat ES3.0-update 1.

dual CPU 2.66GHz Xeon w/hyperthreading, SuperMicro X5DA8, 2GB RAM,
dual 39320A-R (or AIC7902 on mobo), RedHat ES3.0-update 2.

14 Seagate 36GB 15K RPM U320 drives in one partition,
14 Fujitsu 73GB 15K RPM U320 drives in two 36GB partitions.
(better have the right ucode for those Fujitsu drives!)
In two StorCase 14-bay Infostations.

* Re: Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3
@ 2004-08-06  0:41 Berkley Shands
  0 siblings, 0 replies; 16+ messages in thread
From: Berkley Shands @ 2004-08-06  0:41 UTC
  To: marcelo.tosatti, wli; +Cc: berkley, linux-kernel

	I took the 2.6.6-bk7 image and replaced mm/readahead.c and mm/vmscan.c
with the versions from the 2.6.6-bk6 image (just those two files), and the
read-ahead error has vanished. However, the kernel panicked when reading a
16GB file. It may be related to an ongoing issue with PCI-X and SCSI error
recovery on the x86_64, so until I get into the office, I will not be able
to see what's on the console.
	So clearly the code in readahead.c and vmscan.c in -bk7 is
the source of one regression. I'll keep looking at the second bug
in the morning.
	Thanks to all for the pointers on where to look.

berkley


end of thread, other threads:[~2004-08-16 20:31 UTC | newest]

Thread overview: 16+ messages
2004-08-05 17:02 Severe I/O performance regression 2.6.6 to 2.6.7 or 2.6.8-rc3 Mr. Berkley Shands
2004-08-05 17:25 ` William Lee Irwin III
2004-08-05 19:58   ` Mr. Berkley Shands
2004-08-05 20:46     ` William Lee Irwin III
2004-08-05 22:33       ` Marcelo Tosatti
2004-08-06  0:21         ` William Lee Irwin III
2004-08-06  2:09         ` Andy Isaacson
2004-08-06  2:27           ` William Lee Irwin III
2004-08-06  2:42             ` Andy Isaacson
2004-08-06  3:11               ` William Lee Irwin III
2004-08-06  8:33             ` Helge Hafting
2004-08-06  8:51               ` William Lee Irwin III
2004-08-06 18:02   ` Fast patch for " Mr. Berkley Shands
2004-08-08  8:22     ` Ram Pai
2004-08-16 20:30       ` [PATCH] " Ram Pai
  -- strict thread matches above, loose matches on Subject: below --
2004-08-06  0:41 Berkley Shands
