public inbox for linux-kernel@vger.kernel.org
* Horrible drive performance under concurrent i/o jobs (dlh problem?)
@ 2002-12-18 19:06 Torben Frey
  2002-12-20 20:40 ` Joseph D. Wagner
  2002-12-23 18:13 ` Denis Vlasenko
  0 siblings, 2 replies; 23+ messages in thread
From: Torben Frey @ 2002-12-18 19:06 UTC (permalink / raw)
  To: linux-kernel

Hi list readers (and hopefully writers),

after going crazy over our main server in the company for more than a
week now, this list is possibly my last hope - I am no kernel programmer,
but I suspect this is a kernel problem. Reading through the list archives
did not help me (although for a while I thought it had, see below).

We are running a 3ware Escalade 7850 RAID controller with 7 IBM Deskstar
GXP 180 disks in RAID 5 mode, which gives a 1.11 TB array.
There's one partition on it, /dev/sda1, formatted with ReiserFS format
3.6. The board is an MSI 6501 (K7D Master) with 1 GB RAM but only one
processor.

We were running the RAID smoothly as long as there was not much I/O - but
when we tried to produce large amounts of data last week, read and write
performance dropped to unacceptably low rates. The load of the machine
climbed to 8, 9, 10... and every disk access stopped processes (nedit, ls)
from responding for a few seconds. An "rm" of many small files left the
machine unresponsive even to "reboot"; I had to reset it.

So I copied everything away to a software RAID and tried all the disk
tuning knobs (min-/max-readahead, bdflush, elvtune). Nothing helped.
Last Sunday I then found a hint about a bug introduced in kernel
2.4.19-pre6 which could be fixed either with the "dlh" (disk latency
hack) or by going back to 2.4.18. The latter is what I did (coming from 2.4.20).
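
(For reference, this is roughly the kind of tuning I mean - the values
below are only examples, not the exact ones I used:)

# 2.4 readahead and bdflush tunables (example values only)
echo 3 > /proc/sys/vm/min-readahead
echo 127 > /proc/sys/vm/max-readahead
echo 30 64 64 256 500 3000 60 0 0 > /proc/sys/vm/bdflush
# elevator read/write latencies on the 3ware device
elvtune -r 2048 -w 8192 /dev/sda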

It seemed to help, in that I could copy about 350 GB back to the RAID
in about 3-4 hours overnight. So I THOUGHT everything would be fine.
But since Tuesday morning my colleagues have been trying to produce
large amounts of data again - and every concurrent I/O operation blocks
all the others. We cannot work with that.

When I am working on the disk all alone, creating a 1 GB file with
time dd if=/dev/zero of=testfile bs=1G count=1
results in real times ranging from 14 seconds, when I am very lucky,
up to 4 minutes usually.
Watching "vmstat 1" shows me that "bo" quickly drops from rates in the
10 or 20 thousands down to about 2 or 3 thousand when the runs take
that long.
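
(In case it helps to reproduce, this is roughly how I run the test while
capturing vmstat at the same time - the log path is just an example:)

vmstat 1 > /tmp/vmstat.log &
VMSTAT_PID=$!
time dd if=/dev/zero of=testfile bs=1G count=1
sync
kill $VMSTAT_PID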

Can any of you please tell me how I can find out whether this is a
kernel problem or a 3ware hardware problem? Is there a way to tell the
difference? Or could it come from running an SMP kernel although I have
only one CPU in the board?
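
(In case it matters, this is roughly how that combination can be checked:)

uname -v                             # the build string contains "SMP" for SMP kernels
grep -c '^processor' /proc/cpuinfo   # number of CPUs the kernel actually sees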

I would be very happy about every answer, really!

Torben



* Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)
@ 2002-12-18 21:10 Con Kolivas
  2002-12-18 22:16 ` Torben Frey
  0 siblings, 1 reply; 23+ messages in thread
From: Con Kolivas @ 2002-12-18 21:10 UTC (permalink / raw)
  To: kernel; +Cc: linux kernel mailing list



>So I copied everything away to a software RAID and tried all the disk
>tuning knobs (min-/max-readahead, bdflush, elvtune). Nothing helped.
>Last Sunday I then found a hint about a bug introduced in kernel
>2.4.19-pre6 which could be fixed either with the "dlh" (disk latency
>hack) or by going back to 2.4.18. The latter is what I did (coming from 2.4.20).

I made the dlh (disk latency hack), and it addresses a problem of system
response under heavy IO load, NOT the actual IO throughput, so this sounds
unrelated. However, I have seen what you describe with ReiserFS and IDE RAID
at least, and had it fixed by applying AA's stuck-in-D fix, which ReiserFS is
more prone to for some complicated reason. Give that a go.

The patch is 9980_fix-pausing-2, in

http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/
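
Something like this should apply it against a vanilla 2.4.20 tree (the
source path is an example, and the file on the server may be stored
compressed, so adjust accordingly):

cd /usr/src/linux-2.4.20
wget http://www.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.20aa1/9980_fix-pausing-2
patch -p1 --dry-run < 9980_fix-pausing-2   # check that it applies cleanly first
patch -p1 < 9980_fix-pausing-2
make dep bzImage modules modules_install   # usual 2.4 rebuild, then install and reboot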

Regards,
Con

* Re: Horrible drive performance under concurrent i/o jobs (dlh problem?)
@ 2002-12-19 14:29 Torben Frey
  2002-12-20  1:47 ` Nuno Silva
  2002-12-20 14:27 ` Roger Larsson
  0 siblings, 2 replies; 23+ messages in thread
From: Torben Frey @ 2002-12-19 14:29 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Con Kolivas, linux kernel mailing list

OK, so now I have set up a backup software RAID 0, formatted with

mke2fs -b 4096 -j -R stride=16

and mounted that device. After starting to back up data from the 3ware
controller to the software RAID I soon had complaints from my colleagues
because the system load went up to 4 and they saw the same bad
responsiveness as before. Of course I Ctrl-C'ed my "cp -av" while
watching "vmstat 1" in another window - and this is what surprised me:
after I stopped the copy job, data was still being written to the backup
software RAID for another 22 seconds. Is this a hint at where the
problem could be? I see the same "feature" when I write to the 3ware.
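
(As a quick sanity check on the stride value above, assuming the
raidtools default chunk size of 64 KiB on the software RAID - I did not
write down the exact chunk size:)

# stride = RAID chunk size / filesystem block size
echo $((65536 / 4096))   # prints 16, matching -R stride=16 with 4 KiB blocks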

My kernel is 2.4.20 with Andrew's patch from last night.

Greetings,
Torben

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 1 3 25292 2188 72056 759644 0 0 4168 23732 1069 780 6 26 68
0 1 3 25292 2176 72056 759644 0 0 0 15728 523 147 1 10 89
2 0 4 25292 2292 72056 759548 0 0 30404 20084 820 1149 4 85 11
0 1 3 25292 2828 72048 759012 0 0 40716 23772 845 1307 2 62 36
0 1 2 25292 3208 72280 758372 0 0 0 16532 573 231 6 13 81
1 0 2 25292 3216 72276 758404 0 0 4880 23800 530 264 2 10 88
0 1 3 25292 2224 72224 759420 0 0 22596 15620 695 602 5 38 57
0 1 3 25292 3996 72204 757692 0 0 26704 23808 765 924 3 38 59
1 0 3 25292 3932 72208 757760 0 0 14380 23996 651 548 3 23 74
0 1 3 25292 2180 72296 759408 0 0 39024 15948 850 1089 3 56 41
1 0 3 25292 3180 72308 758416 0 0 39028 17568 957 1265 1 57 42
0 1 2 25292 3296 72260 758328 0 0 36976 24000 837 1194 5 47 49
1 0 2 25292 3212 72264 758428 0 0 6164 22000 594 330 2 15 83
0 1 3 25292 2212 72268 759412 0 0 44492 16116 878 1433 2 63 35
1 0 3 25292 2896 72056 758952 0 0 19180 24556 683 912 1 32 67
0 0 2 25292 3300 72068 758672 0 0 10564 24296 636 518 1 23 76
HERE WAS MY CTRL-C
1 0 2 25292 3292 72068 758672 0 0 0 15820 511 146 1 15 84
0 0 2 25292 3276 72068 758672 0 0 0 27720 607 341 0 46 54
0 0 2 25292 3276 72068 758672 0 0 0 15912 529 167 1 12 87
0 0 2 25292 3232 72112 758672 0 0 0 23880 537 199 5 7 88
0 0 2 25292 3232 72112 758672 0 0 0 15872 558 198 0 8 92
0 0 2 25292 3232 72112 758672 0 0 0 23740 517 168 4 6 90
0 0 2 25292 4620 72112 757320 0 0 0 23800 528 1044 8 12 80
0 0 2 25292 4620 72112 757320 0 0 0 16100 522 177 4 6 90
0 0 2 25292 4516 72216 757320 0 0 0 24268 558 192 2 5 93
0 0 2 25292 4516 72216 757320 0 0 0 23552 525 179 3 2 95
0 0 2 25292 4516 72216 757320 0 0 0 15872 521 137 0 9 91
0 0 2 25292 4516 72216 757320 0 0 0 31924 515 179 4 7 89
0 0 2 25292 4516 72216 757320 0 0 0 17368 526 144 2 2 96
0 0 1 25292 4488 72244 757320 0 0 0 25308 533 195 1 8 91
0 0 1 25292 4488 72244 757320 0 0 0 16368 504 145 3 12 85
0 0 1 25292 4488 72244 757320 0 0 0 25320 558 247 0 28 72
0 0 1 25292 4484 72244 757320 0 0 0 16376 529 187 3 6 91
0 0 1 25292 4484 72244 757320 0 0 0 24576 508 140 1 3 96
0 0 1 25292 4464 72264 757320 0 0 0 24356 523 197 4 5 91
0 0 1 25292 4464 72264 757320 0 0 0 16364 516 148 1 1 98
0 0 1 25292 4464 72264 757320 0 0 0 24552 507 138 2 1 97
0 0 1 25292 4464 72264 757320 0 0 0 21020 522 174 2 17 81
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 25292 4448 72280 757320 0 0 0 656 347 142 0 8 92
0 0 0 25292 4432 72296 757320 0 0 0 796 353 179 4 2 94

HERE THE WRITING OUT STOPPED, 22 seconds later!!!

0 0 0 25292 4432 72296 757320 0 0 0 0 176 141 1 1 98
0 0 0 25292 4432 72296 757320 0 0 0 0 194 184 2 1 97
0 0 0 25292 4432 72296 757320 0 0 0 0 176 137 1 1 98
0 0 0 25292 4432 72296 757320 0 0 0 0 179 175 2 1 97
0 0 0 25292 4292 72312 757328 116 0 124 32 188 165 1 1 98
0 0 0 25292 4292 72312 757328 0 0 0 0 248 310 2 12 86
0 0 0 25292 4292 72312 757328 0 0 0 0 178 148 1 2 97
0 0 0 25292 4292 72312 757328 0 0 0 0 184 144 0 1 99
0 0 0 25292 4292 72312 757328 0 0 0 0 193 185 2 3 95
0 0 0 25292 4272 72332 757328 0 0 0 640 360 200 1 2 97




Thread overview: 23+ messages
2002-12-18 19:06 Horrible drive performance under concurrent i/o jobs (dlh problem?) Torben Frey
2002-12-20 20:40 ` Joseph D. Wagner
2002-12-20 22:25   ` David Lang
2002-12-21  6:00     ` Joseph D. Wagner
2002-12-23  1:29       ` David Lang
2002-12-24  9:18       ` Roy Sigurd Karlsbakk
2002-12-24 17:21         ` jw schultz
2002-12-24 21:00           ` Jeremy Fitzhardinge
2002-12-25  1:34           ` Rik van Riel
2002-12-25  2:02           ` jw schultz
2002-12-25  3:41           ` Barry K. Nathan
2002-12-23 17:48   ` Krzysztof Halasa
2002-12-23 18:13 ` Denis Vlasenko
  -- strict thread matches above, loose matches on Subject: below --
2002-12-18 21:10 Con Kolivas
2002-12-18 22:16 ` Torben Frey
2002-12-18 22:37   ` Andrew Morton
2002-12-18 23:30     ` Torben Frey
2002-12-18 23:46       ` Andrew Morton
2002-12-18 22:40   ` Torben Frey
2002-12-19 14:29 Torben Frey
2002-12-20  1:47 ` Nuno Silva
2002-12-27 13:04   ` Torben Frey
2002-12-20 14:27 ` Roger Larsson
