long stalls

public inbox for linux-kernel@vger.kernel.org

* long stalls
@ 2003-01-08  0:42 Brian Tinsley
  2003-01-08  1:51 ` Russell Leighton
  2003-01-08  2:44 ` Brian Gerst
  0 siblings, 2 replies; 8+ messages in thread
From: Brian Tinsley @ 2003-01-08  0:42 UTC (permalink / raw)
  To: linux-kernel

We have been having terrible problems with long stalls, meaning from a 
couple of minutes to an hour, happening when filesystem I/O load gets 
high. The system time as reported by vmstat or sar will increase up to 
99% and as it spreads to each processor, the system becomes completely
unresponsive (except that it responds to pings just fine - 
interesting!). When the system finally returns to the world of the 
living, the only evidence that something bad has happened is the runtime 
for kswapd is abnormally high. I have seen this happen with the stock 
2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV machines (either 
4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've searched the lkml 
archives and google and have found several similar postings, but there 
is never an explanation or resolution. Any help would be *very* much 
appreciated! If any info from the system in question is desired, I will 
be glad to provide it.
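
The system-time figure that vmstat and sar report is derived from the kernel's CPU counters in /proc/stat. A minimal sketch, assuming the usual "cpu user nice system idle ..." line layout, of computing the share of time spent in the kernel between two samples. The function name `system_pct` is hypothetical and not part of any tool mentioned in this thread:

```python
def system_pct(before: str, after: str) -> float:
    """Percentage of CPU time spent in kernel mode between two 'cpu' lines.

    Each argument is the aggregate 'cpu' line of /proc/stat, whose first
    four counters are user, nice, system and idle (in clock ticks).
    Assumption: only the 'system' counter is treated as system time.
    """
    def counters(line: str) -> list[int]:
        fields = line.split()
        if not fields or fields[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # Cap at 8 counters: later kernels append guest fields that are
        # already included in 'user' and would be double counted.
        return [int(v) for v in fields[1:9]]

    old, new = counters(before), counters(after)
    deltas = [n - o for o, n in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Index 2 is the 'system' counter.
    return 100.0 * deltas[2] / total
```

Reading the first line of /proc/stat twice, a few seconds apart, and passing both strings to `system_pct` should give a figure roughly comparable to the system column of vmstat, which makes it possible to log the climb towards 99% with timestamps.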

-- 

-[========================]-
-[      Brian Tinsley     ]-
-[ Chief Systems Engineer ]-
-[        Emageon         ]-
-[========================]-


* Re: long stalls
  2003-01-08  0:42 long stalls Brian Tinsley
@ 2003-01-08  1:51 ` Russell Leighton
  2003-01-08  2:16   ` Brian Tinsley
                     ` (2 more replies)
  2003-01-08  2:44 ` Brian Gerst
  1 sibling, 3 replies; 8+ messages in thread
From: Russell Leighton @ 2003-01-08  1:51 UTC (permalink / raw)
  To: Brian Tinsley; +Cc: linux-kernel


I can't help, but I can echo a "me too".

We only see it when I have 2 file I/O intensive processes...they both 
will just stop for some few seconds, system seems idle...then
they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID Controller .

Brian Tinsley wrote:

> We have been having terrible problems with long stalls, meaning from a 
> couple of minutes to an hour, happening when filesystem I/O load gets 
> high. The system time as reported by vmstat or sar will increase up to 
> 99% and as it spreads to each processor, the system becomes completely
> unresponsive (except that it responds to pings just fine - 
> interesting!). When the system finally returns to the world of the 
> living, the only evidence that something bad has happened is the 
> runtime for kswapd is abnormally high. I have seen this happen with 
> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV 
> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've 
> searched the lkml archives and google and have found several similar 
> postings, but there is never an explanation or resolution. Any help 
> would be *very* much appreciated! If any info from the system in 
> question is desired, I will be glad to provide it.
>
>
>




* Re: long stalls
  2003-01-08  1:51 ` Russell Leighton
@ 2003-01-08  2:16   ` Brian Tinsley
  2003-01-08  4:07     ` Russell Leighton
  2003-01-08  4:00   ` Russell Leighton
  2003-01-08 15:17   ` Juergen Sawinski
  2 siblings, 1 reply; 8+ messages in thread
From: Brian Tinsley @ 2003-01-08  2:16 UTC (permalink / raw)
  To: Russell Leighton; +Cc: linux-kernel

Out of curiosity, which RH kernel are you using? I moved on to 2.4.19 
and 2.4.20 primarily because the RH 2.4.18 series of kernels apparently 
has a scheduler bug (at least one) that causes the heartbeat software 
from Linux-HA to lose heartbeat signals and fail over. Not a good
scenario when you are trying to provide HA systems to hospitals!


Russell Leighton wrote:

>
> I can't help, but I can echo a "me too".
>
> We only see it when I have 2 file I/O intensive processes...they both 
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID 
> Controller .
>
> Brian Tinsley wrote:
>
>> We have been having terrible problems with long stalls, meaning from 
>> a couple of minutes to an hour, happening when filesystem I/O load 
>> gets high. The system time as reported by vmstat or sar will increase 
>> up to 99% and as it spreads to each processor, the system becomes
>> completely unresponsive (except that it responds to pings just fine - 
>> interesting!). When the system finally returns to the world of the 
>> living, the only evidence that something bad has happened is the 
>> runtime for kswapd is abnormally high. I have seen this happen with 
>> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV 
>> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). 
>> I've searched the lkml archives and google and have found several 
>> similar postings, but there is never an explanation or resolution. 
>> Any help would be *very* much appreciated! If any info from the 
>> system in question is desired, I will be glad to provide it.
>>
>>
>>
>

-- 

-[========================]-
-[      Brian Tinsley     ]-
-[ Chief Systems Engineer ]-
-[        Emageon         ]-
-[========================]-





* Re: long stalls
  2003-01-08  2:16   ` Brian Tinsley
@ 2003-01-08  4:07     ` Russell Leighton
  0 siblings, 0 replies; 8+ messages in thread
From: Russell Leighton @ 2003-01-08  4:07 UTC (permalink / raw)
  To: Brian Tinsley; +Cc: linux-kernel


I am pretty sure we are at 2.4.19.

Brian Tinsley wrote:

> Out of curiosity, which RH kernel are you using? I moved on to 2.4.19 
> and 2.4.20 primarily because the RH 2.4.18 series of kernels 
> apparently has a scheduler bug (at least one) that causes the 
> heartbeat software from Linux-HA to lose heartbeat signals and
> fail over. Not a good scenario when you are trying to provide HA
> systems to hospitals!
>
>
> Russell Leighton wrote:
>
>>
>> I can't help, but I can echo a "me too".
>>
>> We only see it when I have 2 file I/O intensive processes...they both 
>> will just stop for some few seconds, system seems idle...then
>> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID 
>> Controller .
>>
>> Brian Tinsley wrote:
>>
>>> We have been having terrible problems with long stalls, meaning from 
>>> a couple of minutes to an hour, happening when filesystem I/O load 
>>> gets high. The system time as reported by vmstat or sar will 
>>> increase up to 99% and as it spreads to each processor, the system
>>> becomes completely unresponsive (except that it responds to pings 
>>> just fine - interesting!). When the system finally returns to the 
>>> world of the living, the only evidence that something bad has 
>>> happened is the runtime for kswapd is abnormally high. I have seen 
>>> this happen with the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP 
>>> PIII and PIV machines (either 4GB or 8GB RAM, all SCSI disks, dual 
>>> GigE NICs). I've searched the lkml archives and google and have 
>>> found several similar postings, but there is never an explanation or 
>>> resolution. Any help would be *very* much appreciated! If any info 
>>> from the system in question is desired, I will be glad to provide it.
>>>
>>>
>>>
>>
>




* Re: long stalls
  2003-01-08  1:51 ` Russell Leighton
  2003-01-08  2:16   ` Brian Tinsley
@ 2003-01-08  4:00   ` Russell Leighton
  2003-01-08 15:17   ` Juergen Sawinski
  2 siblings, 0 replies; 8+ messages in thread
From: Russell Leighton @ 2003-01-08  4:00 UTC (permalink / raw)
  To: Brian Tinsley; +Cc: linux-kernel


Minor correction: 3ware RAID controller.

Russell Leighton wrote:

>
> I can't help, but I can echo a "me too".
>
> We only see it when I have 2 file I/O intensive processes...they both 
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID 
> Controller .
>
> Brian Tinsley wrote:
>
>> We have been having terrible problems with long stalls, meaning from 
>> a couple of minutes to an hour, happening when filesystem I/O load 
>> gets high. The system time as reported by vmstat or sar will increase 
>> up to 99% and as it spreads to each processor, the system becomes
>> completely unresponsive (except that it responds to pings just fine - 
>> interesting!). When the system finally returns to the world of the 
>> living, the only evidence that something bad has happened is the 
>> runtime for kswapd is abnormally high. I have seen this happen with 
>> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV 
>> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). 
>> I've searched the lkml archives and google and have found several 
>> similar postings, but there is never an explanation or resolution. 
>> Any help would be *very* much appreciated! If any info from the 
>> system in question is desired, I will be glad to provide it.
>>
>>
>>
>




* Re: long stalls
  2003-01-08  1:51 ` Russell Leighton
  2003-01-08  2:16   ` Brian Tinsley
  2003-01-08  4:00   ` Russell Leighton
@ 2003-01-08 15:17   ` Juergen Sawinski
  2 siblings, 0 replies; 8+ messages in thread
From: Juergen Sawinski @ 2003-01-08 15:17 UTC (permalink / raw)
  To: linux-kernel@vger

On Wed, 2003-01-08 at 02:51, Russell Leighton wrote:
> 
> I can't help, but I can echo a "me too".
> 
> We only see it when I have 2 file I/O intensive processes...they both 
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID Controller .

Same thing here with a Promise SX6000 RAID controller (P4, 1GB RAM,
system is completely on RAID, 2.4.20-pre10-ac1). But, this seems not to
be related. At least in my case, it's the controller that causes the
stalls, 'cause only processes depending on file IO (including swap) get
into D state. Everything else just runs fine.
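
The D state mentioned above is uninterruptible sleep, which the kernel exposes as the state field of /proc/<pid>/stat. A minimal sketch, assuming a mounted /proc, of listing the processes currently in that state. The helper names `parse_stat` and `blocked_processes` are hypothetical:

```python
import os

def parse_stat(stat_line: str) -> tuple[int, str, str]:
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' rather than on whitespace.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    command = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, command, state

def blocked_processes(proc_root: str = "/proc") -> list[tuple[int, str]]:
    """List (pid, command) for processes in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, command, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry is unreadable; skip it
        if state == "D":
            blocked.append((pid, command))
    return blocked
```

Calling `blocked_processes()` repeatedly during a stall would show whether only I/O-bound processes are stuck, as described here, or whether everything is.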

George

-- 
Juergen "George" Sawinski                  |  Phone:  +49-6221-486-308
Max-Planck Institute for Medical Research  |  Fax:    +49-6221-486-325
Dept. of Biomedical Optics                 |  Mobile: +49-171-532 5302
Jahnstr. 29                                |  
D-69120 Heidelberg                         |  
Germany                                    |  

GPG Key/Fingerprint: 9A5F7A31/86F2E5D5EDF4D9983BDD3F23986F154F9A5F7A31



* Re: long stalls
  2003-01-08  0:42 long stalls Brian Tinsley
  2003-01-08  1:51 ` Russell Leighton
@ 2003-01-08  2:44 ` Brian Gerst
  2003-01-08  2:48   ` Brian Tinsley
  1 sibling, 1 reply; 8+ messages in thread
From: Brian Gerst @ 2003-01-08  2:44 UTC (permalink / raw)
  To: Brian Tinsley; +Cc: linux-kernel

Brian Tinsley wrote:

> We have been having terrible problems with long stalls, meaning from a
> couple of minutes to an hour, happening when filesystem I/O load gets
> high. The system time as reported by vmstat or sar will increase up to
> 99% and as it spreads to each processor, the system becomes completely
> unresponsive (except that it responds to pings just fine -
> interesting!). When the system finally returns to the world of the
> living, the only evidence that something bad has happened is the runtime
> for kswapd is abnormally high. I have seen this happen with the stock
> 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV machines (either
> 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've searched the lkml
> archives and google and have found several similar postings, but there
> is never an explanation or resolution. Any help would be *very* much
> appreciated! If any info from the system in question is desired, I will
> be glad to provide it.
>
>
>
With 4GB of memory you are likely bouncing I/O requests to low memory.
This has been fixed in 2.5.  I do not know if a backport exists for 2.4.
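
Whether a machine has high memory at all can be read from /proc/meminfo. A minimal sketch, assuming the `MemTotal` and `HighTotal` fields are present; they depend on kernel version and configuration, and a `Bounce` field, where it exists, may be absent on older kernels. The function names `parse_meminfo` and `highmem_share` are hypothetical:

```python
from typing import Optional

def parse_meminfo(text: str) -> dict[str, int]:
    """Map /proc/meminfo field names to their integer values.

    Lines look like 'HighTotal:  3276800 kB'; lines whose value does not
    start with a number are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens and tokens[0].isdigit():
            values[name.strip()] = int(tokens[0])
    return values

def highmem_share(info: dict[str, int]) -> Optional[float]:
    """Fraction of RAM that is high memory, or None if not reported."""
    high, total = info.get("HighTotal"), info.get("MemTotal")
    if high is None or not total:
        return None
    return high / total
```

A large share of high memory only indicates that bouncing is possible. It does not show that bounce buffers are the cause of a particular stall.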

--
				Brian Gerst



* Re: long stalls
  2003-01-08  2:44 ` Brian Gerst
@ 2003-01-08  2:48   ` Brian Tinsley
  0 siblings, 0 replies; 8+ messages in thread
From: Brian Tinsley @ 2003-01-08  2:48 UTC (permalink / raw)
  To: Brian Gerst; +Cc: linux-kernel

Thanks for the reply!

I thought highmem I/O was addressed in 2.4.20? Am I off-base here?

I actually just built a 2.4.20 kernel with highmem debugging turned on. 
We'll see if anything pops up.


Brian Gerst wrote:

> Brian Tinsley wrote:
>
>> We have been having terrible problems with long stalls, meaning from a
>> couple of minutes to an hour, happening when filesystem I/O load gets
>> high. The system time as reported by vmstat or sar will increase up to
>> 99% and as it spreads to each processor, the system becomes completely
>> unresponsive (except that it responds to pings just fine -
>> interesting!). When the system finally returns to the world of the
>> living, the only evidence that something bad has happened is the runtime
>> for kswapd is abnormally high. I have seen this happen with the stock
>> 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV machines (either
>> 4GB or 8GB RAM, all SCSI disks, dual GigE NICs). I've searched the lkml
>> archives and google and have found several similar postings, but there
>> is never an explanation or resolution. Any help would be *very* much
>> appreciated! If any info from the system in question is desired, I will
>> be glad to provide it.
>>
>>
>>
> With 4GB of memory you are likely bouncing I/O requests to low memory.
> This has been fixed in 2.5.  I do not know if a backport exists for 2.4.
>
> -- 
>                 Brian Gerst


-- 

-[========================]-
-[      Brian Tinsley     ]-
-[ Chief Systems Engineer ]-
-[        Emageon         ]-
-[========================]-





Thread overview: 8+ messages
2003-01-08  0:42 long stalls Brian Tinsley
2003-01-08  1:51 ` Russell Leighton
2003-01-08  2:16   ` Brian Tinsley
2003-01-08  4:07     ` Russell Leighton
2003-01-08  4:00   ` Russell Leighton
2003-01-08 15:17   ` Juergen Sawinski
2003-01-08  2:44 ` Brian Gerst
2003-01-08  2:48   ` Brian Tinsley
