* Re: long stalls
2003-01-08 1:51 ` Russell Leighton
@ 2003-01-08 2:16 ` Brian Tinsley
2003-01-08 4:07 ` Russell Leighton
2003-01-08 4:00 ` Russell Leighton
2003-01-08 15:17 ` Juergen Sawinski
2 siblings, 1 reply; 8+ messages in thread
From: Brian Tinsley @ 2003-01-08 2:16 UTC (permalink / raw)
To: Russell Leighton; +Cc: linux-kernel
Out of curiosity, which RH kernel are you using? I moved on to 2.4.19
and 2.4.20 primarily because the RH 2.4.18 series of kernels apparently
has a scheduler bug (at least one) that causes the heartbeat software
from Linux-HA to lose heartbeat signals and fail over. Not a good
scenario when you are trying to provide HA systems to hospitals!
Russell Leighton wrote:
>
> I can't help, but I can echo a "me too".
>
> We only see it when I have 2 file I/O intensive processes...they both
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
> Controller.
>
> Brian Tinsley wrote:
>
>> We have been having terrible problems with long stalls, meaning from
>> a couple of minutes to an hour, happening when filesystem I/O load
>> gets high. The system time as reported by vmstat or sar will increase
>> up to 99% and as it spreads to each processor, the system becomes
>> completely unresponsive (except that it responds to pings just fine -
>> interesting!). When the system finally returns to the world of the
>> living, the only evidence that something bad has happened is the
>> runtime for kswapd is abnormally high. I have seen this happen with
>> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV
>> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs).
>> I've searched the lkml archives and google and have found several
>> similar postings, but there is never an explanation or resolution.
>> Any help would be *very* much appreciated! If any info from the
>> system in question is desired, I will be glad to provide it.
>>
>>
>>
>
--
-[========================]-
-[ Brian Tinsley ]-
-[ Chief Systems Engineer ]-
-[ Emageon ]-
-[========================]-
* Re: long stalls
2003-01-08 2:16 ` Brian Tinsley
@ 2003-01-08 4:07 ` Russell Leighton
0 siblings, 0 replies; 8+ messages in thread
From: Russell Leighton @ 2003-01-08 4:07 UTC (permalink / raw)
To: Brian Tinsley; +Cc: linux-kernel
I am pretty sure we are at 2.4.19.
Brian Tinsley wrote:
> Out of curiosity, which RH kernel are you using? I moved on to 2.4.19
> and 2.4.20 primarily because the RH 2.4.18 series of kernels
> apparently has a scheduler bug (at least one) that causes the
> heartbeat software from Linux-HA to lose heartbeat signals and
> fail over. Not a good scenario when you are trying to provide HA
> systems to hospitals!
>
>
> Russell Leighton wrote:
>
>>
>> I can't help, but I can echo a "me too".
>>
>> We only see it when I have 2 file I/O intensive processes...they both
>> will just stop for some few seconds, system seems idle...then
>> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
>> Controller.
>>
>> Brian Tinsley wrote:
>>
>>> We have been having terrible problems with long stalls, meaning from
>>> a couple of minutes to an hour, happening when filesystem I/O load
>>> gets high. The system time as reported by vmstat or sar will
>>> increase up to 99% and as it spreads to each processor, the system
>>> becomes completely unresponsive (except that it responds to pings
>>> just fine - interesting!). When the system finally returns to the
>>> world of the living, the only evidence that something bad has
>>> happened is the runtime for kswapd is abnormally high. I have seen
>>> this happen with the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP
>>> PIII and PIV machines (either 4GB or 8GB RAM, all SCSI disks, dual
>>> GigE NICs). I've searched the lkml archives and google and have
>>> found several similar postings, but there is never an explanation or
>>> resolution. Any help would be *very* much appreciated! If any info
>>> from the system in question is desired, I will be glad to provide it.
>>>
>>>
>>>
>>
>
* Re: long stalls
2003-01-08 1:51 ` Russell Leighton
2003-01-08 2:16 ` Brian Tinsley
@ 2003-01-08 4:00 ` Russell Leighton
2003-01-08 15:17 ` Juergen Sawinski
2 siblings, 0 replies; 8+ messages in thread
From: Russell Leighton @ 2003-01-08 4:00 UTC (permalink / raw)
To: Brian Tinsley; +Cc: linux-kernel
Minor correction: 3ware RAID controller.
Russell Leighton wrote:
>
> I can't help, but I can echo a "me too".
>
> We only see it when I have 2 file I/O intensive processes...they both
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID
> Controller.
>
> Brian Tinsley wrote:
>
>> We have been having terrible problems with long stalls, meaning from
>> a couple of minutes to an hour, happening when filesystem I/O load
>> gets high. The system time as reported by vmstat or sar will increase
>> up to 99% and as it spreads to each processor, the system becomes
>> completely unresponsive (except that it responds to pings just fine -
>> interesting!). When the system finally returns to the world of the
>> living, the only evidence that something bad has happened is the
>> runtime for kswapd is abnormally high. I have seen this happen with
>> the stock 2.4.17, 2.4.19, and 2.4.20 kernels on SMP PIII and PIV
>> machines (either 4GB or 8GB RAM, all SCSI disks, dual GigE NICs).
>> I've searched the lkml archives and google and have found several
>> similar postings, but there is never an explanation or resolution.
>> Any help would be *very* much appreciated! If any info from the
>> system in question is desired, I will be glad to provide it.
>>
>>
>>
>
* Re: long stalls
2003-01-08 1:51 ` Russell Leighton
2003-01-08 2:16 ` Brian Tinsley
2003-01-08 4:00 ` Russell Leighton
@ 2003-01-08 15:17 ` Juergen Sawinski
2 siblings, 0 replies; 8+ messages in thread
From: Juergen Sawinski @ 2003-01-08 15:17 UTC (permalink / raw)
To: linux-kernel@vger
On Wed, 2003-01-08 at 02:51, Russell Leighton wrote:
>
> I can't help, but I can echo a "me too".
>
> We only see it when I have 2 file I/O intensive processes...they both
> will just stop for some few seconds, system seems idle...then
> they just start again. RH7.3 SMP, Dual PIII, 4GB RAM, 3com RAID Controller.
Same thing here with a Promise SX6000 RAID controller (P4, 1GB RAM,
system is completely on RAID, 2.4.20-pre10-ac1). But this does not seem
to be related. At least in my case, it's the controller that causes the
stalls, because only processes depending on file I/O (including swap) get
into D state. Everything else runs just fine.
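
For anyone wanting to check which processes are stuck in D state
(uninterruptible sleep) during a stall, here is a minimal sketch. It is not
from the original posters; it assumes a Linux /proc filesystem where
/proc/<pid>/stat holds the pid, the command name in parentheses, and then the
one-letter state. The function names are made up for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while being read, or unexpected format
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Running it in a loop while the stall is in progress should show whether only
the I/O-bound processes are blocked, as described above, or whether
everything is affected.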
George
--
Juergen "George" Sawinski | Phone: +49-6221-486-308
Max-Planck Institute for Medical Research | Fax: +49-6221-486-325
Dept. of Biomedical Optics | Mobile: +49-171-532 5302
Jahnstr. 29 |
D-69120 Heidelberg |
Germany |
GPG Key/Fingerprint: 9A5F7A31/86F2E5D5EDF4D9983BDD3F23986F154F9A5F7A31