From: Andreas Haumer <andreas@xss.co.at>
To: Anders Karlsson <anders@trudheim.com>
Cc: Marcelo Tosatti <marcelo@conectiva.com.br>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Linux 2.4.21-rc7
Date: Thu, 12 Jun 2003 11:35:56 +0200
Message-ID: <3EE8497C.1090303@xss.co.at>
In-Reply-To: <1055408183.2552.18.camel@tor.trudheim.com>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hi!
Anders Karlsson wrote:
> On Wed, 2003-06-11 at 21:48, Marcelo Tosatti wrote:
>
>>On Sat, 7 Jun 2003, Andreas Haumer wrote:
>
> [snip]
>
>>>I had this system running under heavy load for about 24 hours
>>>without problems. I then stopped the stress testing, and had
>>>several system freezes since then.
>>>
>>>With system freeze I mean:
>>>
>>>*) machine doesn't answer to ping, no reaction to console
>>> keyboard, no message on the console screen, no message
>>> in logfile, no oops, no noticeable system activity
>
>
> I have this problem without actually stressing the machine too hard. The
> average load on my Thinkpad over a weekend would perhaps be 0.05, yet I
> can have several hard hangs where there seems to be no trace of a hang
> at all in logfiles.
>
I have to admit that "system freeze" is a rather unspecific
symptom. It could have a zillion different causes.
In my case I'm currently chasing SCSI errors which I think
could have something to do with it (besides, it's _not_ an Adaptec
controller, but an LSI 53c1030 with the Fusion MPT driver... :-)
In my server logs I sometimes see SCSI timeouts like this:
[...]
scsi : aborting command due to timeout : pid 1148093, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 00 0f af 00 00 10 00
mptscsih: OldAbort scheduling ABORT SCSI IO (sc=dfca8e00)
IOs outstanding = 3
mptscsih: ioc0: Issue of TaskMgmt Successful!
SCSI host 0 abort (pid 1148093) timed out - resetting
SCSI bus is being reset for host 0 channel 0.
mptscsih: OldReset scheduling BUS_RESET (sc=dfca8e00)
IOs outstanding = 4
SCSI Error Report =-=-= (0:0:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 2A 00 00 3C 4D 78 00 00 02 00 - "WRITE(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:1:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 00 0F AF 00 00 10 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:2:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 00 4E 0A 37 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
SCSI Error Report =-=-= (0:3:0)
SCSI_Status=02h (CHECK CONDITION)
Original_CDB[]: 28 00 03 B0 08 6F 00 00 08 00 - "READ(10)"
SenseData[20h]: 70 00 06 00 00 00 00 18 00 00 00 00 29 02 00 00 00 00 ...
SenseKey=6h (UNIT ATTENTION); FRU=00h
ASC/ASCQ=29h/02h "SCSI BUS RESET OCCURRED"
[...]
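For reference, the SenseKey and ASC/ASCQ values in these reports
can be read straight from the SenseData bytes. The following is
only a minimal sketch, assuming fixed-format sense data (response
code 70h or 71h) as laid out in the SCSI Primary Commands
standard; the function name and the abbreviated sense key table
are illustrative and not part of the mptscsih driver.

```python
# Subset of the sense key names; anything else is reported as "UNKNOWN".
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0xB: "ABORTED COMMAND",
}


def decode_fixed_sense(hex_bytes):
    """Decode the key fields of fixed-format SCSI sense data.

    hex_bytes is a string of space-separated hex bytes, as printed
    in the SenseData[] log line (without the trailing "...").
    """
    data = bytes.fromhex(hex_bytes)
    if len(data) < 14:
        raise ValueError("need at least 14 bytes of sense data")
    response_code = data[0] & 0x7F      # bit 7 is the VALID flag
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = data[2] & 0x0F                # low nibble of byte 2
    return {
        "sense_key": key,
        "sense_key_name": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": data[12],                # additional sense code
        "ascq": data[13],               # additional sense code qualifier
        "fru": data[14] if len(data) > 14 else None,
    }
```

Fed the bytes from the reports above, this gives sense key 6h
(UNIT ATTENTION) with ASC/ASCQ 29h/02h, which agrees with the
driver's own decoding.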
There are 4 hot swap SCSI disks in the server, and all of them
eventually report these timeouts (so it's not specific to a single
disk).
I have already replaced the cabling and tried a different hot swap
(SCA) cage, and I'm now replacing the disks one by one to find
the culprit.
There are two problems with this approach:
1.) After each change I have to wait anywhere from several hours
to two days for a SCSI timeout to occur, as I cannot reproduce
the problem at will.
2.) I'm not _sure_ if those SCSI timeouts are related to the server
freeze symptoms I see. It's just an assumption.
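Since the timeouts cannot be triggered at will, it may help to
tally them per device from the kernel log after each hardware
change. The sketch below assumes the log lines have exactly the
"aborting command due to timeout" format shown above; the
function name is hypothetical.

```python
import re
from collections import Counter

# Matches the timeout line format shown in the log excerpt above.
TIMEOUT_RE = re.compile(
    r"aborting command due to timeout : pid (\d+), "
    r"scsi(\d+), channel (\d+), id (\d+), lun (\d+)"
)


def tally_timeouts(lines):
    """Count SCSI command timeouts per (host, channel, id, lun)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            # Group 1 is the command pid; groups 2-5 identify the device.
            host, channel, target, lun = (int(g) for g in match.groups()[1:])
            counts[(host, channel, target, lun)] += 1
    return counts
```

Called with the lines of a log file, it returns a Counter keyed
by device, so a disk or slot that times out far more often than
the others would stand out.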
IMHO it could work as follows: SCSI timeouts occur sometimes.
The driver then aborts the command and resets the SCSI bus
to get it into a sane state again. But what if the bus reset
doesn't work as expected and the bus remains unusable for a
while? Could this bring the whole system into this "freeze"
state (the system is still running, but everything waits for
the SCSI bus to recover)? Could this explain the long delays
in ICMP ping replies I saw?
So the most precious resource for chasing this problem is time,
and that is also the resource I have less of than I'd
like... :-(
>
>>Maybe the NMI oopser helps?
>
>
> Marcelo, where can I get hold of this and would there be documentation
> included with it for how to install/use it?
>
Look at /usr/src/linux/Documentation/nmi_watchdog.txt
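As far as that file describes it, the watchdog is enabled with
the nmi_watchdog=1 (or =2) boot parameter, and the NMI line in
/proc/interrupts should then show rising counts. A rough sketch
of that check follows, assuming the usual "NMI: <count per CPU>"
line layout; both function names are made up for illustration.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counters from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break           # trailing description, if any
                counts.append(int(field))
            return counts
    return []                       # no NMI line found


def watchdog_ticking(before, after):
    """True if every CPU's NMI counter increased between two samples."""
    if not before or len(before) != len(after):
        return False
    return all(b > a for a, b in zip(before, after))
```

To use it, read /proc/interrupts twice a few seconds apart and
pass the two results of nmi_counts() to watchdog_ticking(). If
the counters stay at zero, the watchdog is most likely not active
on that machine.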
Regards,
- - andreas
- --
Andreas Haumer | mailto:andreas@xss.co.at
*x Software + Systeme | http://www.xss.co.at/
Karmarschgasse 51/2/20 | Tel: +43-1-6060114-0
A-1100 Vienna, Austria | Fax: +43-1-6060114-71
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQE+6El7xJmyeGcXPhERAqykAKCumORTm/lDofkrg52FX33rOfgC/ACeNxR7
l9/znrbi0lZoR/zw+LTdNhI=
=W7Gt
-----END PGP SIGNATURE-----