From: Chris Worley <cworley@lnxi.com>
To: linux-scsi@vger.kernel.org
Subject: SCSI Timeout issue w/ QLA2460
Date: 25 Sep 2003 11:36:51 -0600
Message-ID: <1064511398.9229.35490.camel@localhost.localdomain>
Background: 2.4.21 kernel.org kernel w/ QLogic 2460 HBA and QLogic's driver
(modified for 4096 scatter-gather list size). Eight servers are
directly connected to a DDN S2A8000 SAN (dual controller, eight FC ports,
2GB file system), no FC switch.
Problem: The DDN starts reporting command aborts from a server, although no
command has taken an inordinate amount of time to process. Six
minutes later, the server reports a SCSI error. The problem is
infrequent, affects only one host server at a time, occurs under heavy
load, and has been seen on 6 of the 8 hosts.
What I'd like to know:
o Any insight into what's happening, how to fix it, or how to
debug it.
o Why is there such a long period between the SAN seeing aborted
requests, and the host complaining?
o Is the SCSI layer resending and re-losing the same request
over and over, or are many different requests being lost (in
general, how do the SAN error messages correlate to the server
error messages)?
o Are there any bottlenecks in the SCSI layer that would limit
the number of outstanding requests, and is there any way to
increase the number of outstanding commands? (This is a performance
issue: we're only seeing about 20 outstanding commands at peak
usage, from the SAN's perspective, and this SAN is designed to
handle many more requests simultaneously.)
First, the SAN begins reporting SCSI aborts from the host. When the
problem occurs, it happens on just one host at a time (but has happened
to many hosts). The "port number" maps directly to the host system:
> Sep 23 16:50:47 192.168.0.252 Sep 23 16:43:02 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:3 T:230 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:3 OX_ID:2058
> Sep 23 16:50:48 192.168.0.252 Sep 23 16:43:02 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:1 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:1 OX_ID:2EC8
> Sep 23 16:51:10 192.168.0.252 Sep 23 16:43:25 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:6 T:229 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:6 OX_ID:2568
> Sep 23 16:51:11 192.168.0.252 Sep 23 16:43:25 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:0 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:0 OX_ID:2C58
> Sep 23 16:51:33 192.168.0.252 Sep 23 16:43:48 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:3 T:229 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:3 OX_ID:2C88
> Sep 23 16:51:34 192.168.0.252 Sep 23 16:43:48 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:5 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:5 OX_ID:2CB8
> Sep 23 16:51:56 192.168.0.252 Sep 23 16:44:11 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:2 T:230 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:2 OX_ID:2CE8
> Sep 23 16:51:57 192.168.0.252 Sep 23 16:44:11 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:1 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:1 OX_ID:2D18
> Sep 23 16:52:19 192.168.0.252 Sep 23 16:44:34 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:4 T:230 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:4 OX_ID:2D48
> Sep 23 16:52:20 192.168.0.252 Sep 23 16:44:34 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:7 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:7 OX_ID:2D78
> ...
The SAN vendor adds this to explain the above:
The command aborts mean that the host is canceling the
outstanding transaction. The op tells you what the SCSI op code
was for that I/O request, e.g. 28 = read, 2A = write. Let's look
at "stats delay" and "host status" to see if there is a reason
why there is a delay, e.g. if we took too long to process an I/O
request. If not, it would look like maybe something got locked
on the host side, causing the I/O to be delayed and eventually
timed out.
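One way to approach the question of whether the same request is being resent is to group the SAN's abort messages by opcode and by the a: and l: fields. The Python sketch below does that for the first four log lines quoted above. It assumes, without confirmation from DDN, that a: is the starting block address and l: the transfer length; the opcode names follow the vendor's note (28 = read, 2A = write), and group_aborts is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Opcode names per the vendor's note: 28 = read, 2A = write.
OPCODES = {"28": "READ(10)", "2A": "WRITE(10)"}

# Field meanings for "a:" (address) and "l:" (length) are assumptions.
ABORT_RE = re.compile(
    r"Command Aborted: SCSI cmd:(?P<cmd>[0-9A-F]{2}) LUN (?P<lun>\d+) .*?"
    r"a:(?P<addr>[0-9A-F]+) l:\s*(?P<length>[0-9A-F]+) .*?"
    r"port:(?P<port>\d+) lane:\d+ OX_ID:(?P<ox_id>[0-9A-F]+)"
)

# First four abort lines from the DDN log in this post, verbatim.
SAMPLE = [
    "Sep 23 16:50:47 192.168.0.252 Sep 23 16:43:02 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:3 T:230 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:3 OX_ID:2058",
    "Sep 23 16:50:48 192.168.0.252 Sep 23 16:43:02 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:1 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:1 OX_ID:2EC8",
    "Sep 23 16:51:10 192.168.0.252 Sep 23 16:43:25 192.168.2.206 DMT_45 Command Aborted: SCSI cmd:2A LUN 0 DMT_45 Lane:6 T:229 a:C2ACFADF l: 200 00/00 01,01 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:6 OX_ID:2568",
    "Sep 23 16:51:11 192.168.0.252 Sep 23 16:43:25 192.168.2.206 DMT_49 Command Aborted: SCSI cmd:28 LUN 0 DMT_49 Lane:0 T:230 a:C2ACF9FF l: 8 00/00 02,02 W:RDY AB Anonymous WWN:210000E08B0A26B7 port:4 lane:0 OX_ID:2C58",
]


def group_aborts(lines):
    """Map (opcode, address, length) to the list of OX_IDs aborted for it."""
    groups = defaultdict(list)
    for line in lines:
        m = ABORT_RE.search(line)
        if m is None:
            continue  # not an abort message
        key = (OPCODES.get(m["cmd"], m["cmd"]), m["addr"], m["length"])
        groups[key].append(m["ox_id"])
    return dict(groups)


for key, ox_ids in group_aborts(SAMPLE).items():
    print(key, "aborted", len(ox_ids), "times, OX_IDs:", ox_ids)
```

On these four lines it yields only two groups, each with a fresh OX_ID per abort, which would be consistent with the same two requests being reissued, though that reading depends on the field assumptions above.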
Later, they were able to determine that there isn't any unreasonable
delay on any of the requests coming through the SAN's queue, and, in
fact, their system could be handling many times the number of
outstanding commands simultaneously (which would lead to better
performance... if anybody knows of anything constraining outstanding
SCSI commands on the kernel side, I'd like to know).
After the DDN starts displaying these errors, I can log into the server
associated with the port: everything looks good... I can browse the file
system without error or delay. During this time, the DDN continues to
spit out the error message every few seconds (in bursts of about ten
messages at a time). After about 6 minutes, I get the errors on the
host system:
from scsi_io_completion/scsi_lib.c:
> Sep 23 16:58:04 192.168.1.1 kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 20000
from scsi_end_request/scsi_lib.c:
> Sep 23 16:58:04 192.168.1.1 kernel: I/O error: dev 08:41, sector 3266116256
> Sep 23 16:58:05 192.168.1.1 kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 20000
> Sep 23 16:58:05 192.168.1.1 kernel: I/O error: dev 08:41, sector 3266116032
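The host messages can be unpacked a little further. The Python sketch below assumes the 2.4 kernel prints the return code and the dev major:minor pair in hexadecimal, that the result word is laid out as driver/host/message/status bytes with the DID_* host codes from the kernel's scsi.h, and that the partition starts at sector 63 (the conventional DOS offset, which is not stated in this post). The function names are illustrative only.

```python
# DID_* host byte values as listed in the 2.4-era SCSI headers (assumed layout).
HOST_BYTE = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_result(code_hex):
    """Split a hex 'return code' into its four bytes."""
    code = int(code_hex, 16)
    host = (code >> 16) & 0xFF
    return {
        "status": code & 0xFF,
        "msg": (code >> 8) & 0xFF,
        "host": HOST_BYTE.get(host, hex(host)),
        "driver": (code >> 24) & 0xFF,
    }


def decode_dev(dev):
    """Turn 'MM:mm' (hex) into (major, minor, name); only handles sd major 8."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    disk = chr(ord("a") + minor // 16)   # 16 minors per disk on major 8
    partition = minor % 16               # 0 means the whole disk
    return major, minor, f"sd{disk}{partition or ''}"


PART_START = 63  # assumption: partition begins at the classic DOS offset

# Failing sectors from the host log and the a: values from the SAN log.
host_sectors = [3266116256, 3266116032]
san_addrs = ["C2ACFADF", "C2ACF9FF"]

print(decode_result("20000"))
print(decode_dev("08:41"))
for sector, addr in zip(host_sectors, san_addrs):
    print(sector, "+", PART_START, "== 0x" + addr, ":",
          sector + PART_START == int(addr, 16))
```

Under those assumptions 20000 decodes to host byte DID_BUS_BUSY, dev 08:41 to sde1, and both failing sectors land exactly on the a: values in the SAN's abort messages, which suggests the host errors are for the same two requests the SAN saw aborted.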
Any insight would be appreciated.
Thanks,
Chris