From: "Andrew Kinney" <andykinney@advantagecom.net>
To: linux-scsi@vger.kernel.org
Subject: PERC3/DI aacraid failed disk detection slow
Date: Sun, 13 Mar 2005 02:11:32 -0800
Message-ID: <4233A154.32754.34B61B97@localhost>
Hello,
I'm fairly sure that this is a firmware issue and not a Linux issue,
but I'm hoping someone on this list would know who is the right
person to contact about firmware issues. If you know the right
person to contact, please email me off list with their contact info.
The Dell techs will replace the disk, of course, but that won't solve
the real problem that caused the system to become unresponsive when
the disk failed. We've been grappling with them for just over a year
on this issue and never once have they put me in touch with a
firmware programmer, though they've replaced every component in the
system during the same time.
We have two identical systems exhibiting these non-reproducible
symptoms that only show with full production use (ugh). First it was
drive ID 1 (the 2nd drive) in both systems. Replaced those. Now
it's drive ID 4 (the 5th drive) in both systems. Replaced it on one
system and am now replacing it on the second system. The difference
between the original drive and the replacement? The original was a
QUANTUM ATLAS10K3_36_SCA rev. 120G U160 and the replacement was
either Fujitsu U320 or Seagate U320 depending on what Dell shipped on
that day. I'm fairly sure that the Quantums just have a slightly
flaky drive firmware that locks up under certain conditions unique to
our I/O patterns, but since there is no firmware being developed for
those drives the only option is to replace the drive with a different
brand.
At any rate, since the pattern holds with both systems, it most
probably points at a misbehaved drive model. However, the RAID
controller is still at fault for the entire system going down because
it didn't mark the drive as failed and return control to the OS
within 60 seconds.
The system became unresponsive. SNMP graphing showed a load of 495,
and all network activity stopped shortly after the disk failure. This
probably resulted from a build-up of block I/O after the kernel kicked
the unresponsive storage offline.
The specific error message repeating across the console as fast as it
would print was:
Assertion failure in do_get_write_access() at :0: "jh->b_transaction
== journal->j_committing_transaction"
The following controller log indicates that the failed disk detection
routine within the controller took too long to determine the disk was
failed:
AFA0> diagnostic show history /old=TRUE
Executing: diagnostic show history /old=TRUE
*** HISTORY BUFFER FROM LAST RUN ***
[00]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15179520 : 15179647
[01]: at 5082220 sec
[02]: ID(0:04:0) Cmd[0x28] Fail: Block Range 23749824 : 23749839
[03]: at 5082220 sec
[04]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41313205 : 41313206
[05]: at 5082220 sec
[06]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 5270011 : 5270012 at
[07]: 5082220 sec
[08]: ID(0:04:0) Cmd[0x28] Fail: Block Range 43269255 : 43269256
[09]: at 5082220 sec
[10]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41310245 : 41310246
[11]: at 5082220 sec
[12]: ID(0:04:0) Cmd[0x28] Fail: Block Range 3144832 : 3144959 at
[13]: 5082220 sec
[14]: ID(0:04:0) Cmd[0x28] Fail: Block Range 48800545 : 48800546
[15]: at 5082220 sec
[16]: ID(0:04:0) Cmd[0x28] Fail: Block Range 24652631 : 24652632
[17]: at 5082220 sec
[18]: ID(0:04:0) Cmd[0x28] Fail: Block Range 8102825 : 8102826 at
[19]: 5082220 sec
[20]: ID(0:04:0) Cmd[0x28] Fail: Block Range 59097920 : 59097951
[21]: at 5082220 sec
[22]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5461313 : 5461318 at
[23]: 5082220 sec
[24]: ID(0:04:0) Cmd[0x28] Fail: Block Range 64466133 : 64466134
[25]: at 5082220 sec
[26]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3147136 : 3147263 at
[27]: 5082220 sec
[28]: ID(0:04:0) Cmd[0x28] Fail: Block Range 590215 : 590222 at 5
[29]: 082220 sec
[30]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 12283087 : 12283088
[31]: at 5082220 sec
[32]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3147264 : 3147391 at
[33]: 5082220 sec
[34]: ID(0:04:0) Cmd[0x28] Fail: Block Range 19046144 : 19046271
[35]: at 5082220 sec
[36]: ID(0:04:0) Cmd[0x28] Fail: Block Range 54603697 : 54603698
[37]: at 5082220 sec
[38]: ID(0:04:0) Cmd[0x28] Fail: Block Range 215263 : 215270 at 5
[39]: 082220 sec
[40]: ID(0:04:0) Cmd[0x28] Fail: Block Range 70646759 : 70646764
[41]: at 5082220 sec
[42]: ID(0:04:0) Cmd[0x28] Fail: Block Range 64215 : 64222 at 508
[43]: 2220 sec
[44]: ID(0:04:0) Cmd[0x28] Fail: Block Range 46804736 : 46804751
[45]: at 5082220 sec
[46]: ID(0:04:0) Cmd[0x28] Fail: Block Range 70664653 : 70664654
[47]: at 5082220 sec
[48]: ID(0:04:0) Cmd[0x28] Fail: Block Range 46055040 : 46055167
[49]: at 5082220 sec
[50]: ID(0:04:0) Cmd[0x28] Fail: Block Range 39911821 : 39911830
[51]: at 5082220 sec
[52]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 3146880 : 3147007 at
[53]: 5082220 sec
[54]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5159344 : 5159359 at
[55]: 5082220 sec
[56]: ID(0:04:0) Cmd[0x28] Fail: Block Range 41295373 : 41295374
[57]: at 5082220 sec
[58]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15544320 : 15544447
[59]: at 5082220 sec
[60]: ID(0:04:0) Cmd[0x28] Fail: Block Range 5166793 : 5166794 at
[61]: 5082220 sec
[62]: RAID5 Container 0 Drive 0:4:0 Failure
[63]: ID(0:04:0): Timeout detected on cmd[0x28]
[64]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[65]: ID(0:04:0) Timeout detected on cmd[0x28]
[66]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[67]: ID(0:04:0): Timeout detected on cmd[0x28]
[68]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[69]: ID(0:04:0): Timeout detected on cmd[0x28]
[70]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[71]: ID(0:04:0): Timeout detected on cmd[0x28]
[72]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[73]: ID(0:04:0): Timeout detected on cmd[0x28]
[74]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[75]: ID(0:04:0): Timeout detected on cmd[0x28]
[76]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[77]: ID(0:04:0) Timeout detected on cmd[0x28]
[78]: SCSI Channel[0]: Timeout Detected On 1 Command(s)
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
[80]: 2 can't read mbr dev_t:4
[81]: <...repeats 1 more times>
[82]: can't read config from slice #[4]
[83]: 2 can't read mbr dev_t:4
[84]: can't read config from slice #[4]
[85]: CT_LogMissingEntry: Log missing entry, container 0, dev 4,
[86]: signature 0x8f950a4d, nvEntry 65
[87]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[88]: 950a4d
[89]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[90]: 950a4d
[91]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[92]: 950a4d
[93]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[94]: 950a4d
[95]: CtMarkDead: container 0, deadEntry 4, dev 4, signature 0x8f
[96]: 950a4d
[97]: RAID5 Failover Container 0 No Failover Assigned
[98]: Drive 0:4:0 returning error
[99]:
The controller took 88 seconds (from 5082220 sec to 5082308 sec in the
log above) to determine that the drive had failed. In other words, it took 88
seconds from the time it stopped processing commands from the OS
until it was ready to continue processing commands from the OS. The
kernel killed the storage at 60 seconds, thus hosing the OS since
that was the only storage device. Though the controller came back,
the OS had already given up and couldn't recover.
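The 88-second figure is the difference between the first and last
"at ... sec" timestamps in the history buffer. As a minimal sketch of
that arithmetic, the following Python rejoins the wrapped log lines and
extracts the timestamps from a short excerpt of the log above. The
function and variable names are illustrative only, and the labels for
0x28 and 0x2a assume the standard SCSI READ(10) and WRITE(10) opcodes.

```python
import re
from collections import Counter

# Short excerpt of the history buffer quoted above, wrapped as printed.
HISTORY_EXCERPT = """\
[00]: ID(0:04:0) Cmd[0x28] Fail: Block Range 15179520 : 15179647
[01]: at 5082220 sec
[06]: ID(0:04:0) Cmd[0x2a] Fail: Block Range 5270011 : 5270012 at
[07]: 5082220 sec
[28]: ID(0:04:0) Cmd[0x28] Fail: Block Range 590215 : 590222 at 5
[29]: 082220 sec
[79]: ID(0:04:0) Cmd[0x28] Fail: Block Range 0 : 0 at 5082308 sec
"""

# Assumption: standard SCSI opcodes, not something the controller reports.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

_PREFIX = re.compile(r"^\[\d+\]:\s?")
_FAIL = re.compile(
    r"Cmd\[0x([0-9a-fA-F]{2})\] Fail: "
    r"Block Range (\d+) : (\d+)\s*at\s*(\d+)\s*sec"
)


def parse_failures(text):
    """Return (opcode, first_block, last_block, timestamp) per failed command."""
    # Drop the "[NN]: " prefixes and join the lines with no separator, so a
    # timestamp split across two lines ("at 5" / "082220 sec") is made whole.
    joined = "".join(_PREFIX.sub("", line.strip()) for line in text.splitlines())
    return [
        (int(op, 16), int(first), int(last), int(ts))
        for op, first, last, ts in _FAIL.findall(joined)
    ]


def elapsed_seconds(failures):
    """Seconds between the earliest and latest failure timestamps."""
    stamps = [ts for _, _, _, ts in failures]
    return max(stamps) - min(stamps)


def opcode_counts(failures):
    """Count failed commands per opcode name (hex string if unknown)."""
    return Counter(SCSI_OPCODES.get(op, hex(op)) for op, _, _, _ in failures)


failures = parse_failures(HISTORY_EXCERPT)
print("elapsed:", elapsed_seconds(failures), "sec")
print("failed commands:", dict(opcode_counts(failures)))
```

Run against the full history buffer rather than this excerpt, the same
approach should also show how many of the failed commands were reads
and how many were writes.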
Am I correct in assessing that the controller's firmware is
responsible for this extended delay in detecting the failed disk?
Here's the information on our setup:
PERC3/DI on Dell PowerEdge 2500
5 disk U160 RAID5
AFA0> controller details
Executing: controller details
Controller Information
----------------------
Device Name: AFA0
Controller Type: PERC 3/Di
Access Mode: READ-WRITE
Controller Serial Number: Last Six Digits = 4C20D2
Number of Buses: 2
Devices per Bus: 15
Controller CPU: i960 R series
Controller CPU Speed: 100 Mhz
Controller Memory: 128 Mbytes
Battery State: Ok
Component Revisions
-------------------
CLI: 2.8-0 (Build #6076)
API: 2.8-0 (Build #6076)
Miniport Driver: 1.1-4 (Build #9999)
Controller Software: 2.8-0 (Build #6092)
Controller BIOS: 2.8-0 (Build #6092)
Controller Firmware: (Build #6092)
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net