From: "Andrew Kinney" <andykinney@advantagecom.net>
To: Otto Solares <solca@guug.org>
Cc: Phil Brutsche <phil@brutsche.us>, linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
Date: Tue, 09 Nov 2004 15:49:14 -0800
Message-ID: <4190E6FA.22033.8B6089D3@localhost>
In-Reply-To: <20041109213215.GA4047@guug.org>

FWIW, we had the same problems on two identically configured Dell
PE2500 machines with PERC 3/Di controllers. They were purchased about
two years ago. The problem surfaced when we moved to a 2.4.20 kernel
from an older kernel and our disk system loads increased. It seemed
that a combination of seeking through large numbers of files spread
all over the array (12 million or so), a sequential read of a largish
log file (200MB or more), and a lot of random writes all over the
array caused a single drive in the array to become unresponsive.

When the drive became unresponsive, the upper-layer drivers gave up
waiting for the controller kernel to finish marking the drive as dead
and resume servicing requests to the array, and they offlined the
container (this is our interpretation of the NVRAM controller logs).
The end result is that a single misbehaving drive, cable, or
connector can cause the entire array to be taken offline by the OS.
Obviously, this isn't the intended operation, so there is still
something that isn't happening correctly, but it could be that the
controller just needs to offline the bad disk faster.
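
The kernel messages quoted further down in this thread show what this
failure looks like from the OS side. As a minimal sketch (not something
anyone in the thread ran), the following counts those messages in a
saved kernel log. The function names are hypothetical, and the patterns
are only the message fragments that appear in the quoted logs; other
kernel versions may word them differently.

```python
import re

# Message fragments copied from the kernel logs quoted in this thread.
PATTERNS = {
    "adapter_reset": re.compile(r"aacraid: Host adapter reset request"),
    "device_offlined": re.compile(
        r"Device offlined - not ready after error recovery"
    ),
    "io_rejected": re.compile(r"rejecting I/O to offline device"),
}


def scan_log(lines):
    """Count how many lines match each known failure message."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def container_went_offline(lines):
    """True if the log shows an adapter reset request and an offlined device."""
    counts = scan_log(lines)
    return counts["adapter_reset"] > 0 and counts["device_offlined"] > 0
```

Passing an open log file (or any iterable of lines) to scan_log() gives
a per-message count; container_went_offline() is just a convenience
check on top of it.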

At any rate, our solution (covered by Dell warranties) was to replace
the drive, replace the controller, replace the cables, and replace
the SCSI backplane. We also reinitialized the arrays with a 64KB
stripe size instead of a 32KB stripe size to reduce the physical I/O
overhead associated with many small files. This fixed the problem for
us (so far).
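
To illustrate the stripe size reasoning, here is a rough sketch of the
arithmetic. It assumes the stripe size is the per-disk chunk size and
that a request maps to a contiguous byte range on the array; it ignores
parity, filesystem layout, and controller caching. The 48KB request
size is a made-up example, and stripe_units_touched() is a hypothetical
helper, not part of any tool.

```python
KB = 1024


def stripe_units_touched(offset, length, stripe_size):
    """Number of stripe units (per-disk chunks) a contiguous request spans.

    offset and length are in bytes; stripe_size is the per-disk chunk size.
    """
    if length <= 0 or stripe_size <= 0 or offset < 0:
        raise ValueError("offset must be >= 0; length and stripe_size must be > 0")
    first_unit = offset // stripe_size
    last_unit = (offset + length - 1) // stripe_size
    return last_unit - first_unit + 1


# A hypothetical 48KB request starting at the beginning of a stripe unit.
for stripe in (32 * KB, 64 * KB):
    units = stripe_units_touched(0, 48 * KB, stripe)
    print(f"{stripe // KB}KB stripe: {units} stripe unit(s) touched")
```

Under these assumptions the 48KB request spans two stripe units with a
32KB stripe and one with a 64KB stripe, so fewer physical drives have
to service each small-file access.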

It's tough to say exactly what the fix was since we took a shotgun
approach to the problem, but my guess is that the drive itself wasn't
responding quickly enough. Replacing the drive with a drive of the
same RPM and capacity but designed for U320 operation instead of U160
operation is what I suspect resolved the trouble for us. The logic
chips on the U320 drive appear to process commands faster than those
on the U160 drive, thus limiting the possibility of getting jammed
with commands. Of course, the drive is also a different brand than
the other drives in the array, so that could also have been a factor.

Hopefully that information was useful to you and others on this list.
Unfortunately, I'm not a kernel programmer nor do I have time to
contribute code, so I'm unable to offer anything other than what
solved the issue for us.

Andrew

On 9 Nov 2004 at 15:32, Otto Solares wrote:
> JFYI
>
> I have exactly this same problem on 3 brand new Dell PE2650
> machines with Perc3/Di controllers. My other new Dell servers
> with the Perc4/Di controller have never failed.
>
> Dell customer support sucks, they would not help me as I am
> not running a supported distro/kernel.
>
> The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
> latest ERA/RAC firmware, latest debian sarge, kernel 2.6.10-rc1-bk7.
> Both 2.4 and 2.6 hang the controller.
>
> The problem appears when too much I/O is happening. The kernel
> doesn't die: over an ssh session I can still execute some
> cached binaries like ps, bash, etc. Everything in memory runs
> fine until it touches sda, which is offlined, as you can see
> from these kernel messages:
>
> Nov 5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?
> Nov 5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung
> Nov 5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
> Nov 5 14:54:34 saruman kernel: Device sda not ready.
> Nov 5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0
> Nov 5 14:54:34 saruman kernel:
> Nov 5 14:54:34 saruman kernel: Remounting filesystem read-only
>
> -otto
>
> On Tue, Nov 09, 2004 at 03:22:54PM -0500, Ryan Anderson wrote:
> > On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> > > Andrew Morton wrote:
> > > > Distribution: Debian Sarge
> > > > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in
> > > > a Raid 5 in one container, 8 GB RAM, Dual Xeon 2GHz. The Perc
> > > > 3/Di Controller is on Firmware version 2.80 Build 6092
> > > > Software Environment: aacraid
> >
> > I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives) 4
> > GB Ram, dual 2.4GHz Xeon
> >
> > dmesg tells me I have this specific firmware:
> > AAC0: kernel 2.8.4 build 6092
> > AAC0: monitor 2.8.4 build 6092
> > AAC0: bios 2.8.0 build 6092
> > AAC0: serial 83ac41d3fafaf001
> > scsi0 : percraid
> > Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> > Type: Direct-Access ANSI SCSI revision: 02
> > Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> > Type: Direct-Access ANSI SCSI revision: 02
> >
> >
> > Currently I have 2.6.8 on this machine. (I believe it's actually
> > 2.6.8.1, Debian sometimes blurs things a little bit in terms of
> > backporting a patch.)
> >
> > > > Problem Description:
> > > > The Container on the PERC 3/Di Controller goes offline on heavy
> > > > I/O Load with the following error message:
> > > >
> > > > SCSI:0 (0:0): rejecting I/O to offline device
> > > > Buffer I/O error due to I/O error on sda8
> >
> > That's what I'm seeing. It's rather hard to capture because the
> > only disks on this machine are in the RAID container that keeps
> > going offline.
> >
> > > > Steps to reproduce:
> > > >
> > > > I am using bonnie++ to produce I/O load on the only Volume on
> > > > the Perc 3/Di Controller with the following parameters:
> > > >
> > > > bonnie++ -d /var/lib/postgres/test -s 16000 -n 150 -r 8000 -u nobody:nogroup
> > >
> > > FYI, I have been seeing this as well.
> > >
> > > I can trigger this card lockup at will with mkfs.ext3; for other
> > > filesystems, I may need to extract a kernel source .tar.gz in
> > > order to cause a lockup.
> > >
> > > aacraid: Host adapter reset request. SCSI hang ?
> > > aacraid: Host adapter appears dead
> > > Device offlined - not ready after error recovery: host 1 channel 0 id 0 lun 0
> > > SCSI error : <1 0 0 0> return code = 0x6000000
> > > end_request: I/O error, dev sdb, sector 1667007
> > > Buffer I/O error on device sdb1, logical block 208368
> > > lost page write due to I/O error on sdb1
> > > scsi1 (0:0): rejecting I/O to offline device
> > > Buffer I/O error on device sdb1, logical block 208369
> > >
> > > I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives -
> > > yes, I know about the Seagate firmware timeout problem, these
> > > drives are brand new with firmware rev 0006 and thus aren't
> > > affected.
> > >
> > > This hardware has no problems with kernel 2.4.x.
> >
> > I had similar, but not nearly as bad, problems with 2.4.x.
> > Under 2.4.x, this machine would become unavailable for approximately
> > 20 minutes, and then would recover. The load would be around 20
> > when it came back, and would rapidly drop off.
> >
> > This was with 2.4.20, using a config modeled off the Debian
> > 2.4.18-bf2.4 config.
> >
> > Due to this problem, my machine is no longer a production machine,
> > so I can do whatever testing is necessary to fix this.
> >
> > I have gone through the hardware diagnostics process with Dell, with
> > the exception of completing the Elite HD diagnostics on the Fujitsu
> > drives. (The program filled the boot floppy with the log, and I
> > haven't gotten around to rerunning it yet.)
> >
> > I can debug and attempt to reproduce as much as is necessary at this
> > point, if anyone can give me a place to start and/or a patch to
> > apply.
> >
> >
> > --
> >
> > Ryan Anderson
> > AutoWeb Communications, Inc.
> > email: ryan@autoweb.net
>
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net