linux-scsi.vger.kernel.org archive mirror
From: Otto Solares <solca@guug.org>
To: Ryan Anderson <ryan@autoweb.net>
Cc: Phil Brutsche <phil@brutsche.us>, linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
Date: Tue, 9 Nov 2004 15:32:15 -0600
Message-ID: <20041109213215.GA4047@guug.org>
In-Reply-To: <1100031774.24635.157.camel@ryan2.internal.autoweb.net>

JFYI

I have exactly this same problem on 3 brand new Dell PE2650
machines with Perc3/Di controllers; my other new Dell servers
with the Perc4/Di controller have never failed.

Dell customer support sucks, they would not help me as I am
not running a supported distro/kernel.

The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
latest ERA/RAC firmware, latest debian sarge, kernel 2.6.10-rc1-bk7.
Both 2.4 and 2.6 kernels hang the controller.

The problem appears when too much I/O is happening.  The kernel
doesn't die: from an ssh session I can still execute some
cached binaries like ps, bash, etc.  Everything in memory runs
fine until it touches sda, which has been offlined, as you can see
from these kernel messages:

Nov  5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ? 
Nov  5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung 
Nov  5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 
Nov  5 14:54:34 saruman kernel: Device sda not ready. 
Nov  5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0 
Nov  5 14:54:34 saruman kernel:  
Nov  5 14:54:34 saruman kernel: Remounting filesystem read-only
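
A minimal sketch, assuming syslog lines worded like the excerpt above, of
how one might count these events when scanning a log.  The names PATTERNS,
classify and summarize are hypothetical helpers for illustration, not part
of any aacraid tooling, and other kernel versions may word the messages
differently.

```python
import re

# Message fragments copied from the log excerpt above (assumption: the
# wording is stable for the kernel being examined).
PATTERNS = {
    "reset_request": re.compile(r"aacraid: Host adapter reset request"),
    "bus_hung": re.compile(r"aacraid: SCSI bus appears hung"),
    "offlined": re.compile(r"Device offlined - not ready after error recovery"),
    "io_rejected": re.compile(r"rejecting I/O to offline device"),
}


def classify(line):
    """Return the name of the first pattern found in the line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines):
    """Count how many lines match each known pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] += 1
    return counts


# A few lines taken from the excerpt above, used as sample input.
SAMPLE = [
    "Nov  5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?",
    "Nov  5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung",
    "Nov  5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0",
    "Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device",
    "Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device",
    "Nov  5 14:54:34 saruman kernel: Remounting filesystem read-only",
]

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(name, count)
```

To scan a real log, pass an open file object to summarize() instead of
SAMPLE; the log path depends on the syslog configuration.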

-otto

On Tue, Nov 09, 2004 at 03:22:54PM -0500, Ryan Anderson wrote:
> On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> > Andrew Morton wrote:
> > > Distribution: Debian Sarge
> > > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in 
> > > one container, 8 GB RAM, Dual Xeon 2GHz. The Perc 3/Di Controller is on 
> > > Firmware version 2.80 Build 6092
> > > Software Environment: aacraid
> 
> I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives)
> 4 GB Ram, dual 2.4GHz Xeon
> 
> dmesg tells me I have this specific firmware:
> AAC0: kernel 2.8.4 build 6092
> AAC0: monitor 2.8.4 build 6092
> AAC0: bios 2.8.0 build 6092
> AAC0: serial 83ac41d3fafaf001
> scsi0 : percraid
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
> 
> 
> Currently I have 2.6.8 on this machine.  (I believe it's actually
> 2.6.8.1, Debian sometimes blurs things a little bit in terms of
> backporting a patch.)
> 
> > > Problem Description: 
> > > The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with 
> > > the following error message:
> > > 
> > > SCSI:0 (0:0): rejecting I/O to offline device
> > > Buffer I/O error due to I/O error on sda8
> 
> That's what I'm seeing.  It's rather hard to capture because the only
> disks on this machine are in the RAID container that keeps going
> offline.
> 
> > > Steps to reproduce:
> > > 
> > > I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di 
> > > Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s 
> > > 16000 -n 150 -r 8000 -u nobody:nogroup
> > 
> > FYI, I have been seeing this as well.
> > 
> > I can trigger this card lockup at will with mkfs.ext3; for other
> > filesystems, I may need to extract a kernel source .tar.gz in order to
> > cause a lockup.
> > 
> > aacraid: Host adapter reset request. SCSI hang ?
> > aacraid: Host adapter appears dead
> > Device offlined - not ready after error recovery: host 1 channel 0 id 0
> > lun 0
> > SCSI error : <1 0 0 0> return code = 0x6000000
> > end_request: I/O error, dev sdb, sector 1667007
> > Buffer I/O error on device sdb1, logical block 208368
> > lost page write due to I/O error on sdb1
> > scsi1 (0:0): rejecting I/O to offline device
> > Buffer I/O error on device sdb1, logical block 208369
> > 
> > I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives - yes, I
> > know about the Seagate firmware timeout problem, these drives are brand
> > new with firmware rev 0006 and thus aren't affected.
> > 
> > This hardware has no problems with kernel 2.4.x.
> 
> I had similar, but not nearly as bad, problems with 2.4.x.
> Under 2.4.x, this machine would become unavailable for approximately 20
> minutes, and then would recover.  The load would be around 20 when it
> came back, and would rapidly drop off.
> 
> This was with 2.4.20, using a config modeled off the Debian 2.4.18-bf2.4
> config.
> 
> Due to this problem, my machine is no longer a production machine, so I
> can do whatever testing is necessary to fix this.
> 
> I have gone through the hardware diagnostics process with Dell, with the
> exception of completing the Elite HD diagnostics on the Fujitsu drives. 
> (The program filled the boot floppy with the log, and I haven't gotten
> around to rerunning it yet.)
> 
> I can debug and attempt to reproduce as much as is necessary at this
> point, if anyone can give me a place to start and/or a patch to apply.
> 
> 
> -- 
> 
> Ryan Anderson                
> AutoWeb Communications, Inc. 
> email: ryan@autoweb.net 



