From: Otto Solares <solca@guug.org>
To: Ryan Anderson <ryan@autoweb.net>
Cc: Phil Brutsche <phil@brutsche.us>, linux-scsi@vger.kernel.org
Subject: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
Date: Tue, 9 Nov 2004 15:32:15 -0600
Message-ID: <20041109213215.GA4047@guug.org>
In-Reply-To: <1100031774.24635.157.camel@ryan2.internal.autoweb.net>

JFYI

I have exactly the same problem on 3 brand-new Dell PE2650
machines with PERC 3/Di controllers; my other new Dell servers
with the PERC 4/Di controller have never failed.

Dell customer support sucks, they would not help me as I am
not running a supported distro/kernel.

The faulty servers have the latest BIOS, PERC 3/Di firmware (6092),
latest ERA/RAC firmware, latest Debian sarge, kernel 2.6.10-rc1-bk7.
Both 2.4 and 2.6 hang the controller.

The problem appears under heavy I/O. The kernel does not die:
from an already open ssh session I can still execute some
cached binaries like ps, bash, etc.  Everything in memory runs
fine until it touches sda, which has been offlined, as you can
see from these kernel messages:

Nov  5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ? 
Nov  5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung 
Nov  5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 
Nov  5 14:54:34 saruman kernel: Device sda not ready. 
Nov  5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0 
Nov  5 14:54:34 saruman kernel:  
Nov  5 14:54:34 saruman kernel: Remounting filesystem read-only
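
For anyone who wants to check their own logs for the same failure, here is
a minimal Python sketch that scans syslog-style lines for the "Device
offlined" message shown above and extracts the host/channel/id/lun tuple.
The names OfflineEvent and find_offline_events are made up for this
example and are not part of aacraid or any existing tool. The regular
expression assumes the exact line layout seen in the log above, so other
syslog formats may need adjusting.

```python
import re
from typing import Iterable, List, NamedTuple


class OfflineEvent(NamedTuple):
    """One 'Device offlined' message (hypothetical helper type)."""
    timestamp: str
    host: int
    channel: int
    target: int
    lun: int


# Assumes the syslog layout seen above:
#   "<Mon> <day> <hh:mm:ss> <hostname> kernel: scsi: Device offlined - ..."
OFFLINED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+kernel:\s+"
    r"scsi: Device offlined - not ready after error recovery: "
    r"host (?P<host>\d+) channel (?P<channel>\d+) "
    r"id (?P<id>\d+) lun (?P<lun>\d+)"
)


def find_offline_events(lines: Iterable[str]) -> List[OfflineEvent]:
    """Return every 'Device offlined' event found in the given log lines."""
    events = []
    for line in lines:
        m = OFFLINED.match(line)
        if m:
            events.append(
                OfflineEvent(
                    timestamp=m.group("ts"),
                    host=int(m.group("host")),
                    channel=int(m.group("channel")),
                    target=int(m.group("id")),
                    lun=int(m.group("lun")),
                )
            )
    return events
```

It can be fed any iterable of lines, for example an open log file. Where
the kernel log lives depends on the syslog configuration, so the path is
left to the reader.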

-otto

On Tue, Nov 09, 2004 at 03:22:54PM -0500, Ryan Anderson wrote:
> On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> > Andrew Morton wrote:
> > > Distribution: Debian Sarge
> > > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in 
> > > one container, 8 GB RAM, Dual Xeon 2GHz. The Perc 3/Di Controller is on 
> > > Firmware version 2.80 Build 6092
> > > Software Environment: aacraid
> 
> I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives)
> 4 GB Ram, dual 2.4GHz Xeon
> 
> dmesg tells me I have this specific firmware:
> AAC0: kernel 2.8.4 build 6092
> AAC0: monitor 2.8.4 build 6092
> AAC0: bios 2.8.0 build 6092
> AAC0: serial 83ac41d3fafaf001
> scsi0 : percraid
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
>   Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
>   Type:   Direct-Access                      ANSI SCSI revision: 02
> 
> 
> Currently I have 2.6.8 on this machine.  (I believe it's actually
> 2.6.8.1, Debian sometimes blurs things a little bit in terms of
> backporting a patch.)
> 
> > > Problem Description: 
> > > The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with 
> > > the following error message:
> > > 
> > > SCSI:0 (0:0): rejecting I/O to offline device
> > > Buffer I/O error due to I/O error on sda8
> 
> That's what I'm seeing.  It's rather hard to capture because the only
> disks on this machine are in the RAID container that keeps going
> offline.
> 
> > > Steps to reproduce:
> > > 
> > > I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di 
> > > Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s 
> > > 16000 -n 150 -r 8000 -u nobody:nogroup
> > 
> > FYI, I have been seeing this as well.
> > 
> > I can trigger this card lockup at will with mkfs.ext3; for other
> > filesystems, I may need to extract a kernel source .tar.gz in order to
> > cause a lockup.
> > 
> > aacraid: Host adapter reset request. SCSI hang ?
> > aacraid: Host adapter appears dead
> > Device offlined - not ready after error recovery: host 1 channel 0 id 0
> > lun 0
> > SCSI error : <1 0 0 0> return code = 0x6000000
> > end_request: I/O error, dev sdb, sector 1667007
> > Buffer I/O error on device sdb1, logical block 208368
> > lost page write due to I/O error on sdb1
> > scsi1 (0:0): rejecting I/O to offline device
> > Buffer I/O error on device sdb1, logical block 208369
> > 
> > I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives - yes, I
> > know about the Seagate firmware timeout problem, these drives are brand
> > new with firmware rev 0006 and thus aren't affected.
> > 
> > This hardware has no problems with kernel 2.4.x.
> 
> I had similar, but not nearly as bad, problems with 2.4.x.
> Under 2.4.x, this machine would become unavailable for approximately 20
> minutes, and then would recover.  The load would be around 20 when it
> came back, and would rapidly drop off.
> 
> This was with 2.4.20, using a config modeled off the Debian 2.4.18-bf2.4
> config.
> 
> Due to this problem, my machine is no longer a production machine, so I
> can do whatever testing is necessary to fix this.
> 
> I have gone through the hardware diagnostics process with Dell, with the
> exception of completing the Elite HD diagnostics on the Fujitsu drives. 
> (The program filled the boot floppy with the log, and I haven't gotten
> around to rerunning it yet.)
> 
> I can debug and attempt to reproduce as much as is necessary at this
> point, if anyone can give me a place to start and/or a patch to apply.
> 
> 
> -- 
> 
> Ryan Anderson                
> AutoWeb Communications, Inc. 
> email: ryan@autoweb.net 



