All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Chris Lee" <labmonkey42@gmail.com>
To: <linux-kernel@vger.kernel.org>
Subject: Problem with legacy megaraid
Date: Sat, 30 Sep 2006 10:06:36 -0500	[thread overview]
Message-ID: <451e87fa.60a54fb4.44c0.4562@mx.gmail.com> (raw)

I am not subscribed to this list.  Please CC me on replies.

I have a machine I'm trying to use as a file server.  I have a RAID10 and a
RAID5 on a single Dell PERC2/DC (AMI Megaraid 467) controller.  Both arrays
are also on the same SCSI channel.  The system runs fine for days on end
until I put some heavy I/O load on either array and sustain it for a few
seconds.

Distro: Gentoo Linux
Kernel: 2.6.17-gentoo-r7

Hardware:
Motherboard: Tyan Thunder i7501 Pro (S2721-533)
CPUs: Dual 2.8Ghz P4 HT Xeons
RAM: 4GB registered (3/1 split, flat model)
RAID: Dell PERC2/DC (AMI Megaraid 467)
SCSI: Adaptec AHA-2940U2/U2W PCI
NICs: onboard e100 and dual onboard e1000

There are 14 hard drives on channel 0 of the perc2/dc.  IDs 1-6 are RAID10
mounted in an external enclosure; IDs 8-15 are RAID5 mounted internally.
The I/O problems occur on either array, which suggests to me that it is not
a cabling issue.

To reproduce the problem for the purpose of this email I ran `bonnie++ -d
/home/nobody -u nobody:nogroup` where /home is the mountpoint of the RAID5
and bonnie++ created an 8GB file.  I have reproduced this problem numerous
times and the sustained I/O time required to create the fault varies.  Once
it happened running a ./configure script for some rather small package.
However, this time it took approximately 20-30 minutes before it freaked
out.

Once the problem occurs I have to reboot the machine to regain use of the
affected array(s).  I should note that the controller bios finds nothing
wrong with the arrays, and when e2fsck is forced on boot it replays the
journal and reports no other problems aside from a "Superblock last write
time is in the future.  FIXED." which I attribute to a misconfiguration on
my part for saving system time to hardware clock.  If I attempt to unmount
the array and mount it again without rebooting I get an error message that
sdXX is not a valid block device.

Logs/info:

active vt says (this is copied via eyeball):
[4513644.094000] ext3_aboart called.
[4513644.101000] EXT3-fs error (device sdb1): ext3_journal_start_sb: <3>sd
0:0:1:0: rejecting I/O to offline device
[4513644.109000] Remounting filesystem read-only

dmesg stuff about megaraid:
[4294675.180000] megaraid: found 0x8086:0x1960:bus 3:slot 3:func 1
[4294675.191000] scsi0:Found MegaRAID controller at 0xf8806000, IRQ:177
[4294675.254000] megaraid: [1.06:1p00] detected 2 logical drives.
[4294675.316000] megaraid: channel[0] is raid.
[4294675.326000] megaraid: channel[1] is raid.
[4294675.362000] scsi0 : LSI Logic MegaRAID 1.06 254 commands 16 targs 5
chans 7 luns
[4294675.372000] scsi0: scanning scsi channel 0 for logical drives.
[4294675.383000]   Vendor: MegaRAID  Model: LD0 RAID1 09634R  Rev: 1.06
[4294675.393000]   Type:   Direct-Access                      ANSI SCSI
revision: 02
[4294675.404000]   Vendor: MegaRAID  Model: LD1 RAID5 38288R  Rev: 1.06
[4294675.415000]   Type:   Direct-Access                      ANSI SCSI
revision: 02
[4294675.428000] scsi0: scanning scsi channel 4 [P0] for physical devices.
[4294675.726000] scsi0: scanning scsi channel 5 [P1] for physical devices.
[4294680.126000] megaraid cmm: 2.20.2.6 (Release Date: Mon Mar 7 00:01:03
EST 2005)
[4294680.136000] megaraid: 2.20.4.8 (Release Date: Mon Apr 11 12:27:22 EST
2006)
[4294680.157000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
MB)
[4294680.167000] sda: Write Protect is off
[4294680.198000] SCSI device sda: 429330432 512-byte hdwr sectors (219817
MB)
[4294680.208000] sda: Write Protect is off
[4294680.237000]  sda: sda1 sda2
[4294680.247000] sd 0:0:0:0: Attached scsi disk sda
[4294680.257000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
MB)
[4294680.267000] sdb: Write Protect is off
[4294680.296000] SCSI device sdb: 2126413824 512-byte hdwr sectors (1088724
MB)
[4294680.305000] sdb: Write Protect is off
[4294680.334000]  sdb: sdb1
[4294680.345000] sd 0:0:1:0: Attached scsi disk sdb
[4294680.355000] sd 0:0:0:0: Attached scsi generic sg0 type 0
[4294680.365000] sd 0:0:1:0: Attached scsi generic sg1 type 0

/var/log/messages:
Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
dev sdb, sector 1744348567
Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
sdb1, logical block 218043563
Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.601000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.601000] end_request: I/O error,
dev sdb, sector 1744348695
Sep 29 07:46:16 hostname kernel: [4513615.601000] Buffer I/O error on device
sdb1, logical block 218043579
Sep 29 07:46:16 hostname kernel: [4513615.601000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:16 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348823
Sep 29 07:46:16 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043595
Sep 29 07:46:16 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:16 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348951
Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043611
Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:44 hostname kernel: [4513615.778000] sd 0:0:1:0: SCSI error:
return code = 0x40001
Sep 29 07:46:44 hostname kernel: [4513615.778000] end_request: I/O error,
dev sdb, sector 1744348959
Sep 29 07:46:44 hostname kernel: [4513615.778000] Buffer I/O error on device
sdb1, logical block 218043612
Sep 29 07:46:44 hostname kernel: [4513615.778000] lost page write due to I/O
error on sdb1
Sep 29 07:46:44 hostname kernel: [4513643.303000] sd 0:0:1:0: rejecting I/O
to offline device
Sep 29 07:46:44 hostname last message repeated 92 times

The last two lines repeat as long as something continues trying to access
that logical drive.

If any other logs/info would be useful please let me know and I will by
happy to include them.  TIA for any help.  Also I apologise if this is not
relevant for the list.

Thanks,
Chris 


             reply	other threads:[~2006-09-30 15:06 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-09-30 15:06 Chris Lee [this message]
2006-09-30 22:54 ` Problem with legacy megaraid Andrew Morton
2006-09-30 22:54   ` Andrew Morton
2006-10-01  5:39   ` Chris Lee
2006-10-01  5:39     ` Chris Lee
2006-10-01  6:02     ` Andrew Morton
2006-10-01  6:02       ` Andrew Morton
2006-10-01  6:44       ` Chris Lee
2006-10-01  6:44         ` Chris Lee
  -- strict thread matches above, loose matches on Subject: below --
2006-10-04 10:21 Chris Lee
2006-10-04 10:21 ` Chris Lee
2006-10-04 13:01 Kolli, Neela
2006-10-04 13:01 ` Kolli, Neela

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=451e87fa.60a54fb4.44c0.4562@mx.gmail.com \
    --to=labmonkey42@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.