megaraid Error 40005 on cluster

public inbox for linux-scsi@vger.kernel.org

* megaraid Error 40005 on cluster
From: pg @ 2005-05-19 14:11 UTC
  To: linux-scsi

I experienced the following error on a Red Hat cluster configuration with Dell hardware (PERC 3/DC controller and PowerVault 220 disk array).
When the error occurs, the cluster manager shuts down the cluster node, but the filesystem is corrupted and the other node cannot mount it until a manual fsck is run.

Any idea?


SCSI and system configuration
--------------------------------

Red Hat AS 2.1 + Cluster Manager
DELL PowerEdge 2650 with PERC 3/DC
DELL PowerVault 220s - cluster configuration


# uname -a
Linux myHost 2.4.9-e.40smp #1 SMP Thu Apr 8 16:53:29 EDT 2004 i686 unknown

# cat /etc/modules.conf
options scsi_mod max_scsi_luns=255 
alias scsi_hostadapter aacraid
alias scsi_hostadapter1 megaraid_2009
....

# cat /proc/scsi/scsi 
Attached devices: 
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: DELL     Model: PERCRAID Mirror  Rev: V1.0
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: MegaRAID Model: LD 0 RAID1   34G Rev: 1.92
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 00 Id: 01 Lun: 00
  Vendor: MegaRAID Model: LD 1 RAID5   69G Rev: 1.92
  Type:   Direct-Access                    ANSI SCSI revision: 02
Host: scsi1 Channel: 04 Id: 07 Lun: 00
  Vendor: DELL     Model: PERC 3/DC        Rev: 1.92
  Type:   Processor                        ANSI SCSI revision: 02
Host: scsi1 Channel: 04 Id: 15 Lun: 00
  Vendor: DELL     Model: PV22XS           Rev: E.14
  Type:   Processor                        ANSI SCSI revision: 03
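
A note for reading the error report further down: the kernel messages there identify block devices as hexadecimal major:minor pairs ("dev 08:21"), while the ext3 messages use decimal ("sd(8,34)"). The Python sketch below maps such pairs to conventional sd names. It assumes the standard sd numbering (major 8, 16 minors per disk) and that the disks were named in the detection order listed above, so that sda would be the PERCRAID mirror on scsi0 and sdb/sdc the two MegaRAID logical drives on scsi1. The function name is made up for illustration.

```python
import string

SD_MAJOR = 8            # major number of the first 16 SCSI disks
MINORS_PER_DISK = 16    # sd reserves 16 minors per disk (whole disk + 15 partitions)


def decode_sd_dev(dev: str) -> str:
    """Turn a hex 'major:minor' string such as '08:21' into an sd name.

    Hypothetical helper: only handles major 8 (sda through sdp).
    """
    major_s, minor_s = dev.split(":")
    major, minor = int(major_s, 16), int(minor_s, 16)
    if major != SD_MAJOR:
        raise ValueError(f"not an sd device on major {SD_MAJOR}: {dev}")
    disk, part = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk]
    # Partition 0 means the whole disk.
    return name if part == 0 else f"{name}{part}"


if __name__ == "__main__":
    for dev in ("08:11", "08:21", "08:22"):
        print(dev, "->", decode_sd_dev(dev))
```

Under those assumptions, 08:21 and 08:22 would be partitions on sdc (LD 1, the RAID-5 volume) and 08:11 a partition on sdb (LD 0, the RAID-1 volume), which agrees with the id 1 and id 0 values in the corresponding error lines.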

# cat /proc/scsi/megaraid/1 
LSI Logic MegaRAID 1.92 254 commands 16 targs 5 chans 7 luns

# cat /proc/megaraid/hba1/config 
v2.00.9 (Release Date: Thu Sep  4 17:49:42 EDT 2003)
PERC 3/DC
Controller Type: 438/466/467/471/493/518/520/531/532
Controller Supports 40 Logical Drives
Controller capable of 64-bit memory addressing
Controller using 64-bit memory addressing
Base = f9030000, Irq = 16, Logical Drives = 2, Channels = 2
Version =1.92:3.31, DRAM = 128Mb
Controller Queue Depth = 254, Driver Queue Depth = 126
support_ext_cdb    = 1
support_random_del = 1
boot_ldrv_enabled  = 1
boot_ldrv          = 0
boot_pdrv_enabled  = 0
boot_pdrv_ch       = 0
boot_pdrv_tgt      = 0
quiescent          = 0
has_cluster        = 1

Module Parameters:
max_cmd_per_lun    = 63
max_sectors_per_io = 128

# cat /proc/megaraid/hba1/diskdrives-ch0 
Channel: 0 Id: 0 State: Online.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 1 State: Online.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 2 State: Hot spare.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 3 State: Online.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 4 State: Online.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03
Channel: 0 Id: 5 State: Online.
  Vendor: FUJITSU   Model: MAP3367NC         Rev: 5605
  Type:   Direct-Access                      ANSI SCSI revision: 03

# cat /proc/megaraid/hba1/raiddrives-0-9
Logical drive: 0:, state: optimal
Span depth:  1, RAID level:  1, Stripe size: 64, Row size:  2
Read Policy: No read ahead, Write Policy: Write thru, Cache Policy: Direct IO

Logical drive: 1:, state: optimal
Span depth:  1, RAID level:  5, Stripe size: 64, Row size:  3
Read Policy: No read ahead, Write Policy: Write thru, Cache Policy: Direct IO

Error report
------------------------------------

May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 2290456
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 2290464
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 11488
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 11496
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 11504
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 528528
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 2283712
May 13 05:31:27 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:27 clu2a kernel:  I/O error: dev 08:21, sector 2283720
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2283728
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2283736
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2281160
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2281168
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2281176
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:21, sector 2281184
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:22, sector 1052480
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:28 clu2a kernel:  I/O error: dev 08:22, sector 2363240
May 13 05:31:28 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 266776
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 266776
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 3673920
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 3670072
May 13 05:31:29 clu2a kernel: EXT3-fs error (device sd(8,34)): ext3_get_inode_loc: unable to read inode block - inode=216936, block=458759
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 0
May 13 05:31:29 clu2a kernel: EXT3-fs error (device sd(8,34)) in ext3_reserve_inode_write: IO failure
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 0
May 13 05:31:29 clu2a kernel: EXT3-fs error (device sd(8,34)) in ext3_new_inode: IO failure
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:22, sector 0
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:11, sector 8
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 0 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:11, sector 8
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:29 clu2a kernel:  I/O error: dev 08:21, sector 3670080
May 13 05:31:29 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:30 clu2a kernel:  I/O error: dev 08:22, sector 2359352
May 13 05:31:30 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:30 clu2a kernel:  I/O error: dev 08:21, sector 3670064
May 13 05:31:30 clu2a kernel: SCSI disk error : host 1 channel 0 id 1 lun 0 return code = 40005
May 13 05:31:30 clu2a kernel:  I/O error: dev 08:21, sector 3674152
[...]
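
For anyone decoding the messages above: the 2.4 SCSI layer packs four one-byte fields into the result word (status, message, host and driver bytes, from least to most significant). Below is a minimal Python sketch of splitting the value, assuming the return code is printed in hexadecimal. The function name is hypothetical, and the host-byte table lists the DID_* codes as defined in the kernel's scsi.h.

```python
# Host byte values (DID_* codes) from the Linux SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_result(code_hex: str) -> dict:
    """Split a SCSI result word, given as a hex string, into its four bytes.

    Hypothetical helper; assumes the log prints the value in hexadecimal.
    """
    result = int(code_hex, 16)
    host = (result >> 16) & 0xFF
    return {
        "status": result & 0xFF,          # low byte: status as set by the driver
        "msg": (result >> 8) & 0xFF,      # message byte
        "host": host,                     # host byte (DID_* code)
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
        "driver": (result >> 24) & 0xFF,  # driver byte
    }


if __name__ == "__main__":
    print(decode_result("40005"))
```

For 40005 this gives a host byte of 0x04 (DID_BAD_TARGET) and a low byte of 0x05. The low byte is not one of the usual SCSI status values, so it may be a driver- or firmware-specific code; the sketch does not attempt to interpret it.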



* Re: megaraid Error 40005 on cluster
From: Matt Domsch @ 2005-05-19 14:23 UTC
  To: pg@atlavia.it; +Cc: linux-scsi

On Thu, May 19, 2005 at 02:11:43PM +0000, pg@atlavia.it wrote:
> I experienced the following error on a Red Hat cluster configuration
> with Dell hardware (PERC 3/DC controller and PowerVault 220 disk
> array).  When the error occurs, the cluster manager shuts down the
> cluster node, but the filesystem is corrupted and the other node
> cannot mount it until a manual fsck is run.
> 
> Any idea?

The firmware on the PERC 3/DC is not multi-initiator cluster-capable
under Linux.  For this reason, neither Dell nor Red Hat recommends
trying to create an HA shared storage cluster with this configuration.
Even with write cache disabled, the lock sectors used by the cluster
manager don't manage to stay coherent.

I understand that the newest versions of the Red Hat Cluster Suite may
no longer use lock sectors on the disk as the I/O fencing mechanism,
which may then enable such configurations.  But neither Dell nor Red
Hat has done any testing with the hardware config you've got.

The price is right, until you actually need your data to be highly
available and it crashes.

Thanks,
Matt

-- 
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


* Re: megaraid Error 40005 on cluster
From: Pg @ 2005-05-19 19:39 UTC
  To: linux-scsi

Matt Domsch wrote:

> On Thu, May 19, 2005 at 02:11:43PM +0000, pg@atlavia.it wrote:
>> I experienced the following error on a Red Hat cluster configuration
>> with Dell hardware (PERC 3/DC controller and PowerVault 220 disk
>> array).  When the error occurs, the cluster manager shuts down the
>> cluster node, but the filesystem is corrupted and the other node
>> cannot mount it until a manual fsck is run.
>>
>> Any idea?
>
> The firmware on the PERC 3/DC is not multi-initiator cluster-capable
> under Linux.  For this reason, neither Dell nor Red Hat recommends
> trying to create an HA shared storage cluster with this configuration.
> Even with write cache disabled, the lock sectors used by the cluster
> manager don't manage to stay coherent.
>
> I understand that the newest versions of the Red Hat Cluster Suite may
> no longer use lock sectors on the disk as the I/O fencing mechanism,
> which may then enable such configurations.  But neither Dell nor Red
> Hat has done any testing with the hardware config you've got.
>
> The price is right, until you actually need your data to be highly
> available and it crashes.
>
> Thanks,
> Matt

As I'm not an expert in HA clusters, I went with a hardware and software
solution "suggested" by Dell, and this configuration has been working
quite well for a couple of years. Maybe I've just been lucky.

As recommended, I don't mount the same filesystem on both nodes of the
cluster at the same time; every service has its own filesystem, and each
service is active on a single node.
The only shared partition is the quorum, which is on a RAID-1 volume.
The other partitions are on a single RAID-5 volume: do you think that
making a separate volume for each partition could help?

Thanks