public inbox for linux-kernel@vger.kernel.org
* software raid 5 broken on 2.6.14.4 sparc
@ 2005-12-29 18:34 Gerhard Mack
  0 siblings, 0 replies; 3+ messages in thread
From: Gerhard Mack @ 2005-12-29 18:34 UTC
  To: linux-kernel

Hello,

One of the 5 drives in my RAID 5 setup failed, and now many files are
unreadable. Isn't RAID 5 supposed to compensate for exactly this
happening?

Dec 25 16:29:21 localhost kernel: nfsd: last server has exited
Dec 25 16:29:21 localhost kernel: nfsd: unexporting all filesystems
Dec 25 16:29:21 localhost kernel: RPC: failed to contact portmap (errno -5).
Dec 25 16:29:22 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec 25 16:29:22 localhost kernel: NFSD: starting 90-second grace period
Dec 25 16:29:31 localhost kernel: nfsd: last server has exited
Dec 25 16:29:31 localhost kernel: nfsd: unexporting all filesystems
Dec 25 16:29:31 localhost kernel: RPC: failed to contact portmap (errno -5).
Dec 25 16:29:32 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec 25 16:29:32 localhost kernel: NFSD: starting 90-second grace period
Dec 25 16:29:48 localhost kernel: nfsd: last server has exited
Dec 25 16:29:48 localhost kernel: nfsd: unexporting all filesystems
Dec 25 16:29:48 localhost kernel: RPC: failed to contact portmap (errno -5).
Dec 25 16:29:49 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec 25 16:29:49 localhost kernel: NFSD: starting 90-second grace period
Dec 25 16:43:43 localhost kernel: nfsd: last server has exited
Dec 25 16:43:43 localhost kernel: nfsd: unexporting all filesystems
Dec 25 16:43:43 localhost kernel: RPC: failed to contact portmap (errno -5).
Dec 25 16:43:45 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec 25 16:43:45 localhost kernel: NFSD: starting 90-second grace period
Dec 25 16:48:05 localhost kernel: nfsd: last server has exited
Dec 25 16:48:05 localhost kernel: nfsd: unexporting all filesystems
Dec 25 16:48:05 localhost kernel: RPC: failed to contact portmap (errno -5).
Dec 25 16:48:06 localhost kernel: NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Dec 25 16:48:06 localhost kernel: NFSD: starting 90-second grace period
Dec 25 18:17:30 localhost kernel: udp v4 hw csum failure.
Dec 25 18:23:50 localhost kernel: udp v4 hw csum failure.
Dec 25 18:36:27 localhost kernel: sde: Current: sense key: Recovered Error
Dec 25 18:36:27 localhost kernel:     Additional sense: Recovered data - recommend reassignment
Dec 25 18:36:27 localhost kernel: Info fld=0x5ccf6d
Dec 25 19:52:05 localhost kernel: sde: Current: sense key: Recovered Error
Dec 25 19:52:05 localhost kernel:     Additional sense: Recovered data with error corr. & retries applied
Dec 25 19:52:05 localhost kernel: Info fld=0xd77a
Dec 25 21:38:40 localhost kernel: udp v4 hw csum failure.
Dec 25 22:50:34 localhost kernel: udp v4 hw csum failure.
Dec 25 22:53:51 localhost kernel: udp v4 hw csum failure.
Dec 25 23:00:22 localhost kernel: udp v4 hw csum failure.
Dec 25 23:00:24 localhost kernel: udp v4 hw csum failure.
Dec 26 11:25:56 localhost kernel: sdb: Current: sense key: Recovered Error
Dec 26 11:25:56 localhost kernel:     Additional sense: Warning - specified temperature exceeded
Dec 27 14:50:33 localhost kernel: udp v4 hw csum failure.
Dec 27 14:50:33 localhost kernel: UDP: bad checksum. From 204.83.232.135:24638 to 64.235.218.35:57999 ulen 39
Dec 27 21:31:57 localhost kernel: sdb: Current: sense key: Recovered Error
Dec 27 21:31:57 localhost kernel:     Additional sense: Warning - specified temperature exceeded
Dec 28 12:36:06 localhost kernel: hw tcp v4 csum failed
Dec 28 14:33:36 localhost kernel: hw tcp v4 csum failed
Dec 28 16:21:56 localhost kernel: hw tcp v4 csum failed
Dec 28 17:18:22 localhost kernel: hw tcp v4 csum failed
Dec 28 17:18:41 localhost kernel: hw tcp v4 csum failed
Dec 28 17:33:26 localhost kernel: hw tcp v4 csum failed
Dec 28 17:51:38 localhost kernel: hw tcp v4 csum failed
Dec 28 18:03:54 localhost kernel: hw tcp v4 csum failed
Dec 28 18:05:04 localhost kernel: hw tcp v4 csum failed
Dec 28 18:12:34 localhost kernel: hw tcp v4 csum failed
Dec 28 18:13:37 localhost last message repeated 8 times
Dec 28 18:13:51 localhost last message repeated 7 times
Dec 28 18:15:32 localhost kernel: hw tcp v4 csum failed
Dec 28 18:16:55 localhost kernel: sde: Current: sense key: Recovered Error
Dec 28 18:16:55 localhost kernel:     Additional sense: Recovered data - recommend reassignment
Dec 28 18:16:55 localhost kernel: Info fld=0x4455d87
Dec 28 18:19:31 localhost kernel: hw tcp v4 csum failed
Dec 28 18:23:44 localhost kernel: hw tcp v4 csum failed
Dec 28 18:24:04 localhost last message repeated 13 times
Dec 28 18:25:13 localhost kernel: hw tcp v4 csum failed
Dec 28 18:26:57 localhost kernel: hw tcp v4 csum failed
Dec 28 18:27:34 localhost last message repeated 16 times
Dec 28 18:27:42 localhost kernel: printk: 9 messages suppressed.
Dec 28 18:27:42 localhost kernel: hw tcp v4 csum failed
Dec 28 18:28:05 localhost last message repeated 4 times
Dec 28 18:28:30 localhost kernel: hw tcp v4 csum failed
Dec 28 18:34:49 localhost kernel: hw tcp v4 csum failed
Dec 28 18:35:54 localhost last message repeated 17 times
Dec 28 18:36:15 localhost kernel: hw tcp v4 csum failed
Dec 28 19:50:30 localhost kernel: hw tcp v4 csum failed
Dec 29 02:54:51 localhost kernel: hw tcp v4 csum failed
Dec 29 06:26:07 localhost kernel: sd 0:0:12:0: SCSI error: return code = 0x8000002
Dec 29 06:26:07 localhost kernel: sdf: Current: sense key: Hardware Error
Dec 29 06:26:07 localhost kernel:     Additional sense: Mechanical positioning error
Dec 29 06:26:07 localhost kernel: end_request: I/O error, dev sdf, sector 9664
Dec 29 06:26:07 localhost kernel: raid5: Disk failure on sdf, disabling device. Operation continuing on 4 devices
Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
Dec 29 06:26:07 localhost kernel:  disk 3, o:0, dev:sdf
Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
Dec 29 06:26:07 localhost kernel: Buffer I/O error on device md0, logical block 6040
Dec 29 06:26:07 localhost kernel: lost page write due to I/O error on md0
Dec 29 06:26:07 localhost kernel: REISERFS: abort (device md0): Journal write error in flush_commit_list
Dec 29 06:26:07 localhost kernel: REISERFS: Aborting journal for filesystem on md0
Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing inode 2996 on readonly FS
Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing inode 2996 on readonly FS
Dec 29 07:17:28 localhost kernel: Buffer I/O error on device md0, logical block 5052984
Dec 29 07:17:28 localhost kernel: lost page write due to I/O error on md0
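
The summary line in the conf printout above can be decoded with a small sketch like the one below. It assumes that rd, wd and fd stand for raid disks, working disks and failed disks, which is my reading of the md raid5 printout and is not stated anywhere in the log itself. The helper names parse_conf and survives are hypothetical.

```python
import re

# Assumed field meanings (not stated in the log):
# rd = raid disks configured, wd = working disks, fd = failed disks.
CONF_RE = re.compile(r"rd:(\d+) wd:(\d+) fd:(\d+)")


def parse_conf(line):
    """Return (raid_disks, working_disks, failed_disks), or None if absent."""
    match = CONF_RE.search(line)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def survives(failed_disks, tolerated=1):
    """RAID 5 keeps one parity block per stripe, so it tolerates one lost member."""
    return failed_disks <= tolerated


line = "Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2"
rd, wd, fd = parse_conf(line)
print(f"raid disks={rd} working={wd} failed={fd} survivable={survives(fd)}")
```

If that reading of the fields is right, the printout reports two failed members at the moment sdf was disabled, which is more than RAID 5 can tolerate.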



--
Gerhard Mack

gmack@innerfire.net

<>< As a computer I find your faith in technology amusing.


* Re: software raid 5 broken on 2.6.14.4 sparc
       [not found] <5paRd-18Z-47@gated-at.bofh.it>
@ 2005-12-29 18:51 ` Robert Hancock
  2005-12-29 19:11   ` [solved][Re: " Gerhard Mack
  0 siblings, 1 reply; 3+ messages in thread
From: Robert Hancock @ 2005-12-29 18:51 UTC
  To: linux-kernel; +Cc: gmack

Gerhard Mack wrote:
> Hello,
> 
> I had one of 5 drives fail in my raid 5 setup and now many files are 
> unreadable..  Isn't raid 5 supposed to compensate for exactly this 
> happening?

It looks like you have multiple drives with problems here:

Here's sde unhappy:

> Dec 25 18:36:27 localhost kernel: sde: Current: sense key: Recovered Error
> Dec 25 18:36:27 localhost kernel:     Additional sense: Recovered data - recommend reassignment

sdb seems to be overheating:

> Dec 26 11:25:56 localhost kernel: sdb: Current: sense key: Recovered Error
> Dec 26 11:25:56 localhost kernel:     Additional sense: Warning - specified temperature exceeded

Now sdf is unhappy:

> Dec 29 06:26:07 localhost kernel: sdf: Current: sense key: Hardware Error
> Dec 29 06:26:07 localhost kernel:     Additional sense: Mechanical positioning error
> Dec 29 06:26:07 localhost kernel: end_request: I/O error, dev sdf, sector 9664
> Dec 29 06:26:07 localhost kernel: raid5: Disk failure on sdf, disabling device. Operation continuing on 4 devices

> Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
> Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
> Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
> Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
> Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
> Dec 29 06:26:07 localhost kernel:  disk 3, o:0, dev:sdf
> Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
> Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
> Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
> Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
> Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
> Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
> Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
> Dec 29 06:26:07 localhost kernel: Buffer I/O error on device md0, logical block 6040
> Dec 29 06:26:07 localhost kernel: lost page write due to I/O error on md0
> Dec 29 06:26:07 localhost kernel: REISERFS: abort (device md0): Journal write error in flush_commit_list
> Dec 29 06:26:07 localhost kernel: REISERFS: Aborting journal for filesystem on md0
> Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing inode 2996 on readonly FS
> Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing inode 2996 on readonly FS
> Dec 29 07:17:28 localhost kernel: Buffer I/O error on device md0, logical block 5052984
> Dec 29 07:17:28 localhost kernel: lost page write due to I/O error on md0
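
The per-drive reading of the log above can be reproduced with a short script. This is only a sketch: the regular expression is an assumption based on the message format seen in this thread, and sense_keys_by_device is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches lines such as "... kernel: sde: Current: sense key: Recovered Error".
# The pattern is inferred from the log excerpts in this thread.
SENSE_RE = re.compile(r"kernel: (sd[a-z]+): Current: sense key: (.+)$")


def sense_keys_by_device(lines):
    """Group the reported SCSI sense keys by block device name."""
    seen = defaultdict(list)
    for line in lines:
        match = SENSE_RE.search(line)
        if match:
            device, key = match.groups()
            seen[device].append(key.strip())
    return dict(seen)


log = [
    "Dec 25 18:36:27 localhost kernel: sde: Current: sense key: Recovered Error",
    "Dec 26 11:25:56 localhost kernel: sdb: Current: sense key: Recovered Error",
    "Dec 29 06:26:07 localhost kernel: sdf: Current: sense key: Hardware Error",
    "Dec 29 06:26:07 localhost kernel: end_request: I/O error, dev sdf, sector 9664",
]

summary = sense_keys_by_device(log)
for device in sorted(summary):
    print(device, summary[device])
```

Feeding it the full syslog instead of the three sample lines would list every drive that has reported a sense key, which makes it harder to overlook one.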

I don't know if this completely explains the failure, but it seems you 
have bigger problems than one bad drive, and RAID5 cannot handle 
multiple drive failures.
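
To illustrate why a second failure is fatal, here is a minimal sketch of XOR parity on a single stripe. It is a simplification: the block contents are made up, and a real md array works on larger chunks and rotates the parity block across members.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 5-member array: four data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

# One member lost: XOR of the four survivors gives back the missing block.
lost = 2
survivors = [blk for i, blk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
print("single failure recovered:", rebuilt == stripe[lost])

# Two members lost: the survivors only yield the XOR of the two missing
# blocks, which is not enough to tell them apart.
lost_pair = (1, 3)
survivors = [blk for i, blk in enumerate(stripe) if i not in lost_pair]
combined = xor_blocks(survivors)
print("double failure leaves only:", combined)
```

With two blocks missing there is one equation and two unknowns, so neither block can be reconstructed from the stripe alone.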

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/



* [solved][Re: software raid 5 broken on 2.6.14.4 sparc
  2005-12-29 18:51 ` software raid 5 broken on 2.6.14.4 sparc Robert Hancock
@ 2005-12-29 19:11   ` Gerhard Mack
  0 siblings, 0 replies; 3+ messages in thread
From: Gerhard Mack @ 2005-12-29 19:11 UTC
  To: Robert Hancock; +Cc: linux-kernel

On Thu, 29 Dec 2005, Robert Hancock wrote:

> Date: Thu, 29 Dec 2005 12:51:12 -0600
> From: Robert Hancock <hancockr@shaw.ca>
> To: linux-kernel <linux-kernel@vger.kernel.org>
> Cc: gmack@innerfire.net
> Subject: Re: software raid 5 broken on 2.6.14.4 sparc
> 
> Gerhard Mack wrote:
> > Hello,
> > 
> > I had one of 5 drives fail in my raid 5 setup and now many files are
> > unreadable..  Isn't raid 5 supposed to compensate for exactly this
> > happening?
> 
> It looks like you have multiple drives with problems here:
> 
> Here's sde unhappy:
> 
> > Dec 25 18:36:27 localhost kernel: sde: Current: sense key: Recovered Error
> > Dec 25 18:36:27 localhost kernel:     Additional sense: Recovered data -
> > recommend reassignment
> 
> sdb seems to be overheating:
> 
> > Dec 26 11:25:56 localhost kernel: sdb: Current: sense key: Recovered Error
> > Dec 26 11:25:56 localhost kernel:     Additional sense: Warning - specified
> > temperature exceeded
> 
> Now sdf is unhappy..
> 
> > Dec 29 06:26:07 localhost kernel: sdf: Current: sense key: Hardware Error
> > Dec 29 06:26:07 localhost kernel:     Additional sense: Mechanical
> > positioning error
> > Dec 29 06:26:07 localhost kernel: end_request: I/O error, dev sdf, sector
> > 9664
> > Dec 29 06:26:07 localhost kernel: raid5: Disk failure on sdf, disabling
> > device. Operation continuing on 4 devices
> 
> > Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
> > Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
> > Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
> > Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
> > Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
> > Dec 29 06:26:07 localhost kernel:  disk 3, o:0, dev:sdf
> > Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
> > Dec 29 06:26:07 localhost kernel: RAID5 conf printout:
> > Dec 29 06:26:07 localhost kernel:  --- rd:6 wd:4 fd:2
> > Dec 29 06:26:07 localhost kernel:  disk 0, o:1, dev:sdc
> > Dec 29 06:26:07 localhost kernel:  disk 1, o:1, dev:sdd
> > Dec 29 06:26:07 localhost kernel:  disk 2, o:1, dev:sde
> > Dec 29 06:26:07 localhost kernel:  disk 4, o:1, dev:sdg
> > Dec 29 06:26:07 localhost kernel: Buffer I/O error on device md0, logical
> > block 6040
> > Dec 29 06:26:07 localhost kernel: lost page write due to I/O error on md0
> > Dec 29 06:26:07 localhost kernel: REISERFS: abort (device md0): Journal
> > write error in flush_commit_list
> > Dec 29 06:26:07 localhost kernel: REISERFS: Aborting journal for filesystem
> > on md0
> > Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing
> > inode 2996 on readonly FS
> > Dec 29 07:16:56 localhost kernel: ReiserFS: md0: warning: clm-6006: writing
> > inode 2996 on readonly FS
> > Dec 29 07:17:28 localhost kernel: Buffer I/O error on device md0, logical
> > block 5052984
> > Dec 29 07:17:28 localhost kernel: lost page write due to I/O error on md0
> 
> I don't know if this completely explains the failure, but it seems you have
> bigger problems than one bad drive, and RAID5 cannot handle multiple drive
> failures.
> 
Aha. I knew about sdb overheating (it's not part of the array), but I
missed sde. That one explains it.

	Thanks, 
	Gerhard


--
Gerhard Mack

gmack@innerfire.net

<>< As a computer I find your faith in technology amusing.

