Raid5 software problems after loosing 4 disks for 48 hours


* Raid5 software problems after loosing 4 disks for 48 hours
@ 2006-06-15 23:59 Wilson Wilson
  2006-06-16  1:21 ` Neil Brown
  0 siblings, 1 reply; 4+ messages in thread
From: Wilson Wilson @ 2006-06-15 23:59 UTC (permalink / raw)
  To: linux-raid

I run a 15-disk raid5 array. Without my noticing, a controller card
with 4 disks on it failed. Unfortunately the raid5 stayed online,
doing its best.

Now, after replacing the controller, the raid cannot be brought
online. I would rather have corruption of anything changed on the array
in the last 2 days than lose 4 disks.

Quote:
# mdadm --examine /dev/sdc1
/dev/sdc1:
Magic : a92b4efc
Version : 00.90.01
UUID : eed326c4:6cf58a43:d1b57676:5b765f6c
Creation Time : Sun Nov 13 16:07:21 2005
Raid Level : raid5
Raid Devices : 15
Total Devices : 15
Preferred Minor : 0

Update Time : Mon Jun 12 11:08:09 2006
State : clean
Active Devices : 11
Working Devices : 11
Failed Devices : 8
Spare Devices : 0
Checksum : c967848b - correct
Events : 0.5456277

Layout : left-symmetric
Chunk Size : 32K

      Number   Major   Minor   RaidDevice   State
this     1       8       33        1        active sync   /dev/sdc1

   0     0       8       17        0        active sync   /dev/sdb1
   1     1       8       33        1        active sync   /dev/sdc1
   2     2       8       49        2        active sync   /dev/sdd1
   3     3       0        0        3        faulty removed
   4     4       0        0        4        faulty removed
   5     5       0        0        5        faulty removed
   6     6       0        0        6        faulty removed
   7     7       8      129        7        active sync   /dev/sdi1
   8     8       8      145        8        active sync   /dev/sdj1
   9     9       8      161        9        active sync   /dev/sdk1
  10    10       8      177       10        active sync   /dev/sdl1
  11    11       8      193       11        active sync   /dev/sdm1
  12    12       8      209       12        active sync   /dev/sdn1
  13    13       8      225       13        active sync   /dev/sdo1
  14    14       8      241       14        active sync   /dev/sdp1

Also, is there a way to have the whole array taken offline if more
than 1 disk goes offline? My understanding of raid5 is that if you lose
more than one disk, nothing on the raid is readable. That is not the
case here.

All the disks are online now, what do I need to do to rebuild the array?


* Re: Raid5 software problems after loosing 4 disks for 48 hours
  2006-06-15 23:59 Raid5 software problems after loosing 4 disks for 48 hours Wilson Wilson
@ 2006-06-16  1:21 ` Neil Brown
  2006-06-17 12:24   ` Wilson Wilson
  0 siblings, 1 reply; 4+ messages in thread
From: Neil Brown @ 2006-06-16  1:21 UTC (permalink / raw)
  To: Wilson Wilson; +Cc: linux-raid

On Friday June 16, wilson150@gmail.com wrote:
> 
> Also, is there a way to have the whole array taken offline if more
> than 1 disk goes offline? My understanding of raid5 is that if you lose
> more than one disk, nothing on the raid is readable. That is not the
> case here.
> 

Nothing will be writable, but some blocks might be readable.

> All the disks are online now, what do I need to do to rebuild the array?

Have you tried
  mdadm --assemble --force /dev/md0 /dev/sd[bcdefghijklmnop]1
??
Actually, it occurs to me that that might not do the best thing if 4
drives disappeared at exactly the same time (though it is unlikely
that you would notice)
You should probably use
 mdadm --create /dev/md0 -f -l5 -n15 -c32  /dev/sd[bcdefghijklmnop]1
This is assuming that  e,f,g,h were in that order in the array before
they died.
The '-f' is quite important - it tells mdadm not to recover onto a
spare, but to resync the parity blocks instead.
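
The ordering assumption above can be checked, at least for the devices that
still appear in the superblock. The following is only a sketch: it assumes the
shell expands /dev/sd[bcdefghijklmnop]1 in sorted order and that all 15
partitions exist, and the helper names (known_slots, glob_order, mismatches)
are made up for illustration. It says nothing about the true order of
e,f,g,h, which the superblock no longer records.

```python
import re

# Device rows copied from the `mdadm --examine /dev/sdc1` output in this thread.
EXAMINE_TABLE = """\
0 0 8 17 0 active sync /dev/sdb1
1 1 8 33 1 active sync /dev/sdc1
2 2 8 49 2 active sync /dev/sdd1
3 3 0 0 3 faulty removed
4 4 0 0 4 faulty removed
5 5 0 0 5 faulty removed
6 6 0 0 6 faulty removed
7 7 8 129 7 active sync /dev/sdi1
8 8 8 145 8 active sync /dev/sdj1
9 9 8 161 9 active sync /dev/sdk1
10 10 8 177 10 active sync /dev/sdl1
11 11 8 193 11 active sync /dev/sdm1
12 12 8 209 12 active sync /dev/sdn1
13 13 8 225 13 active sync /dev/sdo1
14 14 8 241 14 active sync /dev/sdp1
"""

# Four leading integers, then the RaidDevice slot, then a state ending in a device path.
ROW = re.compile(r"\s*\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+.*?(/dev/\S+)\s*$")


def known_slots(table):
    """Map RaidDevice slot -> device path for rows that still name a device."""
    slots = {}
    for line in table.splitlines():
        match = ROW.match(line)
        if match:
            slots[int(match.group(1))] = match.group(2)
    return slots


def glob_order(letters):
    """Device list in the order the glob is assumed to expand (sorted)."""
    return ["/dev/sd%s1" % c for c in sorted(letters)]


def mismatches(slots, ordered):
    """Slots whose recorded device differs from its position in the list."""
    return {s: d for s, d in slots.items()
            if s >= len(ordered) or ordered[s] != d}


slots = known_slots(EXAMINE_TABLE)
ordered = glob_order("bcdefghijklmnop")
unknown = [s for s in range(len(ordered)) if s not in slots]

print("mismatched slots:", mismatches(slots, ordered))
print("slots the superblock cannot confirm:", unknown)
```

An empty mismatch set only means the 11 surviving devices would land in the
slots they held before; the unconfirmed slots are exactly the ones the
e,f,g,h assumption covers.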

NeilBrown


* Re: Raid5 software problems after loosing 4 disks for 48 hours
  2006-06-16  1:21 ` Neil Brown
@ 2006-06-17 12:24   ` Wilson Wilson
  2006-06-17 21:42     ` David Greaves
  0 siblings, 1 reply; 4+ messages in thread
From: Wilson Wilson @ 2006-06-17 12:24 UTC (permalink / raw)
  To: linux-raid

Neil, great stuff, it's online now!!!

I followed your 2nd suggestion and ran mdadm --create /dev/md0 -f -l5
-n15 -c32  /dev/sd[bcdefghijklmnop]1 . After 8 hours the resync reached
99.9% and some errors appeared on sdh1, which was then kicked from the
array, but the array itself was fully online.

I'll see if any further errors are reported on sdh, but in the meantime
I hot-added it back into the array, which was successful.

To my surprise a full fsck reported a clean volume.

I am still unsure how this raid5 volume was partially readable with 4
disks missing.  My understanding is that each file is written across
all disks apart from one, which is used for CRC.  So if 2 disks are
offline the whole thing should be unreadable.

Once again thanks for your help




* Re: Raid5 software problems after loosing 4 disks for 48 hours
  2006-06-17 12:24   ` Wilson Wilson
@ 2006-06-17 21:42     ` David Greaves
  0 siblings, 0 replies; 4+ messages in thread
From: David Greaves @ 2006-06-17 21:42 UTC (permalink / raw)
  To: Wilson Wilson; +Cc: linux-raid

Wilson Wilson wrote:
> Neil, great stuff, it's online now!!!
Congratulations :)
>
> I am still unsure how this raid5 volume was partially readable with 4
> disks missing.  My understanding is that each file is written across
> all disks apart from one, which is used for CRC.  So if 2 disks are
> offline the whole thing should be unreadable.
I'll try :)

md doesn't operate at a file level; it operates on chunks. A chunk
could be 64KB in size (yours are 32K, going by the Chunk Size in your
--examine output).

For raid5 each stripe is made of n-1 data chunks plus one parity chunk
(raid6 would be n-2 data chunks plus two).
When a stripe is read, if your file is in one of the chunks that's still
there then you're in luck.
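
To make the parity point concrete, here is a toy sketch with four 4-byte
"chunks" instead of fourteen 32K ones. It is not md's code, just the XOR
arithmetic that raid5 parity relies on, and xor_chunks is a name invented for
the example.

```python
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Four tiny data chunks and the parity chunk that raid5 would store with them.
data = [bytes([value] * 4) for value in (1, 2, 3, 4)]
parity = xor_chunks(data)

# One chunk lost (data[2]): XOR of the survivors and the parity gives it back.
rebuilt = xor_chunks([data[0], data[1], data[3], parity])
print(rebuilt == data[2])

# Two chunks lost (data[1] and data[2]): the same XOR only yields
# data[1] ^ data[2], which cannot be split back into the two originals.
mixed = xor_chunks([data[0], data[3], parity])
print(mixed == xor_chunks([data[1], data[2]]))
```

So with more than one disk gone nothing can be reconstructed, but a chunk that
sits on a surviving disk needs no reconstruction at all and can be read as is.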

I guess md knows it's degraded and gives as much data back as possible.

This means that you have a certain probability of accessing a given file
depending on its size, the filesystem and the degree to which the array
is degraded.
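
A rough way to put numbers on that, as a sketch only. It assumes the
left-symmetric layout and 15 devices from the --examine output, slots 3-6
missing, and a file stored as one contiguous, chunk-aligned run of data chunks
on the md device. Real filesystems fragment files and also need their own
metadata, so treat the result as an upper bound at best. The function names
are invented for the example.

```python
from fractions import Fraction

N_DISKS = 15            # Raid Devices from the --examine output
MISSING = {3, 4, 5, 6}  # RaidDevice slots shown as "faulty removed"


def data_disk(chunk, n=N_DISKS):
    """Disk holding logical data chunk `chunk` under the left-symmetric layout."""
    stripe, offset = divmod(chunk, n - 1)
    parity_disk = (n - 1) - (stripe % n)   # parity rotates backwards each stripe
    return (parity_disk + 1 + offset) % n  # data starts just after the parity


def run_readable(first, length):
    """True if every chunk in the run sits on a surviving disk."""
    return all(data_disk(c) not in MISSING for c in range(first, first + length))


def readable_fraction(length, n=N_DISKS):
    """Fraction of start positions from which `length` chunks are all readable."""
    period = n * (n - 1)  # the layout repeats after n stripes
    good = sum(run_readable(start, length) for start in range(period))
    return Fraction(good, period)


for length in (1, 2, 4, 8, 11, 12):
    print(length, "chunk(s):", readable_fraction(length))
```

Under these assumptions a single-chunk file is readable 11 times out of 15,
and any contiguous run of 12 or more chunks (more than 352K at a 32K chunk
size) always touches a missing disk, because left-symmetric places consecutive
data chunks on consecutive disks.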

FWIW I'd *never* try a r/w operation on such a degraded array.

Speculation:
I'm surprised you could mount such a 'sparse' array though. I wonder if
some filesystems (like xfs) would just barf as they mounted because they
have more distributed mount-time data structures and would spot the
missing chunks.  Others (ext3?) may just mount and try to read blocks on
demand.

David

