linux-raid.vger.kernel.org archive mirror
* Please Confirm I'm Solving Problem Correctly!
@ 2004-03-04  3:37 AndyLiebman
From: AndyLiebman @ 2004-03-04  3:37 UTC (permalink / raw)
  To: linux-raid

I have ten disks in two Linux RAID 5 arrays with video and audio files stored 
on them for video editing.  A couple of hours ago I was viewing a 90-minute
video sequence (on a Windows machine that's networked to my Linux box) and all 
of a sudden, about 45 minutes into viewing, an image got corrupted; there were 
strange blocks of color in random places in the image and my video editing 
program came to a halt. When I moved backwards in the editing program timeline 
the images were okay for a while (I'm assuming because good data was still in 
the sizable RAM on the Linux box). But when I went back to the beginning of the 
video sequence and forward again, ALL of the data was corrupted with these 
random blocks of color. 

I immediately shut down the Windows machine and checked the RAID status on my 
Linux box. When I ran 'cat /proc/mdstat' it showed that all drives on both 
arrays were ACTIVE and that none were missing. 

I rebooted the Linux computer (maybe a mistake?), and now one of my arrays 
won't start -- the one that had the material I was looking at and that got 
corrupted:

[root@localhost avidserver]# mdadm -Av /dev/md6 --uuid=57f26496:25520b96:41757b62:f83fcb7b /dev/sd*
mdadm: looking for devices for /dev/md6
mdadm: /dev/sd is not a block device.
mdadm: /dev/sd has wrong uuid.
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sda1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdb
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sdb1 is identified as a member of /dev/md6, slot 4.
mdadm: no RAID superblock on /dev/sdc
mdadm: /dev/sdc has wrong uuid.
mdadm: no RAID superblock on /dev/sdc1
mdadm: /dev/sdc1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdd
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdd1 has wrong uuid.
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sde has wrong uuid.
mdadm: /dev/sde1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdf
mdadm: /dev/sdf has wrong uuid.
mdadm: /dev/sdf1 is identified as a member of /dev/md6, slot 0.
mdadm: no RAID superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdh1 is identified as a member of /dev/md6, slot 1.
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has wrong uuid.
mdadm: /dev/sdi1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdj
mdadm: /dev/sdj has wrong uuid.
mdadm: /dev/sdj1 is identified as a member of /dev/md6, slot 2.
mdadm: added /dev/sdh1 to /dev/md6 as 1
mdadm: added /dev/sdj1 to /dev/md6 as 2
mdadm: no uptodate device for slot 3 of /dev/md6
mdadm: added /dev/sdb1 to /dev/md6 as 4
mdadm: added /dev/sdf1 to /dev/md6 as 0
mdadm: /dev/md6 assembled from 4 drives - need all 5 to start it (use --run 
to insist)

I ran 'mdadm -E /dev/sdXX' on each of the drives on my system to confirm which 
ones belonged to which array and which was the problem drive. I found one 
drive, sdc1, which didn't have a superblock; by process of elimination it 
belonged to md6. 

mdadm -E /dev/sdc1
mdadm: No super block found on /dev/sdc1 (Expected magic a92b4efc, got 
00000000)
[root@localhost avidserver]#
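
In case it's useful, the per-drive check was roughly this loop (the grep 
pattern is just the set of fields I found convenient to look at): 

# examine every partition and pull out the identifying fields
for dev in /dev/sd[a-j]1; do
    echo "== $dev =="
    mdadm -E $dev | grep -E 'UUID|Raid Level|this'
done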

TO CORRECT THIS PROBLEM I am assuming that I have to do the following: 
mdadm /dev/md6 --fail /dev/sdc1
mdadm /dev/md6 --remove /dev/sdc1
mdadm /dev/md6 --add /dev/sdc1
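
After the --add, I assume I can watch the resync with: 

# same status check as before -- shows whether the reconstruction is running
cat /proc/mdstat
# more detailed per-array report
mdadm --detail /dev/md6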

Would it be advisable to do steps 1 and 2 first (fail/remove), then assemble 
the array and confirm that all the data is intact before adding the failed 
drive back in? SHOULD my data be intact without the 5th drive? I know that's 
the point of RAID 5 -- but then why was I seeing corrupt data all of a sudden 
when just seconds before it had been fine throughout the array? Even if one 
drive went bad, shouldn't data integrity have been maintained? Or do I have to 
get the bad drive out of the picture first? 
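
Concretely, the "verify first" sequence I have in mind would be something like 
this (the mount point is just an example; the filesystem is XFS, per the 
kernel log further down): 

# start the array degraded on 4 of 5 drives, as mdadm itself suggests
mdadm -Av --run /dev/md6 --uuid=57f26496:25520b96:41757b62:f83fcb7b /dev/sd*

# mount read-only and spot-check some of the video files before changing anything
mkdir -p /mnt/check && mount -t xfs -o ro /dev/md6 /mnt/check

# only if the data looks good, go back and do the fail/remove/add steps above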

If I do assemble the array with one drive failed, and the data is still 
corrupted, what should I do then? 

Finally, I have a couple of questions about why this occurred: 

1)  Why didn't sdc1 just fail? Why was it still in the array after the 
corruption occurred, and why, after rebooting, is it still not marked 
automatically as failed (which would have allowed the array to restart with 4 
drives)? 

2)  Could the cause of the problem be related to these messages I found in my 
kernel logs: 


Mar  3 20:41:43 localhost kernel: RAID5 conf printout:
Mar  3 20:41:43 localhost kernel:  --- rd:5 wd:5 fd:0
Mar  3 20:41:43 localhost kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1 
dev:scsi/host3/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1 
dev:scsi/host4/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 2, s:0, o:1, n:2 rd:2 us:1 
dev:scsi/host1/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1 
dev:scsi/host5/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1 
dev:scsi/host2/bus0/target1/lun0/part1
Mar  3 20:41:44 localhost kernel: XFS: SB read failed


I appreciate your help. The data on this array is very important to me. It 
will take me two weeks to restore it if that becomes necessary. 

Regards, 
Andy Liebman


P.S.  I am thinking about writing a program that would "almost automatically" 
run mdadm -E /dev/sdXX on all drives, figure out which device IDs go together 
by uuid, and determine by process of elimination which drive needs to be 
failed/removed and/or re-added -- and then just do what needs to be done, 
perhaps running a diagnosis on the removed drive to determine whether it needs 
to be replaced, warning the user if that's the case or adding the drive back 
in if it's not a problem. Is there already such a program around? (I don't 
want to reinvent the wheel.) Would such a "semi-automatic recovery program" be 
useful, or would it be more dangerous than anything else? I want to find a way 
to make it possible for folks to use Linux software RAID without having to 
know all the deep innards of how it works. 
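
To make the idea concrete, here is the kind of rough first pass I have in mind 
(purely a sketch -- nothing is failed, removed or added automatically, and the 
output format is just a placeholder): 

#!/bin/sh
# Sketch only: group partitions by array UUID so the odd one out stands out.
for dev in /dev/sd?1; do
    uuid=`mdadm -E $dev 2>/dev/null | grep 'UUID' | awk '{print $3}'`
    if [ -z "$uuid" ]; then
        echo "$dev: no RAID superblock found -- candidate for checking and re-adding"
    else
        echo "$dev: member of the array with UUID $uuid"
    fi
done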


* Re: Please Confirm I'm Solving Problem Correctly!
@ 2004-03-04 13:36 AndyLiebman
From: AndyLiebman @ 2004-03-04 13:36 UTC (permalink / raw)
  To: linux-raid, bugzilla

>In a message dated 3/3/2004 11:11:54 PM Eastern Standard Time, 
>bugzilla@watkins-home.com writes:

>One failed drive should NOT cause data corruption!  The array should
>continue just fine.  In most cases you may not know a disk has failed,
>unless you notice a decrease in performance.
>
>I am not an expert, so wait for someone like Neil Brown to respond before
>you take any chances with your data.
>
>Your "fail, remove and add" is correct, but only after you assemble the
>array.  I would verify the data first, before doing the fail.
>
>Corruption?  I have 2 guesses.
>1. a bug in the RAID software.
>2. somehow the bad disk was returning bad data instead of giving a read
>error.  The RAID software did not detect a read error, so it did not fail
>the drive.
>
>Guy

I was looking back at some old comments from Neil Brown regarding another 
problem I once had.  I'm confused about whether I need to start the array with 
--run before I "fail/remove" the bad drive. 

Right now, if I try to assemble the array with 

'mdadm -Av /dev/md6 --uuid=[uuid string] /dev/sd*' 

I get back a message: 
mdadm: /dev/md6 assembled from 4 drives - need all 5 to start it (use --run 
to insist)

If I 'cat /proc/mdstat' the array shows up but is "inactive". Which I guess 
means it isn't "started". But it seems to be "assembled". 

In the past, Neil told me to do just what it says, use "--run".  Meaning, 
'mdadm -Av --run /dev/md6 --uuid=[uuid string] /dev/sd*' 

Does the array in fact need to be "started" (with --run, if necessary) before 
I can "fail" and "remove" the faulty drive? I guess that's my question. I 
thought it only needed to be "assembled". 
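
To be concrete about what I'm looking at, this is how I've been checking the 
state (the --detail call is just my guess at the right way to confirm whether 
the array is actually running): 

# shows md6 and whether it is listed as active or inactive
cat /proc/mdstat
# per-array report; I assume its State line is what tells me md6 is started
mdadm --detail /dev/md6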

Hope I can get a good answer to this soon! Thanks. 

Andy Liebman

