From: AndyLiebman@aol.com
To: linux-raid@vger.kernel.org
Subject: Please Confirm I'm Solving Problem Correctly!
Date: Wed, 3 Mar 2004 22:37:55 EST

I have ten disks in two Linux RAID 5 arrays, holding the video and audio files I use for video editing. A couple of hours ago I was viewing a 90-minute video sequence (on a Windows machine that's networked to my Linux box) when, about 45 minutes in, an image got corrupted: strange blocks of color appeared in random places in the image, and my video editing program came to a halt.

When I moved backwards in the editing program's timeline, the images were okay for a while (I assume because good data was still cached in the sizable RAM on the Linux box). But when I went back to the beginning of the video sequence and played forward again, ALL of the data was corrupted with these random blocks of color.

I immediately shut down the Windows machine and checked the RAID status on my Linux box. Running 'cat /proc/mdstat' showed that all drives on both arrays were ACTIVE and that none were missing. I rebooted the Linux computer (maybe a mistake?), and now one of my arrays, the one that held the material I was looking at when it got corrupted, doesn't start:

[root@localhost avidserver]# mdadm -Av /dev/md6 --uuid=57f26496:25520b96:41757b62:f83fcb7b /dev/sd*
mdadm: looking for devices for /dev/md6
mdadm: /dev/sd is not a block device.
mdadm: /dev/sd has wrong uuid.
mdadm: no RAID superblock on /dev/sda
mdadm: /dev/sda has wrong uuid.
mdadm: /dev/sda1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdb
mdadm: /dev/sdb has wrong uuid.
mdadm: /dev/sdb1 is identified as a member of /dev/md6, slot 4.
mdadm: no RAID superblock on /dev/sdc
mdadm: /dev/sdc has wrong uuid.
mdadm: no RAID superblock on /dev/sdc1
mdadm: /dev/sdc1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdd
mdadm: /dev/sdd has wrong uuid.
mdadm: /dev/sdd1 has wrong uuid.
mdadm: no RAID superblock on /dev/sde
mdadm: /dev/sde has wrong uuid.
mdadm: /dev/sde1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdf
mdadm: /dev/sdf has wrong uuid.
mdadm: /dev/sdf1 is identified as a member of /dev/md6, slot 0.
mdadm: no RAID superblock on /dev/sdg
mdadm: /dev/sdg has wrong uuid.
mdadm: /dev/sdg1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdh
mdadm: /dev/sdh has wrong uuid.
mdadm: /dev/sdh1 is identified as a member of /dev/md6, slot 1.
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has wrong uuid.
mdadm: /dev/sdi1 has wrong uuid.
mdadm: no RAID superblock on /dev/sdj
mdadm: /dev/sdj has wrong uuid.
mdadm: /dev/sdj1 is identified as a member of /dev/md6, slot 2.
mdadm: added /dev/sdh1 to /dev/md6 as 1
mdadm: added /dev/sdj1 to /dev/md6 as 2
mdadm: no uptodate device for slot 3 of /dev/md6
mdadm: added /dev/sdb1 to /dev/md6 as 4
mdadm: added /dev/sdf1 to /dev/md6 as 0
mdadm: /dev/md6 assembled from 4 drives - need all 5 to start it (use --run to insist)

I ran 'mdadm -E /dev/sdXX' on each of the drives in my system to confirm which drives belonged to which array, and which was the problem drive. I found one drive, sdc1, that didn't have a superblock; by process of elimination, it belonged to md6.

[root@localhost avidserver]# mdadm -E /dev/sdc1
mdadm: No super block found on /dev/sdc1 (Expected magic a92b4efc, got 00000000)

TO CORRECT THIS PROBLEM, I am assuming that I have to do the following:

1) mdadm /dev/md6 --fail /dev/sdc1
2) mdadm /dev/md6 --remove /dev/sdc1
3) mdadm /dev/md6 --add /dev/sdc1

Would it be advisable to do steps 1 and 2 first (fail/remove), then assemble the array and confirm that all the data is intact before adding the failed drive back in?
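For what it's worth, the "process of elimination" I did by hand can be sketched in a few lines of Python. This is my own rough sketch, not anything mdadm provides: it just parses the "is identified as a member ... slot N" lines from the verbose assemble output quoted above (abbreviated here) and reports which slot has no up-to-date member.

```python
import re

# Abbreviated verbose output from 'mdadm -Av /dev/md6 ...' above.
ASSEMBLE_LOG = """\
mdadm: /dev/sdb1 is identified as a member of /dev/md6, slot 4.
mdadm: /dev/sdf1 is identified as a member of /dev/md6, slot 0.
mdadm: /dev/sdh1 is identified as a member of /dev/md6, slot 1.
mdadm: /dev/sdj1 is identified as a member of /dev/md6, slot 2.
mdadm: /dev/md6 assembled from 4 drives - need all 5 to start it (use --run to insist)
"""

MEMBER_RE = re.compile(
    r"mdadm: (?P<dev>/dev/\S+) is identified as a member of /dev/md6, slot (?P<slot>\d+)\."
)

def missing_slots(log, nr_disks=5):
    """Return ({slot: device}, [slots with no identified member])."""
    found = {int(m["slot"]): m["dev"] for m in MEMBER_RE.finditer(log)}
    missing = [s for s in range(nr_disks) if s not in found]
    return found, missing

found, missing = missing_slots(ASSEMBLE_LOG)
print("members:", found)
print("missing slots:", missing)   # -> [3], the slot sdc1 should occupy
```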
SHOULD my data be intact without the 5th drive? (I know that's the point of RAID 5, but why was I seeing corrupt data all of a sudden when just seconds before it had been fine throughout the array?) Even if one drive went bad, shouldn't data integrity have been maintained? Or do I have to get the bad drive out of the picture first? And if I assemble the array with one drive failed and the data is still corrupted, what should I do then?

Finally, I have a couple of questions about why this occurred:

1) Why didn't sdc1 just fail? Why was it still in the array after the corruption occurred, and why is it still in the array after rebooting, instead of being automatically marked as failed (which would have allowed the array to restart with 4 drives)?

2) Could the cause of the problem be related to this entry I found in my kernel logs?

Mar  3 20:41:43 localhost kernel: RAID5 conf printout:
Mar  3 20:41:43 localhost kernel:  --- rd:5 wd:5 fd:0
Mar  3 20:41:43 localhost kernel:  disk 0, s:0, o:1, n:0 rd:0 us:1 dev:scsi/host3/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 1, s:0, o:1, n:1 rd:1 us:1 dev:scsi/host4/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 2, s:0, o:1, n:2 rd:2 us:1 dev:scsi/host1/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 3, s:0, o:1, n:3 rd:3 us:1 dev:scsi/host5/bus0/target0/lun0/part1
Mar  3 20:41:43 localhost kernel:  disk 4, s:0, o:1, n:4 rd:4 us:1 dev:scsi/host2/bus0/target1/lun0/part1
Mar  3 20:41:44 localhost kernel: XFS: SB read failed

I appreciate your help. The data on this array is very important to me; it will take me two weeks to restore it if that becomes necessary.

Regards,
Andy Liebman

P.S.
I am thinking about designing a program to "almost automatically" run mdadm -E /dev/sdXX on all drives, figure out which device IDs go together by UUID, and determine by process of elimination which drive needs to be failed/removed and/or added, and then just do what needs to be done, perhaps running a diagnostic on the removed drive to determine whether it needs to be replaced, warning the user if that's the case, or adding the drive back in if it's not a problem.

Is there already such a program around? (I don't want to reinvent the wheel.) Would such a "semi-automatic recovery program" be useful, or would it be more dangerous than anything else? I want to find a way to make it possible for folks to use Linux software RAID without having to know all the deep innards of how it works.
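The first stage of that idea, grouping devices by array UUID and flagging any device with no superblock, might look something like this in Python. This is only a rough sketch: the function names and the canned sample strings are mine, and a real version would capture 'mdadm -E' output via subprocess instead of using hard-coded text.

```python
import re

def group_by_uuid(examine_outputs):
    """Group block devices by array UUID, based on 'mdadm -E' output.

    examine_outputs maps device path -> the text 'mdadm -E <dev>' printed.
    Devices whose output reports no superblock are collected separately,
    since those are the candidates for fail/remove/add.
    """
    arrays = {}          # uuid -> list of member devices
    no_superblock = []   # devices mdadm could not read a superblock from
    uuid_re = re.compile(r"UUID : ([0-9a-f:]+)")
    for dev, text in examine_outputs.items():
        if "No super block found" in text:
            no_superblock.append(dev)
            continue
        m = uuid_re.search(text)
        if m:
            arrays.setdefault(m.group(1), []).append(dev)
    return arrays, no_superblock

# Hypothetical sample: two healthy members of md6 plus the damaged sdc1.
sample = {
    "/dev/sdf1": "          UUID : 57f26496:25520b96:41757b62:f83fcb7b",
    "/dev/sdh1": "          UUID : 57f26496:25520b96:41757b62:f83fcb7b",
    "/dev/sdc1": "mdadm: No super block found on /dev/sdc1",
}
arrays, bad = group_by_uuid(sample)
print(arrays)
print(bad)   # -> ['/dev/sdc1']
```

From there, the tool would cross-check each UUID group against the expected member count for the array, and only then (with the user's confirmation) run the fail/remove/add steps.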