* Can this setup be saved?
From: Dmitry Teytelman @ 2010-02-13 1:56 UTC
To: linux-raid
Hello,
I've made a mess of my raid setup and am desperately trying to save
it. The setup is RAID-5 on 3 SATA disks. Problems started with one of
the disks getting unrecoverable read errors. Unfortunately I was away
on a trip and the machine was used by my family while this was going
on :(
The array consists of three devices: /dev/sda2, /dev/sdc2, and /dev/sdd2.
When I got back from the trip I found the following:
1. Two disks had been removed from the array, leaving only /dev/sda2;
2. When either of the two removed disks was re-added, the array would start;
3. One combination of two disks (/dev/sda2 + /dev/sdd2) produced a
running /dev/md0 with a proper ext3 filesystem on it (it even passed
fsck).
At this point I added /dev/sdc2 and the reconstruction started.
However, it did not complete, since /dev/sdd2 has unrecoverable errors.
Reading the list archives, I figured I needed another drive to ddrescue
/dev/sdd2 onto, and could then perform the reconstruction.
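Something like the following, I assume (a sketch: /dev/sde2 stands in
for a hypothetical same-size partition on the fresh drive, and the
mapfile lives on separate storage so the copy can be resumed):

  # clone the failing member, logging bad sectors in a mapfile
  ddrescue -f /dev/sdd2 /dev/sde2 /root/sdd2.map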
However, at some point during or after the reconstruction the situation
changed. Now both /dev/sdc2 and /dev/sdd2 are marked as spare
drives (see the mdadm -E output below) and I cannot start the array. I
think /dev/sdd2 should be in sync with /dev/sda2, but how can I bring
it back (it used to be device 2)?
/dev/sda2:
          Magic : a92b4efc
        Version : 0.90.01
           UUID : bd5c2dc0:f76e5f10:a98c4de7:f2020715
  Creation Time : Fri Jun 17 11:47:44 2005
     Raid Level : raid5
  Used Dev Size : 486375808 (463.84 GiB 498.05 GB)
     Array Size : 972751616 (927.69 GiB 996.10 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 0

    Update Time : Fri Feb 12 14:07:35 2010
          State : active
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : a4cd0c48 - correct
         Events : 2125155

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     0       8        2        0      active sync   /dev/sda2

   0     0       8        2        0      active sync   /dev/sda2
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed

/dev/sdc2:
          Magic : a92b4efc
        Version : 0.90.01
           UUID : bd5c2dc0:f76e5f10:a98c4de7:f2020715
  Creation Time : Fri Jun 17 11:47:44 2005
     Raid Level : raid5
  Used Dev Size : 486375808 (463.84 GiB 498.05 GB)
     Array Size : 972751616 (927.69 GiB 996.10 GB)
   Raid Devices : 3
  Total Devices : 1
Preferred Minor : 0

    Update Time : Fri Feb 12 10:30:00 2010
          State : active
 Active Devices : 1
Working Devices : 1
 Failed Devices : 2
  Spare Devices : 0
       Checksum : a4ccd973 - correct
         Events : 2125153

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     4       8        2       -1      spare   /dev/sdc2

   0     0       8       34        0      active sync   /dev/sda2
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed

/dev/sdd2:
          Magic : a92b4efc
        Version : 0.90.01
           UUID : bd5c2dc0:f76e5f10:a98c4de7:f2020715
  Creation Time : Fri Jun 17 11:47:44 2005
     Raid Level : raid5
  Used Dev Size : 486375808 (463.84 GiB 498.05 GB)
     Array Size : 972751616 (927.69 GiB 996.10 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0

    Update Time : Fri Feb 12 10:36:05 2010
          State : active
 Active Devices : 1
Working Devices : 2
 Failed Devices : 2
  Spare Devices : 1
       Checksum : a4ccdb48 - correct
         Events : 2125154

         Layout : left-symmetric
     Chunk Size : 128K

      Number   Major   Minor   RaidDevice State
this     3       8       50        3      spare   /dev/sdd2

   0     0       8       34        0      active sync   /dev/sda2
   1     1       0        0        1      faulty removed
   2     2       0        0        2      faulty removed
   3     3       8       50        3      spare   /dev/sdd2
--
Dmitry Teytelman
* Re: Can this setup be saved?
From: Michael Evans @ 2010-02-13 2:13 UTC
To: Dmitry Teytelman; +Cc: linux-raid
On Fri, Feb 12, 2010 at 5:56 PM, Dmitry Teytelman <dim@dimtel.com> wrote:
> Hello,
>
> I've made a mess of my raid setup and am desperately trying to save
> it. The setup is RAID-5 on 3 SATA disks. Problems started with one of
> the disks getting unrecoverable read errors. Unfortunately I was away
> on a trip and the machine was used by my family while this was going
> on :(
>
> The array consists of three devices: /dev/sda2, /dev/sdc2, and /dev/sdd2.
> When I got back from the trip I found the following:
>
> 1. Two disks had been removed from the array, leaving only /dev/sda2;
> 2. When either of the two removed disks was re-added, the array would start;
> 3. One combination of two disks (/dev/sda2 + /dev/sdd2) produced a
> running /dev/md0 with a proper ext3 filesystem on it (it even passed
> fsck).
>
> At this point I added /dev/sdc2 and the reconstruction started.
> However, it did not complete, since /dev/sdd2 has unrecoverable errors.
> Reading the list archives, I figured I needed another drive to ddrescue
> /dev/sdd2 onto, and could then perform the reconstruction.
>
> However, at some point during or after the reconstruction the situation
> changed. Now both /dev/sdc2 and /dev/sdd2 are marked as spare
> drives (see the mdadm -E output below) and I cannot start the array. I
> think /dev/sdd2 should be in sync with /dev/sda2, but how can I bring
> it back (it used to be device 2)?
>
> [mdadm -E output snipped]
>
> --
> Dmitry Teytelman
It sounds like you've reached the point where you've done something
silly and need to try to recover what you can.
READ THE SITES LINKED BELOW CAREFULLY; perform READ-ONLY recovery, and
see if you can find a permutation that can be mounted read-only and
READ back valid data;
DO NOT REUSE your current disks; they have been kicked for a FAILURE
TO WRITE (if you're using a recent kernel), and should be considered
walking dead. If you are able to read your data, copy it off to a
fresh array.
IF your new disks are large enough to hold a complete copy of your old
drives plus your new copy, then dd_rescue the old drives that form a
valid combination onto partitions at the end of the new disks, and
make your new array at the beginning. When you're done copying, take
the new array offline too, extend the partitions, and grow the array
into the partitions (and any underlying structures, like LVM).
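A rough sketch of that last part, assuming plain partitions under the
md device (all device and mount names here are hypothetical):

  mdadm --stop /dev/md1              # take the new array offline
  # enlarge each member partition with fdisk/parted, then:
  mdadm --assemble /dev/md1 /dev/sdx1 /dev/sdy1 /dev/sdz1
  mdadm --grow /dev/md1 --size=max   # expand members into the larger partitions
  pvresize /dev/md1                  # only if LVM sits on top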
http://raid.wiki.kernel.org/index.php/RAID_Recovery
http://raid.wiki.kernel.org/index.php/Permute_array.pl
#!/usr/bin/perl -w
# If you forgot how you built an array and need to try various
# permutations then this is for you...
# based on Mark-Jason Dominus' mjd_permute: permute each word of input

use strict;
use Getopt::Long;

sub usage {
    return "syntax: permute_array --md <md_device> --mount <mountpoint> [--opts <mdadm options>] [--for_real] <all devices>\n";
}

my $MD_DEVICE;
my $MOUNTPOINT;
my $MDADM_OPTS = "";
my $REAL;

################################################################
# This function is passed each permutation of component devices.
# This includes a 'missing' device.
# This is the place to hack command variations etc...
sub try_array {
    # @_ looks like: ("/dev/sda1", "missing", "/dev/sdb1")
    my @device_list = @_;
    my $num_devices = scalar @_;
    # This may need a --force... <gulp>
    my $create = "yes | mdadm --create $MD_DEVICE --raid-devices=$num_devices --level=5 $MDADM_OPTS @device_list 2>/dev/null";
    # Don't forget to mount read-only
    my $mount  = "mount -o ro $MD_DEVICE $MOUNTPOINT 2>/dev/null";
    my $umount = "umount $MOUNTPOINT 2>/dev/null";
    # and stop the array...
    my $stop   = "mdadm --stop $MD_DEVICE 2>/dev/null";

    # REAL == --for_real option
    if ($REAL) {
        # we expect this to succeed
        system $create;
        if (my $err = $? >> 8) {
            die "command : $create\n exited with status $err\n\n";
        }
        # we expect this to fail and are happy if it succeeds
        system $mount;
        if (!($? >> 8)) {
            print "Success. possible command : \n $create\n";
            system $umount;
        }
        # we expect this to succeed
        system $stop;
        if (my $err = $? >> 8) {
            die "command : $stop\n exited with status $err\n\n";
        }
    } else {
        # Just show the create/mount/stop commands
        # If you want more control you could use this to write a script
        print "$create\n$mount\n$stop\n";
    }
}

################################################################
# Execution starts here...
#
sub factorial($);

GetOptions('md=s'     => \$MD_DEVICE,
           'mount=s'  => \$MOUNTPOINT,
           'opts=s'   => \$MDADM_OPTS,
           'for_real' => \$REAL);

if (!defined($MD_DEVICE) or !defined($MOUNTPOINT)) {
    die &usage;
}
print "using device $MD_DEVICE and mounting on $MOUNTPOINT\n";

# we *always* assume a 'missing' device - not doing so will destroy
# the array...
my @devices = @ARGV;

# how many devices?
my $num_devices = scalar @devices;
if ($num_devices < 2) {
    die "$0 needs at least two component devices\n";
}

# how many base permutations...
my $num_permutations = factorial(scalar @devices);

# try all permutations, substituting 'missing' for each device in
# turn...
for (my $d = 0; $d < $num_devices; $d++) {
    my $skip_device = $devices[$d];
    $devices[$d] = "missing";
    print "skipping $skip_device\n\n";
    for (my $i = 0; $i < $num_permutations; $i++) {
        my @permutation = @devices[n2perm($i, $#devices)];
        try_array(@permutation);
    }
    $devices[$d] = $skip_device;
}

################################################################
# permutation code

# n2pat($N, $len) : produce the $N-th pattern of length $len
sub n2pat {
    my $i = 1;
    my $N = shift;
    my $len = shift;
    my @pat;
    while ($i <= $len + 1) {    # Should really be just while ($N) { ...
        push @pat, $N % $i;
        $N = int($N / $i);
        $i++;
    }
    return @pat;
}

# pat2perm(@pat) : turn pattern returned by n2pat() into
# permutation of integers. XXX: splice is already O(N)
sub pat2perm {
    my @pat = @_;
    my @source = (0 .. $#pat);
    my @perm;
    push @perm, splice(@source, (pop @pat), 1) while @pat;
    return @perm;
}

# n2perm($N, $len) : generate the Nth permutation of $len objects
sub n2perm {
    pat2perm(n2pat(@_));
}

# Utility function: factorial with memoizing
BEGIN {
    my @fact = (1);
    sub factorial($) {
        my $n = shift;
        return $fact[$n] if defined $fact[$n];
        $fact[$n] = $n * factorial($n - 1);
    }
}
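For reference, a dry run of the script (without --for_real it only
prints the candidate create/mount/stop commands; the mount point here
is hypothetical) might look like:

  ./permute_array.pl --md /dev/md0 --mount /mnt/recovery \
      /dev/sda2 /dev/sdc2 /dev/sdd2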
* Re: Can this setup be saved?
From: Giovanni Tessore @ 2010-02-13 8:58 UTC
To: linux-raid
> I've made a mess of my raid setup and am desperately trying to save
> it. The setup is RAID-5 on 3 SATA disks. Problems started with one of
> the disks getting unrecoverable read errors. Unfortunately I was away
> on a trip and the machine was used by my family while this was going
> on :(
>
> The array consists of three devices: /dev/sda2, /dev/sdc2, and /dev/sdd2.
> When I got back from the trip I found the following:
>
> 1. Two disks had been removed from the array, leaving only /dev/sda2;
> 2. When either of the two removed disks was re-added, the array would start;
> 3. One combination of two disks (/dev/sda2 + /dev/sdd2) produced a
> running /dev/md0 with a proper ext3 filesystem on it (it even passed
> fsck).
>
> At this point I added /dev/sdc2 and the reconstruction started.
> However, it did not complete, since /dev/sdd2 has unrecoverable errors.
> Reading the list archives, I figured I needed another drive to ddrescue
> /dev/sdd2 onto, and could then perform the reconstruction.
>
> However, at some point during or after the reconstruction the situation
> changed. Now both /dev/sdc2 and /dev/sdd2 are marked as spare
> drives (see the mdadm -E output below) and I cannot start the array. I
> think /dev/sdd2 should be in sync with /dev/sda2, but how can I bring
> it back (it used to be device 2)?
I recently had a similar problem with a 6-disk array, when one disk
died and another gave read errors during reconstruction. This is my
experience: I was able to recover most of the data by re-creating the
array and copying the data from it to other storage. I used a command
like:
mdadm --create /dev/md3 --assume-clean --level=5 --raid-devices=6
--spare-devices=0 /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4 /dev/sde4 missing
where /dev/sdf4 was the dead disk, which I left out as missing, and
/dev/sdb4 was the one that gave read errors; leaving a disk out as
missing prevents the reconstruction from starting.
It's important to set the md device to read-only mode (if your mdadm
version supports it, you can use the --readonly option directly with
the --create command; see man mdadm), and to mount the device
read-only.
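Putting it together, something like this (a sketch based on my case;
the mount point is hypothetical, and your device list and order will
differ):

  mdadm --create /dev/md3 --assume-clean --readonly --level=5 \
        --raid-devices=6 --spare-devices=0 \
        /dev/sda4 /dev/sdb4 /dev/sdc4 /dev/sdd4 /dev/sde4 missing
  mount -o ro /dev/md3 /mnt/recovery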
This mostly worked for me (I lost about 100 GB out of 2.5 TB).
I hope you can recover your data.
Regards
--
Yours faithfully.
Giovanni Tessore
* Re: Can this setup be saved?
From: Dmitry Teytelman @ 2010-02-15 14:02 UTC
To: linux-raid
Hello all,
On Fri, Feb 12, 2010 at 18:13, Michael Evans <mjevans1983@gmail.com> wrote:
> On Fri, Feb 12, 2010 at 5:56 PM, Dmitry Teytelman <dim@dimtel.com> wrote:
>>
>> However, at some point during or after the reconstruction the situation
>> changed. Now both /dev/sdc2 and /dev/sdd2 are marked as spare
>> drives (see the mdadm -E output below) and I cannot start the array. I
>> think /dev/sdd2 should be in sync with /dev/sda2, but how can I bring
>> it back (it used to be device 2)?
>
> It sounds like you've reached the point where you've done something
> silly and need to try to recover what you can.
>
> READ THE SITES LINKED BELOW CAREFULLY; perform READ-ONLY recovery, and
> see if you can find a permutation that can be mounted read-only and
> READ back valid data;
I am back on the air!!! The critical step was the use of "mdadm
--create" with a missing drive. I knew the proper order of the two
drives that were in sync. One of them (/dev/sdd) was failing, so it
was ddrescue-d onto a new drive prior to any repairs. Performing
"mdadm --create" brought the RAID device back to life. I checked that
the data looked OK (mounted read-only). Then I went ahead and added
the third drive to the array. Soon a fully reconstructed device was up
and running. The failed drives are heading back to Hitachi :)
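For the archives, the sequence boiled down to something like this (a
sketch; /dev/sde is the hypothetical replacement drive that received
the ddrescue copy, and the metadata version, chunk size, and layout
must match the original array):

  ddrescue -f /dev/sdd2 /dev/sde2 sdd2.map     # clone the failing member
  mdadm --create /dev/md0 --metadata=0.90 --level=5 --raid-devices=3 \
        --chunk=128 --layout=left-symmetric \
        /dev/sda2 missing /dev/sde2            # original slot order 0, 1, 2
  mount -o ro /dev/md0 /mnt                    # sanity-check the data first
  umount /mnt
  mdadm --add /dev/md0 /dev/sdc2               # then let the rebuild run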
Thanks to the group for rapid response and critically important bits
of information!
--
Dmitry Teytelman