linux-raid.vger.kernel.org archive mirror
* degraded raid 6 (1 bad drive) showing up inactive, only spares
@ 2012-06-07 11:55 Martin Ziler
  2012-06-07 12:29 ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Ziler @ 2012-06-07 11:55 UTC (permalink / raw)
  To: linux-raid

Hello everybody,

I am running a 9-disk raid6 without hot spares. One drive already went bad some time ago; I replaced it and kept using the array without any degraded-raid messages. Recently another drive started going bad according to its SMART info. As it wasn't quite dead, I left the array as it was, without really using it much, while waiting for the replacement drive I had ordered. When I booted the machine to replace the drive, I was greeted by an inactive array with all devices showing up as spares.

md0 : inactive sdh2[0](S) sdi2[7](S) sde2[6](S) sdd2[5](S) sdf2[1](S) sdg2[2](S) sdc1[9](S) sdb2[3](S)
      15579088439 blocks super 1.2

mdadm --examine confirms that. I have already searched the web quite a bit and found this mailing list. Maybe someone here can give me some input. Normally a degraded raid should still be active, so I am quite surprised that my array goes inactive with only one drive missing. I have appended the mdadm --examine output for all the drives. However, the first two should probably suffice, as only /dev/sdk differs from the rest. The faulty drive, sdk, is still recognized as a raid6 member, whereas all the others show up as spares. With lots of bad sectors, sdk isn't accessible anymore.


/dev/sdk2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : raid6
   Raid Devices : 9

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
     Array Size : 27172970496 (12957.08 GiB 13912.56 GB)
  Used Dev Size : 3881852928 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 882eb11a:33b499a7:dd5856b7:165f916c

    Update Time : Fri Jun  1 20:26:45 2012
       Checksum : b8c58093 - correct
         Events : 623119

         Layout : left-symmetric
     Chunk Size : 4096K

   Device Role : Active device 8
   Array State : AAAAAAAAA ('A' == active, '.' == missing)


/dev/sdh2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 44008309:1dfb1408:cabfbd0a:64de3739

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : 27f93899 - correct
         Events : 2

   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

---------------------------------------------------------------------------------------------------------------

/dev/sdi2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 135f196d:184f11a1:09207617:4022e1a5

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : 9ded8f86 - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sde2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 3517bcc4:2acb381f:f5006058:5bd5c831

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : 408957c0 - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 9e8b2d2c:844a009a:fd6914a2:390f10ac

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : e6bdee68 - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sdf2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 87ad38ac:4ccbd831:ee5502cd:28dafaad

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : 2b7a47f6 - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sdg2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : eef2f06f:28f881a5:da857a00:fb90e250

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : 393ba0f8 - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sdc1:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3985162143 (1900.27 GiB 2040.40 GB)
  Used Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 4cf86fb0:6f334e2c:19e89c99:0532f557

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : a6e42bdc - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

/dev/sdb2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
           Name : server:0  (local to host server)
  Creation Time : Mon Jul 25 23:40:50 2011
     Raid Level : -unknown-
   Raid Devices : 0

 Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 4852882a:b8a3989f:aad747c5:25f20d47

    Update Time : Thu Jun  7 12:27:52 2012
       Checksum : a8e25edd - correct
         Events : 2


   Device Role : spare
   Array State :  ('A' == active, '.' == missing)

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-07 11:55 Martin Ziler
@ 2012-06-07 12:29 ` NeilBrown
       [not found]   ` <CAGHsWsm_Xvf59VCuHyJvoMW6peiFHK=YQKGzr3cq=RDk7jyqKg@mail.gmail.com>
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: NeilBrown @ 2012-06-07 12:29 UTC (permalink / raw)
  To: Martin Ziler; +Cc: linux-raid

On Thu, 7 Jun 2012 13:55:32 +0200 Martin Ziler <martin.ziler@googlemail.com>
wrote:

> Hello everybody,
> 
> I am running a 9-disk raid6 without hot spares. One drive already went bad some time ago; I replaced it and kept using the array without any degraded-raid messages. Recently another drive started going bad according to its SMART info. As it wasn't quite dead, I left the array as it was, without really using it much, while waiting for the replacement drive I had ordered. When I booted the machine to replace the drive, I was greeted by an inactive array with all devices showing up as spares.
> 
> md0 : inactive sdh2[0](S) sdi2[7](S) sde2[6](S) sdd2[5](S) sdf2[1](S) sdg2[2](S) sdc1[9](S) sdb2[3](S)
>       15579088439 blocks super 1.2
> 
> mdadm --examine confirms that. I have already searched the web quite a bit and found this mailing list. Maybe someone here can give me some input. Normally a degraded raid should still be active, so I am quite surprised that my array goes inactive with only one drive missing. I have appended the mdadm --examine output for all the drives. However, the first two should probably suffice, as only /dev/sdk differs from the rest. The faulty drive, sdk, is still recognized as a raid6 member, whereas all the others show up as spares. With lots of bad sectors, sdk isn't accessible anymore.

You must be running a 3.2.1 or 3.3 kernel (I think).

You've been bitten by a rather nasty bug.

You can get your data back, but it will require a bit of care, so don't rush
it.

The metadata on almost all the devices has been seriously corrupted.  The
only way to repair it is to recreate the array.
Doing this just writes new metadata and assembles the array.  It doesn't touch
the data so if we get the --create command right, all your data will be
available again.
If we get it wrong, you won't be able to see your data, but we can easily stop
the array and create again with different parameters until we get it right.

The first thing to do is to get a newer kernel.  I would recommend the latest in
the 3.3.y series.

Then you need to:
 - make sure you have a version of mdadm which sets the data offset to 1M
   (2048 sectors).  I think 3.2.3 or earlier does that - don't upgrade to
   3.2.5.
 - find the chunk size - looks like it is 4M, as sdk2 isn't corrupt.
 - find the order of devices.  This should be in your kernel logs in 
    "RAID conf printout".  Hopefully device names haven't changed.

 Then (with new kernel running)

  mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdb2 /dev/sdc1 /dev/sdd2 \
     /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 missing \
     --assume-clean

 Make double-sure you add that --assume-clean.

 Note the last device is 'missing'. That corresponds to sdk2 (which we
 know is device 8 - the last of 9 (0..8)).  It has failed, so it is not part
 of the array any more.  For the others I just guessed the order.  You should
 try to verify it before you proceed (see RAID conf printout in kernel logs).
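
 Since several device orders may have to be tried, it can help to print the
 command for review before running anything. A minimal sketch, assuming bash;
 the function name print_create_cmd is hypothetical, the options are the ones
 given above, and the device order shown is still only the guess from this
 message (sdc1 is the partition name that appears in the --examine output):

```shell
#!/usr/bin/env bash
# Dry run only: prints the mdadm command, never executes it.
print_create_cmd() {
    # Arguments: the nine slots in role order (0..8).
    # Use the literal word "missing" for the failed device.
    if [ "$#" -ne 9 ]; then
        echo "expected 9 slots, got $#" >&2
        return 1
    fi
    echo "mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 $* --assume-clean"
}

# Guessed order from this message; verify it before running the output.
print_create_cmd /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdf2 \
    /dev/sdg2 /dev/sdh2 /dev/sdi2 missing
```

 Printing instead of executing keeps --assume-clean in every attempt and
 makes it easy to compare candidate orders side by side.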

 After the 'create' use "mdadm -E" to look at one device and make sure
 the Data Offset, Avail Dev Size and Array Size are the same as we saw
 on sdk2.
 If they are, try "fsck -n /dev/md0". That assumes ext3 or ext4.  If you had
 something else on the array some other command might be needed.
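
 The sizes to expect can be cross-checked with a little arithmetic. A sketch,
 assuming bash and that all sizes are in 512-byte sectors as in the sdk2
 output; the function names are hypothetical. The sdk2 figures are consistent
 with the used size being the available size rounded down to a whole number
 of 4096K chunks, and the array size being 7 (9 minus 2 parity) times that:

```shell
#!/usr/bin/env bash
# All sizes are in 512-byte sectors, as printed for sdk2 by "mdadm -E".

# Round the available size down to a whole number of chunks.
used_dev_sectors() {
    local avail_sectors=$1 chunk_kib=$2
    local chunk_sectors=$(( chunk_kib * 2 ))   # 1 KiB = 2 sectors
    echo $(( avail_sectors / chunk_sectors * chunk_sectors ))
}

# RAID6 spends two devices' worth of space on parity.
raid6_array_sectors() {
    local raid_devices=$1 used_sectors=$2
    echo $(( (raid_devices - 2) * used_sectors ))
}

used=$(used_dev_sectors 3881859248 4096)
echo "Used Dev Size: $used"                             # 3881852928
echo "Array Size:    $(raid6_array_sectors 9 "$used")"  # 27172970496
```

 If the recreated array reports a different Array Size, the level, device
 count or chunk size given to --create probably differs from the original.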

 If that looks bad, "mdadm -S /dev/md0" and try again with a different order.
 If it looks good, "echo check > /sys/block/md0/md/sync_action" and watch
 "mismatch_cnt" in the same directory.  If it stays low (a few hundred at
 most) all is good.  If it goes up to thousands something is wrong - try
 another order.
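
 Reading the counter can be scripted. A rough sketch, assuming bash; the
 function name and the two cut-off values (500 and 1000) are illustrative
 assumptions standing in for "few hundred" and "thousands", not limits
 defined by md:

```shell
#!/usr/bin/env bash
# Thresholds are illustrative assumptions, not values defined by md.
LOW_MAX=500      # stands in for "a few hundred at most"
HIGH_MIN=1000    # stands in for "up to thousands"

classify_mismatch_cnt() {
    local count=$1
    if   [ "$count" -le "$LOW_MAX" ];  then echo "low: order looks right"
    elif [ "$count" -ge "$HIGH_MIN" ]; then echo "high: try another order"
    else                                    echo "unclear: keep watching"
    fi
}

# Only read the counter when the sysfs file exists on this machine.
cnt_file=/sys/block/md0/md/mismatch_cnt
if [ -r "$cnt_file" ]; then
    classify_mismatch_cnt "$(cat "$cnt_file")"
fi
```

 The counter changes while the check runs, so it is worth reading it more
 than once before drawing a conclusion.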

 Once you have the array working again,
    "echo repair > /sys/block/md0/md/sync_action"
 then add your new device to be rebuilt.

Good luck.
Please ask if you are unsure about anything.

NeilBrown


* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
       [not found]   ` <CAGHsWsm_Xvf59VCuHyJvoMW6peiFHK=YQKGzr3cq=RDk7jyqKg@mail.gmail.com>
@ 2012-06-07 21:14     ` NeilBrown
       [not found]       ` <C9680CD3-8DA8-4FE3-8337-481676213C39@googlemail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-06-07 21:14 UTC (permalink / raw)
  To: Martin Ziler; +Cc: linux-raid

On Thu, 7 Jun 2012 18:49:49 +0200 Martin Ziler <martin.ziler@googlemail.com>
wrote:

> Hello Neil,
> 
> thank you very much for this detailed input. My last reply didn't make it
> into the mailing list due to the format of my mail client (OSX mail). My
> kernel (Ubuntu) was 3.2.0; I upgraded to 3.3.8. The mdadm version was fine.
> 
> I searched the log files I have and was unable to find anything concerning
> my array. Maybe that sort of thing isn't logged in Ubuntu. I did find some
> mails concerning a degraded raid that do not correlate with my current
> breakage. I received the following 2 messages:
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md0 : active (auto-read-only) raid6 sdi2[1] sdh2[0] sdg2[8] sdc1[9] sdd2[5] sdb2[3] sdf2[7] sde2[6]
>       13586485248 blocks super 1.2 level 6, 4096k chunk, algorithm 2 [9/8] [UU_UUUUUU]
> 
> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
> md0 : active (auto-read-only) raid6 sdj2[2] sdg2[8] sdd2[5] sde2[6] sdb2[3] sdf2[7] sdc1[9]
>       13586485248 blocks super 1.2 level 6, 4096k chunk, algorithm 2 [9/7] [__UUUUUUU]
> 
> I conclude that my setup must have been sdh2 [0], sdi2 [1], sdj2 [2], sdb2
> [3], sdd2 [5], sde2 [6], sdf2 [7], sdg2 [8], sdc1 [9]

Unfortunately these numbers are not the roles of the devices in the array.  They
are the order in which the devices were added to the array.
So 0-8 are very likely roles 0-8 in the array.  '9' is then the first spare,
and it stays as '9' even when it becomes active.  So as there is no '4', it
does look likely that 'sdc1' should come between 'sdb2' and 'sdd2'.
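
That reasoning can be written down as a small sketch, assuming bash with
grep, sed and awk available. The function names are hypothetical, and the
rule it applies (a single number of 9 or above fills a single gap in 0-8) is
only the guess described above, so the result still has to be verified:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the "name[number]" notation comes from mdstat.

# Turn "sdi2[1] sdh2[0] ..." into "number device" lines sorted by number.
members_by_number() {
    printf '%s\n' "$1" |
        grep -oE '[[:alnum:]]+\[[0-9]+\]' |
        sed -E 's/^([[:alnum:]]+)\[([0-9]+)\]$/\2 \1/' |
        sort -n
}

# Guess which device held each role 0..n-1.
# A lone device numbered >= n is placed in a lone gap; anything more
# ambiguous is reported as "unknown" rather than guessed.
guess_roles() {
    local raid_devices=$1 members=$2
    members_by_number "$members" | awk -v n="$raid_devices" '
        BEGIN          { extras = 0 }
        $1 + 0 < n + 0 { role[$1 + 0] = $2; next }
                       { extra[++extras] = $2 }
        END {
            gaps = 0
            for (i = 0; i < n; i++) if (!(i in role)) gaps++
            for (i = 0; i < n; i++) {
                if (i in role)                     name = role[i]
                else if (gaps == 1 && extras == 1) name = extra[1]
                else                               name = "unknown"
                printf "%d %s\n", i, name
            }
        }'
}

# The combined list from the quoted message, in mdstat notation.
guess_roles 9 "sdh2[0] sdi2[1] sdj2[2] sdb2[3] sdd2[5] sde2[6] sdf2[7] sdg2[8] sdc1[9]"
```

With that list, role 4 comes out as sdc1. Feeding it a single degraded
mdstat snapshot leaves more than one gap, so it prints "unknown" for those
roles instead of guessing.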

NeilBrown


> sdc1 is the replacement for my first drive that went bad. It's somewhat
> strange that it is now listed as device 9 and not 4, isn't it? I reckon
> that I have to rebuild in that order, notwithstanding.
> 
> regards,
> Martin



* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-07 12:29 ` NeilBrown
       [not found]   ` <CAGHsWsm_Xvf59VCuHyJvoMW6peiFHK=YQKGzr3cq=RDk7jyqKg@mail.gmail.com>
@ 2012-06-07 21:16   ` Oliver Schinagl
  2012-06-07 22:34     ` NeilBrown
       [not found]   ` <alpine.DEB.2.00.1206080743310.25054@uplift.swm.pp.se>
  2 siblings, 1 reply; 11+ messages in thread
From: Oliver Schinagl @ 2012-06-07 21:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: Martin Ziler, linux-raid

Since I'm still working on repairing my own array (using a wrong version
of mdadm corrupted one of my raid10 arrays), I'm trying to hex-edit the
start of an image of the disk to recover the metadata.

A quick question: if I've edited/checked the first superblock (I'm using
https://raid.wiki.kernel.org/index.php/RAID_superblock_formats for
reference, and it looks quite accurate), would I need to check other
areas on the disk for superblocks? Or will the first superblock be enough?
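
For checking an edited image, a minimal sketch, assuming bash, version 1.2
metadata with the superblock 8 sectors into the device (the "Super Offset"
in the --examine output earlier in this thread) and the magic a92b4efc
stored little-endian. The function name sb_magic is hypothetical:

```shell
#!/usr/bin/env bash
# Print the 32-bit magic found at the given sector of an image, as hex.
sb_magic() {
    local image=$1 offset_sectors=${2:-8}
    local bytes
    bytes=$(dd if="$image" bs=512 skip="$offset_sectors" count=1 2>/dev/null |
            od -An -tx1 -N4)
    # The field is assumed little-endian, so reverse the four bytes.
    set -- $bytes
    [ "$#" -eq 4 ] || return 1
    echo "$4$3$2$1"
}

# Usage: pass an image path as the first script argument.
if [ -n "${1:-}" ]; then
    sb_magic "$1"
fi
```

A result of a92b4efc only shows that the magic is in place; it says nothing
about the checksum or the other fields.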

On 07-06-12 14:29, NeilBrown wrote:
> On Thu, 7 Jun 2012 13:55:32 +0200 Martin Ziler<martin.ziler@googlemail.com>
> wrote:
>
>> Hello everybody,
>>
>> I am running a 9-disk raid6 without hot spares. I already had one drive go bad, which I could replace and continue using the array without any degraded raid messages. Recently I had another drive going bad by the smart-info. As it wasn't quite dead I left the array as was without really using it all that much waiting for a replacement drive I ordered. As I booted the machine up in order to replace the drive I was greeted by an inactive array with all devices showing up as spares.
>>
>> md0 : inactive sdh2[0](S) sdi2[7](S) sde2[6](S) sdd2[5](S) sdf2[1](S) sdg2[2](S) sdc1[9](S) sdb2[3](S)
>>        15579088439 blocks super 1.2
>>
>> mdadm --examine confirms that. I already searched the web quite a bit and found this mailing list. Maybe someone in here can give me some input. Normally a degraded raid should still be active. So I am quite surprised that my array with only one drive missing goes inactive. I appended the info mdadm --examine puts out for all the drives. However the first two should probably suffice as only /dev/sdk differs from the rest. The faulty drive - sdk - is still recognized as a raid6 member, wheres all the others show up as spares. With lots of bad sectors sdk isn't accessible anymore.
> You must be running 3.2.1 or 3.3 (I think).
>
> You've been bitten by a rather nasty bug.
>
> You can get your data back, but it will require a bit of care, so don't rush
> it.
>
> The metadata on almost all the devices have been seriously corrupted.  The
> only way to repair it is to recreate the array.
> Doing this just writes new metadata and assembles the array.  It doesn't touch
> the data so if we get the --create command right, all your data will be
> available again.
> If we get it wrong, you won't be able to see your data, but we can easily stop
> the array and create again with different parameters until we get it right.
>
> First thing to do it to get a newer kernel.  I would recommend the latest in
> the 3.3.y series.
>
> Then you need to:
>   - make sure you have a version of mdadm which gets the data offset to 1M
>     (2048 sectors).  I think 3.2.3 or earlier does that - don't upgrade to
>     3.2.5.
>   - find the chunk size - looks like it is 4M, as sdk2 isn't corrupt.
>   - find the order of devices.  This should be in your kernel logs in
>      "RAID conf printout".  Hopefully device names haven't changed.
>
>   Then (with new kernel running)
>
>    mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdb2 /dev/sdc2 /dev/sdd2 \
>       /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 missing \
>       --assume-clean
>
>   Make double-sure you add that --assume-clean.
>
>   Note the last device is 'missing'. That corresponds to sdk2 (which we
>   know is device 8 - the last of 9 (0..8)).  It fails so it not part of the
>   array any more.  The others I just guessed the order.  You should try to
>   verify it before you proceed (see RAID conf printout in kernel logs).
>
>   After the 'create' use "mdadm -E" to look at one device and make sure
>   the Data Offset, Avail Dev Size and Array Size are the same as we saw
>   on sdk2.
>   If it is, try "fsck -n /dev/md0". That assumes ext3 or ext4.  If you had
>   something else on the array some other command might be needed.
>
>   If that looks bad, "mdadm -S /dev/md0" and try again with a different order.
>   If it looks good, "echo check>  /sys/block/md0/md/sync_action" and watch
>    "mismatch_cnt" in the same directory.  If it says low (few hundred at most)
>   all is good.  If it goes up to thousands something is wrong - try another
>   order.
>
>   Once you have the array working again,
>      "echo repair>  /sys/block/md0/md/sync_action"
>   then add your new device to be rebuilt.
>
> Good luck.
> Please ask if you are unsure about anything.
>
> NeilBrown
>
>>
>> /dev/sdk2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : raid6
>>     Raid Devices : 9
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>       Array Size : 27172970496 (12957.08 GiB 13912.56 GB)
>>    Used Dev Size : 3881852928 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : clean
>>      Device UUID : 882eb11a:33b499a7:dd5856b7:165f916c
>>
>>      Update Time : Fri Jun  1 20:26:45 2012
>>         Checksum : b8c58093 - correct
>>           Events : 623119
>>
>>           Layout : left-symmetric
>>       Chunk Size : 4096K
>>
>>     Device Role : Active device 8
>>     Array State : AAAAAAAAA ('A' == active, '.' == missing)
>>
>>
>> /dev/sdh2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 44008309:1dfb1408:cabfbd0a:64de3739
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : 27f93899 - correct
>>           Events : 2
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> ---------------------------------------------------------------------------------------------------------------
>>
>> /dev/sdi2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 135f196d:184f11a1:09207617:4022e1a5
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : 9ded8f86 - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sde2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 3517bcc4:2acb381f:f5006058:5bd5c831
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : 408957c0 - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sdd2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 9e8b2d2c:844a009a:fd6914a2:390f10ac
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : e6bdee68 - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sdf2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 87ad38ac:4ccbd831:ee5502cd:28dafaad
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : 2b7a47f6 - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sdg2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : eef2f06f:28f881a5:da857a00:fb90e250
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : 393ba0f8 - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sdc1:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3985162143 (1900.27 GiB 2040.40 GB)
>>    Used Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 4cf86fb0:6f334e2c:19e89c99:0532f557
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : a6e42bdc - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)
>>
>> /dev/sdb2:
>>            Magic : a92b4efc
>>          Version : 1.2
>>      Feature Map : 0x0
>>       Array UUID : 25be3ab5:ef5f1166:d64b0e0e:4df143ed
>>             Name : server:0  (local to host server)
>>    Creation Time : Mon Jul 25 23:40:50 2011
>>       Raid Level : -unknown-
>>     Raid Devices : 0
>>
>>   Avail Dev Size : 3881859248 (1851.01 GiB 1987.51 GB)
>>      Data Offset : 2048 sectors
>>     Super Offset : 8 sectors
>>            State : active
>>      Device UUID : 4852882a:b8a3989f:aad747c5:25f20d47
>>
>>      Update Time : Thu Jun  7 12:27:52 2012
>>         Checksum : a8e25edd - correct
>>           Events : 2
>>
>>
>>     Device Role : spare
>>     Array State :  ('A' == active, '.' == missing)


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-07 21:16   ` Oliver Schinagl
@ 2012-06-07 22:34     ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2012-06-07 22:34 UTC (permalink / raw)
  To: Oliver Schinagl; +Cc: Martin Ziler, linux-raid


On Thu, 07 Jun 2012 23:16:34 +0200 Oliver Schinagl <oliver+list@schinagl.nl>
wrote:

> Since i'm still working on repairing my own array, and using a wrong 
> version of mdadm corrupted one of my raid10 array, I'm trying to hexedit 
> the start of an image of the disk to recover the metadata.
> 
> A quick question, if I've edited/checked the first superblock,
> (i'm using 
> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats for 
> reference and looks quite accurate)
> 
> Would I need to check other area's on the disk for superblocks? Or will 
> the first superblock be enough?

Are we talking about filesystem superblocks or RAID superblocks?

There is only one RAID superblock - normally 4K from the start (with 1.2
metadata).  There may be lots of filesystem superblocks.  I think extX only
uses the first if it is good, but I don't know for certain.
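
For checking an image by hand, the four magic bytes at the 4K offset can be
dumped directly.  What follows is only a sketch under stated assumptions:
md_magic_bytes is a made-up helper name, the image path is whatever file you
pass in, and the superblock fields are assumed to be stored little-endian, so
the magic that "mdadm -E" prints as a92b4efc should appear on disk as
fc 4e 2b a9.

```sh
#!/bin/sh
# md_magic_bytes IMAGE
# Print the 4 bytes found 4096 bytes into IMAGE as lowercase hex.
# 4096 bytes is the "Super Offset : 8 sectors" reported for 1.2 metadata.
md_magic_bytes() {
  dd if="$1" bs=1 skip=4096 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n'
  echo
}

# Compare against the expected on-disk byte order of a92b4efc
# (little-endian storage is an assumption stated above).
has_md_magic() {
  [ "$(md_magic_bytes "$1")" = "fc4e2ba9" ]
}
```

Something like "has_md_magic disk.img && echo looks like a 1.2 superblock"
would then be a first check before comparing the remaining fields against the
wiki page.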

NeilBrown


> 
> On 07-06-12 14:29, NeilBrown wrote:
> > On Thu, 7 Jun 2012 13:55:32 +0200 Martin Ziler<martin.ziler@googlemail.com>
> > wrote:
> >
> >> Hello everybody,
> >>
> >> I am running a 9-disk raid6 without hot spares. I already had one drive go bad, which I could replace and continue using the array without any degraded raid messages. Recently I had another drive going bad by the smart-info. As it wasn't quite dead I left the array as was without really using it all that much waiting for a replacement drive I ordered. As I booted the machine up in order to replace the drive I was greeted by an inactive array with all devices showing up as spares.
> >>
> >> md0 : inactive sdh2[0](S) sdi2[7](S) sde2[6](S) sdd2[5](S) sdf2[1](S) sdg2[2](S) sdc1[9](S) sdb2[3](S)
> >>        15579088439 blocks super 1.2
> >>
> >> mdadm --examine confirms that. I already searched the web quite a bit and found this mailing list. Maybe someone in here can give me some input. Normally a degraded raid should still be active. So I am quite surprised that my array with only one drive missing goes inactive. I appended the info mdadm --examine puts out for all the drives. However the first two should probably suffice as only /dev/sdk differs from the rest. The faulty drive - sdk - is still recognized as a raid6 member, wheres all the others show up as spares. With lots of bad sectors sdk isn't accessible anymore.
> > You must be running 3.2.1 or 3.3 (I think).
> >
> > You've been bitten by a rather nasty bug.
> >
> > You can get your data back, but it will require a bit of care, so don't rush
> > it.
> >
> > The metadata on almost all the devices have been seriously corrupted.  The
> > only way to repair it is to recreate the array.
> > Doing this just writes new metadata and assembles the array.  It doesn't touch
> > the data so if we get the --create command right, all your data will be
> > available again.
> > If we get it wrong, you won't be able to see your data, but we can easily stop
> > the array and create again with different parameters until we get it right.
> >
> > First thing to do it to get a newer kernel.  I would recommend the latest in
> > the 3.3.y series.
> >
> > Then you need to:
> >   - make sure you have a version of mdadm which gets the data offset to 1M
> >     (2048 sectors).  I think 3.2.3 or earlier does that - don't upgrade to
> >     3.2.5.
> >   - find the chunk size - looks like it is 4M, as sdk2 isn't corrupt.
> >   - find the order of devices.  This should be in your kernel logs in
> >      "RAID conf printout".  Hopefully device names haven't changed.
> >
> >   Then (with new kernel running)
> >
> >    mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdb2 /dev/sdc2 /dev/sdd2 \
> >       /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 missing \
> >       --assume-clean
> >
> >   Make double-sure you add that --assume-clean.
> >
> >   Note the last device is 'missing'. That corresponds to sdk2 (which we
> >   know is device 8 - the last of 9 (0..8)).  It fails so it not part of the
> >   array any more.  The others I just guessed the order.  You should try to
> >   verify it before you proceed (see RAID conf printout in kernel logs).
> >
> >   After the 'create' use "mdadm -E" to look at one device and make sure
> >   the Data Offset, Avail Dev Size and Array Size are the same as we saw
> >   on sdk2.
> >   If it is, try "fsck -n /dev/md0". That assumes ext3 or ext4.  If you had
> >   something else on the array some other command might be needed.
> >
> >   If that looks bad, "mdadm -S /dev/md0" and try again with a different order.
> >   If it looks good, "echo check>  /sys/block/md0/md/sync_action" and watch
> >    "mismatch_cnt" in the same directory.  If it says low (few hundred at most)
> >   all is good.  If it goes up to thousands something is wrong - try another
> >   order.
> >
> >   Once you have the array working again,
> >      "echo repair>  /sys/block/md0/md/sync_action"
> >   then add your new device to be rebuilt.
> >
> > Good luck.
> > Please ask if you are unsure about anything.
> >
> > NeilBrown
> >



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
       [not found]       ` <C9680CD3-8DA8-4FE3-8337-481676213C39@googlemail.com>
@ 2012-06-09 22:09         ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2012-06-09 22:09 UTC (permalink / raw)
  To: Martin Ziler; +Cc: linux-raid


On Sat, 9 Jun 2012 20:14:12 +0200 Martin Ziler <martin.ziler@googlemail.com>
wrote:

> 
> On 07.06.2012 at 23:14, NeilBrown wrote:
> 
> > On Thu, 7 Jun 2012 18:49:49 +0200 Martin Ziler <martin.ziler@googlemail.com>
> > wrote:
> > 
> >> 2012/6/7 NeilBrown <neilb@suse.de>
> >> 
> >>> On Thu, 7 Jun 2012 13:55:32 +0200 Martin Ziler <
> >>> martin.ziler@googlemail.com>
> >>> wrote:
> >>> 
> >>>> Hello everybody,
> >>>> 
> >>>> I am running a 9-disk raid6 without hot spares. I already had one drive
> >>> go bad, which I could replace and continue using the array without any
> >>> degraded raid messages. Recently I had another drive going bad by the
> >>> smart-info. As it wasn't quite dead I left the array as was without really
> >>> using it all that much waiting for a replacement drive I ordered. As I
> >>> booted the machine up in order to replace the drive I was greeted by an
> >>> inactive array with all devices showing up as spares.
> >>>> 
> >>>> md0 : inactive sdh2[0](S) sdi2[7](S) sde2[6](S) sdd2[5](S) sdf2[1](S)
> >>> sdg2[2](S) sdc1[9](S) sdb2[3](S)
> >>>>      15579088439 blocks super 1.2
> >>>> 
> >>>> mdadm --examine confirms that. I already searched the web quite a bit
> >>> and found this mailing list. Maybe someone in here can give me some input.
> >>> Normally a degraded raid should still be active. So I am quite surprised
> >>> that my array with only one drive missing goes inactive. I appended the
> >>> info mdadm --examine puts out for all the drives. However the first two
> >>> should probably suffice as only /dev/sdk differs from the rest. The faulty
> >>> drive - sdk - is still recognized as a raid6 member, wheres all the others
> >>> show up as spares. With lots of bad sectors sdk isn't accessible anymore.
> >>> 
> >>> You must be running 3.2.1 or 3.3 (I think).
> >>> 
> >>> You've been bitten by a rather nasty bug.
> >>> 
> >>> You can get your data back, but it will require a bit of care, so don't
> >>> rush
> >>> it.
> >>> 
> >>> The metadata on almost all the devices have been seriously corrupted.  The
> >>> only way to repair it is to recreate the array.
> >>> Doing this just writes new metadata and assembles the array.  It doesn't
> >>> touch
> >>> the data so if we get the --create command right, all your data will be
> >>> available again.
> >>> If we get it wrong, you won't be able to see your data, but we can easily
> >>> stop
> >>> the array and create again with different parameters until we get it right.
> >>> 
> >>> First thing to do it to get a newer kernel.  I would recommend the latest
> >>> in
> >>> the 3.3.y series.
> >>> 
> >>> Then you need to:
> >>> - make sure you have a version of mdadm which gets the data offset to 1M
> >>>  (2048 sectors).  I think 3.2.3 or earlier does that - don't upgrade to
> >>>  3.2.5.
> >>> - find the chunk size - looks like it is 4M, as sdk2 isn't corrupt.
> >>> - find the order of devices.  This should be in your kernel logs in
> >>>   "RAID conf printout".  Hopefully device names haven't changed.
> >>> 
> >>> Then (with new kernel running)
> >>> 
> >>> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdb2 /dev/sdc2
> >>> /dev/sdd2 \
> >>>    /dev/sde2 /dev/sdf2 /dev/sdg2 /dev/sdh2 /dev/sdi2 missing \
> >>>    --assume-clean
> >>> 
> >>> Make double-sure you add that --assume-clean.
> >>> 
> >>> Note the last device is 'missing'. That corresponds to sdk2 (which we
> >>> know is device 8 - the last of 9 (0..8)).  It fails so it not part of the
> >>> array any more.  The others I just guessed the order.  You should try to
> >>> verify it before you proceed (see RAID conf printout in kernel logs).
> >>> 
> >>> After the 'create' use "mdadm -E" to look at one device and make sure
> >>> the Data Offset, Avail Dev Size and Array Size are the same as we saw
> >>> on sdk2.
> >>> If it is, try "fsck -n /dev/md0". That assumes ext3 or ext4.  If you had
> >>> something else on the array some other command might be needed.
> >>> 
> >>> If that looks bad, "mdadm -S /dev/md0" and try again with a different
> >>> order.
> >>> If it looks good, "echo check > /sys/block/md0/md/sync_action" and watch
> >>> "mismatch_cnt" in the same directory.  If it says low (few hundred at
> >>> most)
> >>> all is good.  If it goes up to thousands something is wrong - try another
> >>> order.
> >>> 
> >>> Once you have the array working again,
> >>>   "echo repair > /sys/block/md0/md/sync_action"
> >>> then add your new device to be rebuilt.
> >>> 
> >>> Good luck.
> >>> Please ask if you are unsure about anything.
> >>> 
> >>> NeilBrown
> >>> 
> >>> 
> >> 
> >> Hello Neil,
> >> 
> >> thank you very much for this detailed input. My last reply didn't make it
> >> into the mailing list due to the format of my mail client (OSX mail). My
> >> kernel (Ubuntu) was 3.2.0 , I upgraded to 3.3.8. mdadm version was fine.
> >> 
> >> I searched the log files I got and was unable to find anything concerning
> >> my array. Maybe that sorta stuff isn't logged in ubuntu. I did find some
> >> mails concerning degraded raid that do not correlate with my current
> >> breakage. I received the following 2 messages:
> >> 
> >> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> >> [raid4] [raid10]
> >> md0 : active (auto-read-only) raid6 sdi2[1] sdh2[0] sdg2[8] sdc1[9] sdd2[5]
> >> sdb2[3] sdf2[7] sde2[6]
> >>      13586485248 blocks super 1.2 level 6, 4096k chunk, algorithm 2 [9/8]
> >> [UU_UUUUUU]
> >> 
> >> Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5]
> >> [raid4] [raid10]
> >> md0 : active (auto-read-only) raid6 sdj2[2] sdg2[8] sdd2[5] sde2[6] sdb2[3]
> >> sdf2[7] sdc1[9]
> >>      13586485248 blocks super 1.2 level 6, 4096k chunk, algorithm 2 [9/7]
> >> [__UUUUUUU]
> >> 
> >> I conclude that my setup must have been sdh2 [0], sdi2 [1], sdj2 [2], sdb2
> >> [3], sdd2 [5] , sde2 [6], sdf2 [7], sdg2 [8], sdc1 [9]
> > 
> > Unfortunately these number are not the roles of the device in the array.  They
> > are the order in which the devices were added to the array.
> > So 0-8 are very likely roles 0-8 in the array.  '9' is then the first spare,
> > and it stays as '9' even when it becomes active.  So as there is no '4', it
> > does look likely that 'sdc1' should come between  'sdb2' and 'sdd2'.
> > 
> > NeilBrown
> > 
> > 
> >> sdc1 is the replacement for my first drive that went bad. It's somewhat
> >> strange that it is now listed as device 9 and not 4, isn't it? I reckon
> >> that I have to rebuild in that order, notwithstanding.
> >> 
> >> regards,
> >> Martin
> > 
> 
> 
> Hello Neil,
> 
> I tracked the cables in my case and tried some permutations:
> 
> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdh2 /dev/sdi2 /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdf2 missing --assume-clean
> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdf2 missing /dev/sdh2 /dev/sdi2 --assume-clean
> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2   /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdh2 missing /dev/sdf2 /dev/sdi2 --assume-clean
> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2   /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdi2 missing /dev/sdf2 /dev/sdh2 --assume-clean
> mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 /dev/sdi2 /dev/sdh2 /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdf2 missing --assume-clean
> 
> The first ones did result in metadata that looked fine but the fsck-output did not look good at all:
> 
> e2fsck 1.42 (29-Nov-2011)
> fsck.ext4: Superblock invalid, trying backup blocks...
> fsck.ext4: Bad magic number in super-block while trying to open /dev/md0
> 
> The superblock could not be read or does not describe a correct ext2
> filesystem.  If the device is valid and it really contains an ext2
> filesystem (and not swap or ufs or something else), then the superblock
> is corrupt, and you might try running e2fsck with an alternate superblock:
>     e2fsck -b 8193 <device>
> 
> The last one resulted in this fsck output:
> 
> e2fsck 1.42 (29-Nov-2011)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> fsck.ext4: Bad magic number in super-block when using the backup blocks
> fsck.ext4: going back to original superblock
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> fsck.ext4: Bad magic number in super-block when using the backup blocks
> fsck.ext4: going back to original superblock
> Error reading block 3823364034 (Invalid argument). Ignore error? no
> 
> Superblock has an invalid journal (inode 8).
> Clear? no
> 
> fsck.ext4: Illegal inode number while checking ext3 journal for /dev/md0
> 
> /dev/md0: ********** WARNING: Filesystem still has errors **********
> 
> If I interpret that correctly, the filesystem ext4 is now recognized. Do you think I should now go on with echo check > /sys/block/md0/md/sync_action?
> 

The "echo check ...." is read-only and so harmless - you can do it any time
you like.  To stop it if it is showing lots of mismatches just "echo idle" to
the same file.
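
Roughly, the check/watch/stop cycle could be wrapped up like this.  It is a
sketch, not something I have run on your machine: the function names and the
MD_DIR and POLL variables are invented for illustration, the limit of 1000
just mirrors the "thousands means something is wrong" rule of thumb, and the
sysfs files are the ones already mentioned in this thread.

```sh
#!/bin/sh
# Directory holding sync_action and mismatch_cnt for the array.
# MD_DIR is a hypothetical override, handy for trying this on a scratch dir.
md_dir=${MD_DIR:-/sys/block/md0/md}

start_check() { echo check > "$md_dir/sync_action"; }
stop_check()  { echo idle  > "$md_dir/sync_action"; }
mismatches()  { cat "$md_dir/mismatch_cnt"; }

# watch_check [LIMIT]
# Poll while a check is running; stop it and return 1 as soon as
# mismatch_cnt exceeds LIMIT (default 1000).  Return 0 if the check
# ends on its own with the count still at or below the limit.
watch_check() {
  limit=${1:-1000}
  while [ "$(cat "$md_dir/sync_action")" = check ]
  do
    if [ "$(mismatches)" -gt "$limit" ]
    then
      stop_check
      return 1
    fi
    sleep "${POLL:-60}"
  done
  return 0
}
```

Usage would be "start_check; watch_check 1000 || echo wrong order", run as
root once the array has been created.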

However that e2fsck output doesn't look good.
It does find a superblock, but then when it goes to look for "group
descriptors" they are bad.
Also: "Read error - block 3823364034 (Invalid argument)." (from
google-translate to English) suggests that the filesystem thinks the array is
bigger  than it is.
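
The arithmetic behind that can be sketched as follows.  The array size is the
"13586485248 blocks" figure (1 KiB each) from the mdstat output you mailed;
the 4 KiB filesystem block size is an assumption on my part, and fs_blocks is
just a helper name made up for this example.

```sh
#!/bin/sh
# fs_blocks ARRAY_KIB BLOCK_KIB
# Number of filesystem blocks that fit in an array of ARRAY_KIB KiB.
fs_blocks() { echo $(( $1 / $2 )); }

array_kib=13586485248   # from the mdstat output in this thread (1 KiB blocks)
block_kib=4             # assumption: ext4 created with 4 KiB blocks
bad_block=3823364034    # block number from the e2fsck read error

limit=$(fs_blocks "$array_kib" "$block_kib")
if [ $(( bad_block >= limit )) -eq 1 ]
then echo "block $bad_block is past the end of the array ($limit blocks)"
else echo "block $bad_block is inside the array ($limit blocks)"
fi
```

If the block size assumption holds, the block e2fsck tried to read lies beyond
the last block the array can hold, which is what "Invalid argument" hints at.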

This probably suggests that the first device is the correct one, but other
devices are still in the wrong order.

I suggest some more permutations.  It shouldn't be too hard to write a script
to try them all... might take a little while though.

The following script, if run with


sh permute.sh --prefix "mdadm --create /dev/md0 -l6 -n9 -c 4M -e 1.2 --assume-clean" /dev/sdh2 /dev/sdi2 /dev/sdj2 /dev/sdb2 /dev/sdc1 /dev/sdd2 /dev/sde2 /dev/sdf2 missing

will output all possible "mdadm --create" commands with different
permutations.  Don't know if you want to try it or not.  There are only
362880 possibilities :-)  Change the 'echo' to 'eval', then add
'fsck -n /dev/md0' and 'mdadm -S /dev/md0' and collect the output for
examination the next morning.

NeilBrown



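#!/bin/sh
# permute.sh: print every permutation of the arguments, one per line,
# each preceded by the text given with --prefix.

case $1 in
 --prefix )
   prefix=$2
   shift 2
   ;;
 * ) prefix=
   ;;
esac

# Only one argument left: the permutation is complete, so print it.
if [ $# -eq 1 ]
then echo $prefix $1
     exit 0
fi

# Take each argument in turn as the next element, then recurse on the
# remaining ones ($early holds those already passed over in this loop).
early=
while [ $# -ge 1 ]
do
  a=$1
  shift
  sh "$0" --prefix "$prefix $a" $early $*
  early="$early $a"
done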
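
As an alternative to editing the script, its output could be saved to a file
and fed to a small driver loop.  This is only a sketch: try_permutations, the
log format and the FSCK and STOP override variables are invented here, and it
assumes one complete "mdadm --create" command per input line.  Because the
commands run with stdin closed, mdadm cannot ask for confirmation; adding
--run to the prefix may be needed to suppress that question (check the mdadm
man page for your version).

```sh
#!/bin/sh
# try_permutations LOGFILE
# Read one array-creation command per line on stdin.  For each one that
# succeeds, run a read-only fsck and stop the array again, appending all
# output to LOGFILE under a header naming the attempt.
try_permutations() {
  log=$1
  n=0
  while IFS= read -r cmd
  do
    n=$((n + 1))
    echo "=== attempt $n: $cmd" >> "$log"
    # </dev/null keeps each command from eating the command list on stdin.
    if eval "$cmd" >> "$log" 2>&1 < /dev/null
    then
      # FSCK and STOP are hypothetical overrides; the defaults are the
      # commands suggested earlier in this thread.
      ${FSCK:-fsck -n /dev/md0} >> "$log" 2>&1 < /dev/null
      ${STOP:-mdadm -S /dev/md0} >> "$log" 2>&1 < /dev/null
    fi
  done
}
```

For example: save the permute.sh output to cmds.txt, then run
"try_permutations fsck.log < cmds.txt" and look through fsck.log afterwards.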


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
       [not found]   ` <alpine.DEB.2.00.1206080743310.25054@uplift.swm.pp.se>
@ 2012-06-15  7:37     ` NeilBrown
  2012-06-15 11:35       ` Oliver Schinagl
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-06-15  7:37 UTC (permalink / raw)
  To: Mikael Abrahamsson, linux RAID; +Cc: Oliver Schinagl, Martin Ziler



At the suggestion of Mikael Abrahamsson I have written

  http://neil.brown.name/blog/20120615073245

discussing this bug in detail.

I'll probably post the whole thing to Linux RAID tomorrow, with a better
subject line.

NeilBrown


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-15  7:37     ` NeilBrown
@ 2012-06-15 11:35       ` Oliver Schinagl
  2012-06-15 12:37         ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Schinagl @ 2012-06-15 11:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux RAID

Neil,

Good read! I don't know if it's important to have a list of all affected
kernels, but as a Gentoo user, I know that it affects at least the
unstable branch, and I would not be surprised if it even affects the
stable hardened kernel. I run those on my servers and may have noticed
that behavior on my arrays as well.

Oliver

On 15-06-12 09:37, NeilBrown wrote:
> At the suggestion of Mikael Abrahamsson I have written
>
>    http://neil.brown.name/blog/20120615073245
>
> discussing this bug in detail.
>
> I'll probably post the whole thing to Linux RAID tomorrow, with a better
> subject line.
>
> NeilBrown


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-15 11:35       ` Oliver Schinagl
@ 2012-06-15 12:37         ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2012-06-15 12:37 UTC (permalink / raw)
  To: Oliver Schinagl; +Cc: linux RAID


On Fri, 15 Jun 2012 13:35:15 +0200 Oliver Schinagl <oliver+list@schinagl.nl>
wrote:

> Neil,
> 
> Good read! I don't know if it's important to have a list of all affected
> kernels, but as a Gentoo user I can say that it affects at least the
> unstable branch, and I would not be surprised if it even affects the
> stable hardened kernel. I run those on my servers and may have noticed
> that behavior on my arrays as well.

Thanks.

It would be useful to add other kernels, but only where a vendor has
backported the offending patch to an earlier release, and then I'd want
actual release numbers from the vendor.

So if you could show me how to decode the gentoo kernel package information
so I could see exactly which patches were in which release, I could work out
the release numbers from that ....

Thanks,
NeilBrown


> 
> Oliver
> 
> On 15-06-12 09:37, NeilBrown wrote:
> > At the suggestion of Mikael Abrahamsson I have written
> >
> >    http://neil.brown.name/blog/20120615073245
> >
> > discussing this bug in detail.
> >
> > I'll probably post the whole thing to Linux RAID tomorrow, with a better
> > subject line.
> >
> > NeilBrown



* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
@ 2012-06-17 15:52 Martin Ziler
  2012-06-20 21:56 ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Martin Ziler @ 2012-06-17 15:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux RAID

Hello everybody,

Long time no see. As I was unable to find a permutation that resulted in fsck output that looked OK, I generated the 360,000 permutations using Neil's shell script. I then wrote a little script that would try a permutation, run fsck, stop the array, and move on to the next permutation, appending the fsck output to a text file. The whole run of 360,000 permutations just recently finished, after some 6 days of building and fscking. I have not yet analyzed the output. I did, however, search for the term "clean", as a positive result would possibly contain that term.

I got 711 different drive combinations that result in this: /dev/md0: clean, 204968/849158144 files, 3394436623/3396621312 blocks.

If there really is only one working drive order, I guess I am pretty much doomed. I do hope I'll be able to rule out a large portion of those results before going into more detail. I did understand correctly, though, that there is only one possible order resulting in a working array? I have to figure out how to analyze the output file now. It's 125 MB. I guess I'll probably split it and load it into my spreadsheet; in one go I'd hit the 1,000,000-row limit.
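
A first pass over a log of that size may not need a spreadsheet at all. The following is a minimal sketch, assuming the log is plain text with one fsck summary line per attempt; the helper names clean_lines and clean_count and the file name fsck-results.log are hypothetical, since the real log layout is not shown in this thread.

```shell
#!/bin/sh
# Sketch only: the log file name and layout are assumptions, not taken
# from the actual script output.

# Print "line-number:line" for every fsck summary line that reports "clean".
clean_lines() {
    grep -n ': clean,' "$1"
}

# Count how many summary lines report "clean".
clean_count() {
    grep -c ': clean,' "$1"
}

# Example (hypothetical file name):
#   clean_count fsck-results.log
#   clean_lines fsck-results.log | head
```

The line numbers printed by clean_lines could then be used to look up which permutation produced each "clean" result, provided the script logged the drive order next to the fsck output.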

regards,

Martin


* Re: degraded raid 6 (1 bad drive) showing up inactive, only spares
  2012-06-17 15:52 degraded raid 6 (1 bad drive) showing up inactive, only spares Martin Ziler
@ 2012-06-20 21:56 ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2012-06-20 21:56 UTC (permalink / raw)
  To: Martin Ziler; +Cc: linux RAID


On Sun, 17 Jun 2012 17:52:07 +0200 Martin Ziler <martin.ziler@googlemail.com>
wrote:

> Hello everybody,
> 
> Long time no see. As I was unable to find a permutation that resulted in fsck output that looked OK, I generated the 360,000 permutations using Neil's shell script. I then wrote a little script that would try a permutation, run fsck, stop the array, and move on to the next permutation, appending the fsck output to a text file. The whole run of 360,000 permutations just recently finished, after some 6 days of building and fscking. I have not yet analyzed the output. I did, however, search for the term "clean", as a positive result would possibly contain that term.
> 
> I got 711 different drive combinations that result in this: /dev/md0: clean, 204968/849158144 files, 3394436623/3396621312 blocks.
> 
> If there really is only one working drive order, I guess I am pretty much doomed. I do hope I'll be able to rule out a large portion of those results before going into more detail. I did understand correctly, though, that there is only one possible order resulting in a working array? I have to figure out how to analyze the output file now. It's 125 MB. I guess I'll probably split it and load it into my spreadsheet; in one go I'd hit the 1,000,000-row limit.
> 
> regards,
> 
> Martin

Yes, there is only one correct order.

711 is very close to 720 which is 6!  I wonder if that is significant.

Maybe you need "fsck -n -f" to force it to do a more thorough check even
though the fs appears to be clean.

If you "echo check > /sys/block/mdXX/md/sync_action", wait 30 seconds or so,
then "echo idle > ......", then check "mismatch_cnt", the correct ordering
should have a significantly lower number - probably zero.
Other orders will have a high number.
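
The steps above can be sketched as a small shell function. This is a rough sketch, not a tested recovery tool: the function name mismatch_after_check and the MD_SYSFS_ROOT override are hypothetical (the override only exists so the function can be dry-run against a scratch directory), and it assumes the array has already been assembled in the order being tested.

```shell
#!/bin/sh
# Sketch only: run a short "check" pass on an assembled md array and
# print its mismatch_cnt. MD_SYSFS_ROOT is a hypothetical override for
# dry runs; by default the real /sys/block tree is used.

mismatch_after_check() {
    md="$1"                 # array name, e.g. md0
    wait_secs="${2:-30}"    # "30 seconds or so", per the advice above
    base="${MD_SYSFS_ROOT:-/sys/block}/$md/md"

    # Start the check pass.
    echo check > "$base/sync_action" || return 1
    sleep "$wait_secs"
    # Interrupt the check again.
    echo idle > "$base/sync_action" || return 1
    # Report how many mismatches were counted so far.
    cat "$base/mismatch_cnt"
}

# Example (as root, with the candidate order assembled as md0):
#   mismatch_after_check md0 30
```

Running this once per candidate order and comparing the printed numbers would pick out the ordering with the lowest count.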

NeilBrown


