RAID6 fails to assemble after unclean shutdown

linux-raid.vger.kernel.org archive mirror

* RAID6 fails to assemble after unclean shutdown
@ 2012-04-25 10:35 Brian Candler
  2012-04-25 11:01 ` NeilBrown
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Candler @ 2012-04-25 10:35 UTC (permalink / raw)
  To: linux-raid

I have a storage box (currently under test) which has two 12-drive RAID6
arrays, /dev/md/data1 and /dev/md/data2.

The box crashed for an unrelated reason, and when I brought it back up, only
one of the arrays assembled:

  root@storage1:~# cat /proc/mdstat
  Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
  md126 : active raid6 sdj[8] sdk[9] sdd[2] sde[3] sdi[7] sdm[11] sdg[5] sdc[1] sdb[0] sdl[10] sdh[6] sdf[4]
        29302650880 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]

  md127 : inactive sdq[3](S) sdx[10](S) sdu[6](S) sdt[5](S) sds[4](S) sdv[8](S) sdp[2](S) sdy[11](S) sdo[1](S) sdn[0](S) sdw[9](S) sdr[7](S)
        35163186720 blocks super 1.2

  unused devices: <none>

So it looks like all 12 of the disks have become spares (S)!

An attempt to manually assemble the array failed:

  root@storage1:~# mdadm --stop /dev/md127
  mdadm: stopped /dev/md127
  root@storage1:~# mdadm --assemble /dev/md/disk2 /dev/sd{n..y}
  mdadm: /dev/md/disk2 assembled from 4 drives - not enough to start the array.

Since this system is currently under test, I just forcibly recreated the
array, but I'm a bit worried about how I would handle this problem when I go
into production.

Here is how I recreated the array:

  root@storage1:~# mdadm --create /dev/md/disk2 -n 12 -c 1024 -l raid6 /dev/sd{n..y}
  mdadm: /dev/sdn appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdo appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdp appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdq appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdr appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sds appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdt appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdu appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdv appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdw appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdx appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  mdadm: /dev/sdy appears to be part of a raid array:
      level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
  Continue creating array? y
  mdadm: Defaulting to version 1.2 metadata
  mdadm: array /dev/md/disk2 started.

So it seems like all the disks were known to be part of an array, but mdadm
was still unable to assemble more than 4.

Platform: Ubuntu 11.10 server x86_64, stock kernel:

  Linux storage1 3.0.0-16-server #29-Ubuntu SMP Tue Feb 14 13:08:12 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Unfortunately I saw the same problem once before on a different test system,
and also had to forcibly rebuild the array.

So my questions are:

* Have I built the RAID array correctly in the first place? Are there some
options I could have given to mdadm to make it more robust?

* What should I have done when presented with an array which would not
assemble, to attempt to recover without losing data?

* Any ideas why mdadm only thought 4 of the drives were usable?

Thanks,

Brian.


* Re: RAID6 fails to assemble after unclean shutdown
  2012-04-25 10:35 RAID6 fails to assemble after unclean shutdown Brian Candler
@ 2012-04-25 11:01 ` NeilBrown
  2012-04-25 11:16   ` Brian Candler
  0 siblings, 1 reply; 5+ messages in thread
From: NeilBrown @ 2012-04-25 11:01 UTC (permalink / raw)
  To: Brian Candler; +Cc: linux-raid


On Wed, 25 Apr 2012 11:35:36 +0100 Brian Candler <B.Candler@pobox.com> wrote:

> I have a storage box (currently under test) which has two 12-drive RAID6
> arrays, /dev/md/data1 and /dev/md/data2.
> 
> The box crashed for an unrelated reason, and when I brought it back up, only
> one of the arrays assembled:
> 
>   root@storage1:~# cat /proc/mdstat
>   Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
>   md126 : active raid6 sdj[8] sdk[9] sdd[2] sde[3] sdi[7] sdm[11] sdg[5] sdc[1] sdb[0] sdl[10] sdh[6] sdf[4]
>         29302650880 blocks super 1.2 level 6, 1024k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
>         
>   md127 : inactive sdq[3](S) sdx[10](S) sdu[6](S) sdt[5](S) sds[4](S) sdv[8](S) sdp[2](S) sdy[11](S) sdo[1](S) sdn[0](S) sdw[9](S) sdr[7](S)
>         35163186720 blocks super 1.2
>          
>   unused devices: <none>
> 
> So it looks like all 12 of the disks have become spares (S)!

The '(S)' is a bit misleading there.  When the array is 'inactive', every
device is reported as a spare.  Once the array is actually started, the
output becomes more sensible.
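
For illustration, inactive arrays can be picked out of /proc/mdstat with a
small filter.  This is a minimal sketch, not part of mdadm: the function name
inactive_arrays is made up here, and it assumes the "name : state ..." line
layout seen in the output quoted above.

```shell
# List md arrays that /proc/mdstat reports as inactive.
# Reads mdstat-format text on stdin, so it also works on saved output.
# Assumes array lines look like "md127 : inactive sdq[3](S) ...".
inactive_arrays() {
    awk '$2 == ":" && $3 == "inactive" { print $1 }'
}
```

Run as "inactive_arrays < /proc/mdstat"; for the output quoted above it would
name md127 only.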


> 
> An attempt to manually assemble the array failed:
> 
>   root@storage1:~# mdadm --stop /dev/md127
>   mdadm: stopped /dev/md127
>   root@storage1:~# mdadm --assemble /dev/md/disk2 /dev/sd{n..y}
>   mdadm: /dev/md/disk2 assembled from 4 drives - not enough to start the array.

Adding "--verbose" here would help a lot.
Possibly adding "--force" would make it all work.

> 
> Since this system is currently under test, I just forcibly recreated the
> array, but I'm a bit worried about how I would handle this problem when I go
> into production.
> 
> Here is how I recreated the array:
> 
>   root@storage1:~# mdadm --create /dev/md/disk2 -n 12 -c 1024 -l raid6 /dev/sd{n..y}
>   mdadm: /dev/sdn appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdo appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdp appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdq appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdr appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sds appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdt appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdu appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdv appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdw appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdx appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   mdadm: /dev/sdy appears to be part of a raid array:
>       level=raid6 devices=12 ctime=Mon Mar 19 11:52:55 2012
>   Continue creating array? y
>   mdadm: Defaulting to version 1.2 metadata
>   mdadm: array /dev/md/disk2 started.
> 
> So it seems like all the disks were known to be part of an array, but mdadm
> was still unable to assemble more than 4.

I would need to see the "--examine" output of each disk (from before you
recreated the array) to be able to explain.


> 
> Platform: Ubuntu 11.10 server x86_64, stock kernel:
> 
>   Linux storage1 3.0.0-16-server #29-Ubuntu SMP Tue Feb 14 13:08:12 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> 
> Unfortunately I saw the same problem once before on a different test system,
> and also had to forcibly rebuild the array.
> 
> So my questions are:
> 
> * Have I built the RAID array correctly in the first place? Are there some
> options I could have given to mdadm to make it more robust?

Yes, you have built the array correctly.


> 
> * What should I have done when presented with an array which would not
> assemble, to attempt to recover without losing data?

 --verbose
and maybe
 --force
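
Put together, the suggestions in this reply amount to something like the
sequence below.  It is a sketch under assumptions, not a tested recovery
procedure: the helper names run and recover_steps and the DRY_RUN variable
are invented for illustration, and the device names are the ones from the
original report.  By default it only prints the commands.

```shell
# Member disks of the array in this thread (sdn through sdy).
MEMBERS=""
for l in n o p q r s t u v w x y; do
    MEMBERS="$MEMBERS /dev/sd$l"
done

# Print a command; execute it only when DRY_RUN=0 (hypothetical switch).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

recover_steps() {
    # Record every superblock first; this is the evidence needed to
    # explain why only some members were accepted.
    run mdadm --examine $MEMBERS
    # Release the members held by the inactive array.
    run mdadm --stop /dev/md127
    # Retry with diagnostics; fall back to --force only if that fails.
    run mdadm --assemble --verbose /dev/md/disk2 $MEMBERS ||
        run mdadm --assemble --verbose --force /dev/md/disk2 $MEMBERS
}
```

Calling recover_steps with DRY_RUN unset prints the --examine, --stop and
plain --assemble commands; the --force variant is only reached when a real
assemble attempt fails.  When running for real, redirect the output to a file
so the --examine details are kept.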

> 
> * Any ideas why mdadm only thought 4 of the drives were usable?

Presumably something went wrong during shutdown.  However, without more
details (--examine) I cannot guess.


NeilBrown



* Re: RAID6 fails to assemble after unclean shutdown
  2012-04-25 11:01 ` NeilBrown
@ 2012-04-25 11:16   ` Brian Candler
  2012-04-26  2:58     ` Bill Davidsen
  0 siblings, 1 reply; 5+ messages in thread
From: Brian Candler @ 2012-04-25 11:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Wed, Apr 25, 2012 at 09:01:45PM +1000, NeilBrown wrote:
> Presumably something went wrong during shutdown.  However, without more
> details (--examine) I cannot guess.

OK, thank you.

I've also found
https://raid.wiki.kernel.org/articles/r/a/i/RAID_Recovery_d376.html
which shows the --examine option too.

When building the array, would you suggest I also specify --bitmap=internal,
so that there is less data to resync after an error?

Regards,

Brian.


* Re: RAID6 fails to assemble after unclean shutdown
  2012-04-25 11:16   ` Brian Candler
@ 2012-04-26  2:58     ` Bill Davidsen
  2012-04-26  3:50       ` Keith Keller
  0 siblings, 1 reply; 5+ messages in thread
From: Bill Davidsen @ 2012-04-26  2:58 UTC (permalink / raw)
  To: Linux RAID

Brian Candler wrote:
> On Wed, Apr 25, 2012 at 09:01:45PM +1000, NeilBrown wrote:
>> Presumably something went wrong during shutdown.  However, without more
>> details (--examine) I cannot guess.
>
> OK, thank you.
>
> I've also found
> https://raid.wiki.kernel.org/articles/r/a/i/RAID_Recovery_d376.html
> which shows the --examine option too.
>
> When building the array, would you suggest I also specify --bitmap=internal,
> so that there is less data to resync after an error?

You are being very understated here ;-)
Without a bitmap you can spend days getting back up instead of minutes. Brian
has it right: unless there is a reason to avoid it, a bitmap is the
ticket to a sane rebuild.

And if you have a big array, consider putting the bitmap on an SSD; faster is better.
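
As a concrete illustration, a write-intent bitmap can be requested when the
array is created, or added to an existing array with --grow.  The sketch
below is hedged: the wrapper add_internal_bitmap and the DRY_RUN switch are
invented for this example, and /dev/md/disk2 is simply the array name used
earlier in the thread.  By default it only prints the command.

```shell
# Print a command; execute it only when DRY_RUN=0 (hypothetical switch).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Add an internal write-intent bitmap to an existing array.
add_internal_bitmap() {
    run mdadm --grow --bitmap=internal "$1"
}

# At creation time the same option goes on the --create line, e.g.
#   mdadm --create /dev/md/disk2 -n 12 -c 1024 -l raid6 \
#         --bitmap=internal <member devices>
```

For example, "add_internal_bitmap /dev/md/disk2".  Once a bitmap is in place,
/proc/mdstat should show a "bitmap:" line under the array.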


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


* Re: RAID6 fails to assemble after unclean shutdown
  2012-04-26  2:58     ` Bill Davidsen
@ 2012-04-26  3:50       ` Keith Keller
  0 siblings, 0 replies; 5+ messages in thread
From: Keith Keller @ 2012-04-26  3:50 UTC (permalink / raw)
  To: linux-raid

On 2012-04-26, Bill Davidsen <davidsen@tmr.com> wrote:
>
> Without a bitmap you can spend days getting back up instead of minutes. Brian
> has it right: unless there is a reason to avoid it, a bitmap is the
> ticket to a sane rebuild.

My impression was that a bitmap only helped on an unclean shutdown or on
re-adding a disk that was previously removed, and not on failing a drive
and adding a new one.  Is that accurate?  (Of course it would help the
OP if he has another unclean shutdown.)

As Neil posted last month, you can't use mdadm 3.2.3 to create the
bitmap; use 3.2.2 or check out from git.  (I just ran into the bug in
3.2.3 and found the message in the archive.)
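
A quick guard against the affected release might look like the following.
It is only a sketch: the function name is invented, and it assumes the
version banner has the form "mdadm - v3.2.3 - <date>".

```shell
# Succeed if the mdadm version banner on stdin names release 3.2.3,
# the release reported in this thread as unable to create a bitmap.
# Assumes a banner of the form "mdadm - v3.2.3 - <date>".
is_mdadm_3_2_3() {
    ver=$(sed -n 's/.*v\([0-9][0-9.]*\).*/\1/p' | head -n 1)
    [ "$ver" = "3.2.3" ]
}
```

Usage would be along the lines of
"mdadm --version 2>&1 | is_mdadm_3_2_3 && echo 'use another mdadm to create the bitmap'".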

--keith

-- 
kkeller@wombat.san-francisco.ca.us

