''force'' continuation of a rebuild?

Linux RAID subsystem development

* ''force'' continuation of a rebuild?
From: Keith Keller @ 2011-12-15 20:36 UTC
  To: linux-raid

Hello all,

I have another semi-newbie question.  I had an issue, likely hardware
related, which forced me to reboot a machine with a RAID6 during a
rebuild after a previous drive failure.  Now, after some other hardware
issues, I've been able to successfully assemble the array, but it
seems to be in an odd state:

# mdadm -D /dev/md0
/dev/md0:
        Version : 1.01
  Creation Time : Thu Sep 29 21:26:35 2011
     Raid Level : raid6
     Array Size : 13671797440 (13038.44 GiB 13999.92 GB)
  Used Dev Size : 1953113920 (1862.63 GiB 1999.99 GB)
   Raid Devices : 9
  Total Devices : 11
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Dec 15 12:19:41 2011
          State : clean, degraded
 Active Devices : 8
Working Devices : 11
 Failed Devices : 0
  Spare Devices : 3

     Chunk Size : 64K

           Name : 0
           UUID : 24363b01:90deb9b5:4b51e5df:68b8b6ea
         Events : 102730

    Number   Major   Minor   RaidDevice State
       0       8       17        0      active sync   /dev/sdb1
       6       8      113        1      active sync   /dev/sdh1
      11       8      177        2      spare rebuilding   /dev/sdl1
       3       8       65        3      active sync   /dev/sde1
       4       8       81        4      active sync   /dev/sdf1
       9       8      145        5      active sync   /dev/sdj1
      10       8       97        6      active sync   /dev/sdg1
       7       8      129        7      active sync   /dev/sdi1
       8       8      161        8      active sync   /dev/sdk1

      12       8      225        -      spare   /dev/sdo1
      13       8       49        -      spare   /dev/sdd1

# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid6 sdd1[13](S) sdb1[0] sdo1[12](S) sdk1[8] sdi1[7]
sdg1[10] sdj1[9] sdf1[4] sde1[3] sdl1[11] sdh1[6]
      13671797440 blocks super 1.1 level 6, 64k chunk, algorithm 2 [9/8]
[UU_UUUUUU]

unused devices: <none>

I interpret this to mean that a member is missing but, for some reason,
the rebuild onto sdl1 has not restarted.  What would be the next logical
step to take?  I've found some posts which imply that setting sync_action
to "repair" will work, but I'm a little wary of doing that without knowing
how risky it is.  Or, reading Documentation/md.txt, perhaps I should
set it to "recover"?  Or "resync", since it's possible the array was not
shut down cleanly?
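
The reading of the status line above can be checked mechanically.  What
follows is only a minimal sketch, not part of the original post: the
function names mdstat_status and missing_slots are made up for
illustration, and it assumes a single array in /proc/mdstat and a grep
that supports -o.  It only reads; it changes nothing on the array.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from mdadm).

# Print the "[UU_UUUUUU]"-style status string from an mdstat file.
# Defaults to /proc/mdstat; a file argument allows testing on a copy.
mdstat_status() {
    grep -o '\[[U_][U_]*]' "${1:-/proc/mdstat}"
}

# Print the zero-based slot number of every "_" in a status string.
# A "_" marks a slot whose member is not in sync.
missing_slots() {
    rest=$(printf '%s' "$1" | tr -d '[]')
    slot=0
    while [ -n "$rest" ]; do
        case $rest in
            _*) echo "$slot" ;;
        esac
        rest=${rest#?}          # drop the first character
        slot=$((slot + 1))
    done
}
```

Running missing_slots "$(mdstat_status)" against the output shown above
would report slot 2, which is the RaidDevice number that mdadm -D lists
for sdl1 as "spare rebuilding".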

FWIW, I have started the array, activated the LVM volume, and am running
xfs_repair -n (which is not supposed to do any writes), but otherwise
haven't risked modifying the filesystem (e.g., by mounting it).  So far
the xfs_repair seems fine, and has not reported any errors.

Thanks for your help (and patience).

--keith

-- 
kkeller@wombat.san-francisco.ca.us


* Re: ''force'' continuation of a rebuild?
From: NeilBrown @ 2011-12-15 20:53 UTC
  To: Keith Keller; +Cc: linux-raid

On Thu, 15 Dec 2011 12:36:19 -0800 Keith Keller
<kkeller@wombat.san-francisco.ca.us> wrote:

> 
> Hello all,
> 
> I have another semi-newbie question.  I had an issue, likely hardware
> related, which forced me to reboot a machine with a RAID6 during a
> rebuild after a previous drive failure.  Now, after some other hardware
> issues, I've been able to successfully assemble the array, but it
> seems to be in an odd state:
> 
> # mdadm -D /dev/md0
> /dev/md0:
>         Version : 1.01
>   Creation Time : Thu Sep 29 21:26:35 2011
>      Raid Level : raid6
>      Array Size : 13671797440 (13038.44 GiB 13999.92 GB)
>   Used Dev Size : 1953113920 (1862.63 GiB 1999.99 GB)
>    Raid Devices : 9
>   Total Devices : 11
> Preferred Minor : 0
>     Persistence : Superblock is persistent
> 
>     Update Time : Thu Dec 15 12:19:41 2011
>           State : clean, degraded
>  Active Devices : 8
> Working Devices : 11
>  Failed Devices : 0
>   Spare Devices : 3
> 
>      Chunk Size : 64K
> 
>            Name : 0
>            UUID : 24363b01:90deb9b5:4b51e5df:68b8b6ea
>          Events : 102730
> 
>     Number   Major   Minor   RaidDevice State
>        0       8       17        0      active sync   /dev/sdb1
>        6       8      113        1      active sync   /dev/sdh1
>       11       8      177        2      spare rebuilding   /dev/sdl1
>        3       8       65        3      active sync   /dev/sde1
>        4       8       81        4      active sync   /dev/sdf1
>        9       8      145        5      active sync   /dev/sdj1
>       10       8       97        6      active sync   /dev/sdg1
>        7       8      129        7      active sync   /dev/sdi1
>        8       8      161        8      active sync   /dev/sdk1
> 
>       12       8      225        -      spare   /dev/sdo1
>       13       8       49        -      spare   /dev/sdd1
> 
> # cat /proc/mdstat 
> Personalities : [raid6] [raid5] [raid4] 
> md0 : active raid6 sdd1[13](S) sdb1[0] sdo1[12](S) sdk1[8] sdi1[7]
> sdg1[10] sdj1[9] sdf1[4] sde1[3] sdl1[11] sdh1[6]
>       13671797440 blocks super 1.1 level 6, 64k chunk, algorithm 2 [9/8]
> [UU_UUUUUU]
>       
> unused devices: <none>
> 
> I interpret this to mean that a member is missing but, for some reason,
> the rebuild onto sdl1 has not restarted.

Golly, you must be running an ancient kernel ... I fixed this bug at least 2
days ago...  Though admittedly I haven't submitted the fix yet so maybe you
have a good excuse :-)


If you remove both spares:
  mdadm /dev/md0 --remove /dev/sdo1 /dev/sdd1

the rebuild should start.  You can then add them back again with "--add".

http://neil.brown.name/git?p=md;a=commitdiff;h=bd8c7cf40d56ca9ce3a6f72886914193674258d1
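
The workaround above can be sketched as a small shell function.  This is
only an illustration under stated assumptions, not something from the
original message: the name kick_rebuild and the MDADM and SYSFS_ROOT
override variables are hypothetical, and the ten-second wait is an
arbitrary choice.  It relies on the md sysfs file sync_action reading
"recover" while a spare is being rebuilt, as described in
Documentation/md.txt.

```shell
#!/bin/sh
# Hypothetical wrapper around the workaround: remove the idle spares,
# wait for recovery to start, then add the spares back.
# MDADM and SYSFS_ROOT are illustrative overrides for dry runs.
kick_rebuild() {
    md=$1
    shift
    name=${md##*/}      # /dev/md0 -> md0
    action_file="${SYSFS_ROOT:-/sys/block}/$name/md/sync_action"

    # Step 1: remove the spares named on the command line.
    "${MDADM:-mdadm}" "$md" --remove "$@" || return 1

    # Step 2: poll (up to about 10 seconds) until md reports "recover".
    tries=0
    while [ "$(cat "$action_file" 2>/dev/null)" != recover ]; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && return 1
        sleep 1
    done

    # Step 3: the rebuild is running, so the spares can go back in.
    "${MDADM:-mdadm}" "$md" --add "$@"
}
```

For the array in this thread the call would be
kick_rebuild /dev/md0 /dev/sdo1 /dev/sdd1; setting MDADM=echo prints the
two mdadm command lines instead of running them.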


> What would be the next logical step to take? 

Send an email to linux-raid asking who broke what..  Oh wait, you did that.

NeilBrown



> I've found some posts which imply that setting sync_action
> to repair will work, but I'm a little wary of doing that without knowing
> how risky that is.  Or, reading Documentation/md.txt, perhaps I should
> set it to "recover"?  Or "resync", since it's possible the array was not
> shut down cleanly?
> 
> FWIW, I have started the array, activated the LVM volume, and am running
> xfs_repair -n (which is not supposed to do any writes), but otherwise
> haven't risked modifying the filesystem (e.g., by mounting it).  So far
> the xfs_repair seems fine, and has not reported any errors.
> 
> Thanks for your help (and patience).
> 
> --keith
> 



* Re: ''force'' continuation of a rebuild?
From: Keith Keller @ 2011-12-15 22:44 UTC
  To: linux-raid

Hi Neil!

On Fri, Dec 16, 2011 at 07:53:37AM +1100, NeilBrown wrote:
> 
> Golly, you must be running an ancient kernel ... I fixed this bug at least 2
> days ago...  Though admittedly I haven't submitted the fix yet so maybe you
> have a good excuse :-)

I initially misread "2 days" as "2 years", and was slightly puzzled.
I do recall seeing the messages on the list, but obviously didn't pay
enough attention to them.  ;-)

> If you remove both spares:
>   mdadm /dev/md0 --remove /dev/sdo1 /dev/sdd1
> 
> the rebuild should start.  You can then add them back again with "--add".

Yep!  I feel slightly less stupid about posting now--I would not have
thought of that.  (ISTR that one of the pages I found did touch on this
bug in a slightly different context, but I don't remember whether it
mentioned this particular option as a workaround; I sort of think not.)

I think this workaround is perfectly acceptable; I would not want to
build the latest kernel just to fix this.  But if there were a more
critical kernel fix, what is the most consistent way to pick it up?
Would it make more sense to grab the kernel source from ELRepo and
apply patches?  Or would it be better to grab the latest kernel tree,
adapt the ELRepo .config, and hope for the best?  (Very little else
runs on this box, so it should tolerate kernel changes that break
other things, as long as md keeps working.)

> Send an email to linux-raid asking who broke what..  Oh wait, you did that.

:)

--keith

-- 
kkeller@wombat.san-francisco.ca.us
