Linux RAID subsystem development
* --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
@ 2013-05-07 11:36 Ole Tange
  2013-05-07 11:54 ` NeilBrown
  2013-05-07 11:56 ` Ole Tange
  0 siblings, 2 replies; 7+ messages in thread
From: Ole Tange @ 2013-05-07 11:36 UTC (permalink / raw)
  To: linux-raid

I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:

md1 : active raid6 sdg[0] sdi[12](S) sdt[15](S) sdy[17](S) sdx[16](S) sdh[8] sdw[13] sdo[14] sdk[5] sdd[11] sdc[3] sdv[9] sdn[10]
      27349121408 blocks super 1.2 level 6, 128k chunk, algorithm 2 [9/9] [UUUUUUUUU]
      bitmap: 2/2 pages [8KB], 1048576KB chunk

It is, however, hanging the system.

# remove the bitmap
mdadm -v --grow /dev/md1 -b none

# Do the reshape
mdadm -v --grow /dev/md1 --raid-devices=10 --backup-file=/root/back-md1
mdadm: Need to backup 7168K of critical section..

cat /proc/mdstat
<<hangs>>

dmesg says:

[4328128.021614] md: reshape of RAID array md1
[4328128.021618] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
[4328128.021621] md: using maximum available idle IO bandwidth (but not more than 30000 KB/sec) for reshape.
[4328128.021783] md: using 128k window, over a total of 3907017344k.
[4328128.312637] md: md_do_sync() got signal ... exiting

Disk I/O is blocked to the RAID.

What to do?


/Ole


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 11:36 --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang Ole Tange
@ 2013-05-07 11:54 ` NeilBrown
  2013-05-07 12:08   ` Ole Tange
  2013-05-07 11:56 ` Ole Tange
  1 sibling, 1 reply; 7+ messages in thread
From: NeilBrown @ 2013-05-07 11:54 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On Tue, 7 May 2013 13:36:56 +0200 Ole Tange <tange@binf.ku.dk> wrote:

> I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:
> 
> md1 : active raid6 sdg[0] sdi[12](S) sdt[15](S) sdy[17](S) sdx[16](S)
> sdh[8] sdw[13] sdo[14] sdk[5] sdd[11] sdc[3] sdv[9] sdn[10]
>       27349121408 blocks super 1.2 level 6, 128k chunk, algorithm 2
> [9/9] [UUUUUUUUU]
>       bitmap: 2/2 pages [8KB], 1048576KB chunk
> 
> It is, however, hanging the system.
> 
> # remove the bitmap
> mdadm -v --grow /dev/md1 -b none
> 
> # Do the reshape
> mdadm -v --grow /dev/md1 --raid-devices=10
> --backup-file=/root/back-md1
> mdadm: Need to backup 7168K of critical section..
> 
> cat /proc/mdstat
> <<hangs>>
> 
> dmesg says:
> 
> [4328128.021614] md: reshape of RAID array md1
> [4328128.021618] md: minimum _guaranteed_  speed: 10000 KB/sec/disk.
> [4328128.021621] md: using maximum available idle IO bandwidth (but
> not more than 30000 KB/sec) for reshape.
> [4328128.021783] md: using 128k window, over a total of 3907017344k.
> [4328128.312637] md: md_do_sync() got signal ... exiting
> 
> Disk I/O is blocked to the RAID.
> 
> What to do?

What does
  grep . /sys/block/md1/md/*
show? Or does it hang?
What about "mdadm --examine /dev/sd*"

Did the "mdadm --grow" appear to complete, and return to the shell prompt?

What kernel version?  What mdadm version?

A hanging /proc/mdstat is definitely not a good sign.  The "got signal ...
exiting" isn't good either.  I would expect more messages with that.
You didn't just "grep md" in dmesg did you?  That is a complete dmesg output
for the entire time period that could possibly be relevant?

NeilBrown


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 11:36 --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang Ole Tange
  2013-05-07 11:54 ` NeilBrown
@ 2013-05-07 11:56 ` Ole Tange
  2013-05-07 12:14   ` NeilBrown
  1 sibling, 1 reply; 7+ messages in thread
From: Ole Tange @ 2013-05-07 11:56 UTC (permalink / raw)
  To: linux-raid

On Tue, May 7, 2013 at 1:36 PM, Ole Tange <tange@binf.ku.dk> wrote:

> I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:
:
> It is, however, hanging the system.

I can mdadm -E:

/dev/sdi:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 242d6530:e2562ecb:1dcd2a97:15a1a868
           Name : lemaitre:1  (local to host lemaitre)
  Creation Time : Mon Nov  5 16:27:45 2012
     Raid Level : raid6
   Raid Devices : 10

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 31256138752 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034688 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : 4b8de95b:90a2aed7:c0ae092b:a056dd95

  Reshape pos'n : 8192 (8.00 MiB 8.39 MB)
  Delta Devices : 1 (9->10)

    Update Time : Tue May  7 13:12:19 2013
       Checksum : a4f483fd - correct
         Events : 298792

         Layout : left-symmetric
     Chunk Size : 128K

   Device Role : Active device 9
   Array State : AAAAAAAAAA ('A' == active, '.' == missing)

So it seems stuck on the first 8 MB. Is it safe to reboot?
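
The sizes quoted so far are at least self-consistent, which a little shell arithmetic can confirm. The sketch below assumes that a RAID6 array of n devices holds n-2 devices' worth of data and that every figure is in KiB; the function names are made up for illustration. It also shows that the 7168K "critical section" equals the least common multiple of the old and new stripe widths, although whether mdadm derives the number exactly this way is an assumption here.

```shell
#!/bin/sh
# Sanity-check the sizes reported in this thread (all values in KiB).
per_disk_k=3907017344   # "over a total of 3907017344k" from dmesg
chunk_k=128             # "128k chunk" from /proc/mdstat

# RAID6 uses two devices' worth of space for parity: data devices = n - 2.
raid6_size_k() { echo $(( ($1 - 2) * per_disk_k )); }

# Euclid's algorithm; helper for lcm below.
gcd() {
    a=$1; b=$2
    while [ "$b" -ne 0 ]; do t=$(( a % b )); a=$b; b=$t; done
    echo "$a"
}
lcm() { echo $(( $1 / $(gcd "$1" "$2") * $2 )); }

old_stripe_k=$(( (9 - 2) * chunk_k ))    # data per stripe with 9 devices
new_stripe_k=$(( (10 - 2) * chunk_k ))   # data per stripe with 10 devices

echo "9-device array size:  $(raid6_size_k 9)"
echo "10-device array size: $(raid6_size_k 10)"
echo "lcm of stripe widths: $(lcm "$old_stripe_k" "$new_stripe_k")K"
```

The first value matches the block count in /proc/mdstat (27349121408), the second matches the Array Size in the -E output (31256138752), and the third matches the 7168K that mdadm reported. So the superblock already describes the 10-device layout, consistent with a reshape that has started but not moved past its first few megabytes.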

This hangs:

  grep . /sys/block/md1/md/*

$ mdadm --version
mdadm - v3.2.5 - 18th May 2012

$ uname -r
3.2.0-0.bpo.1-amd64



/Ole


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 11:54 ` NeilBrown
@ 2013-05-07 12:08   ` Ole Tange
  2013-05-07 12:40     ` NeilBrown
  0 siblings, 1 reply; 7+ messages in thread
From: Ole Tange @ 2013-05-07 12:08 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, May 7, 2013 at 1:54 PM, NeilBrown <neilb@suse.de> wrote:
> On Tue, 7 May 2013 13:36:56 +0200 Ole Tange <tange@binf.ku.dk> wrote:
>
>> I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:
:
>> It is, however, hanging the system.
:
>> # Do the reshape
>> mdadm -v --grow /dev/md1 --raid-devices=10
>> --backup-file=/root/back-md1
>> mdadm: Need to backup 7168K of critical section..

This completed - did not hang.

> What does
>   grep . /sys/block/md1/md/*
> show? Or does it hang?

Hangs (ctrl-c works).
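
Since the read can be interrupted, one way to find out which attribute blocks is to read the files one at a time with a time limit. This is only a sketch: it assumes GNU coreutils `timeout` is installed, the 2-second default is arbitrary, and `read_attrs` is a made-up name. If a read were stuck in uninterruptible sleep, `timeout` could not kill it and this would hang as well.

```shell
#!/bin/sh
# Read every regular file in a directory separately, giving up on each
# one after a time limit, so a single blocking read does not hide the rest.
read_attrs() {
    dir=$1
    limit=${2:-2}                      # seconds per file (arbitrary default)
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip subdirectories such as dev-*/
        if out=$(timeout "$limit" cat "$f" 2>/dev/null); then
            printf '%s: %s\n' "$f" "$out"
        else
            # Either the read timed out or the file is not readable
            # (some sysfs attributes are write-only).
            printf '%s: <no data: timed out or unreadable>\n' "$f"
        fi
    done
}

# Example (commented out so sourcing this file has no side effects):
# read_attrs /sys/block/md1/md
```

The files that come back as "no data" narrow down where the md code is blocking, which is more useful in a report than "grep hangs".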

> What about "mdadm --examine /dev/sd*"

https://gist.github.com/anonymous/5532063

The disk box contains more drives than just the array in question. The
interesting array is: 242d6530:e2562ecb:1dcd2a97:15a1a868

> Did the "mdadm --grow" appear to complete, and return to the shell prompt?

Yes.

> What kernel version?  What mdadm version?

$ mdadm --version
mdadm - v3.2.5 - 18th May 2012

$ uname -r
3.2.0-0.bpo.1-amd64

> A hanging /proc/mdstat is definitely not a good sign.  The "got signal ...
> exiting" isn't good either.  I would expect more messages with that.
> You didn't just "grep md" in dmesg did you?  That is a complete dmesg output
> for the entire time period that could possibly be relevant?

dmesg of controller upgrade (after which everything worked fine)
followed by --grow at 4328065.432267

https://gist.github.com/anonymous/5532093
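
For anyone who wants only the part of the log from the --grow onwards, a filter along these lines may help. It is a sketch under two assumptions: that dmesg prints the usual "[seconds.micros]" prefix, and that a line without such a prefix is a continuation of the line before it. `dmesg_since` is a hypothetical helper name.

```shell
#!/bin/sh
# Keep only kernel log lines stamped at or after a given time (in seconds
# since boot, as shown in the dmesg prefix). Reads the log on stdin.
dmesg_since() {
    awk -v start="$1" '
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the brackets and convert the timestamp to a number.
            ts = substr($0, 2, RLENGTH - 2) + 0
            keep = (ts >= start)
        }
        keep    # lines without a timestamp inherit the previous decision
    '
}

# Example (commented out so sourcing this file has no side effects):
# dmesg | dmesg_since 4328065
```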

/Ole


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 11:56 ` Ole Tange
@ 2013-05-07 12:14   ` NeilBrown
  2013-05-07 12:16     ` Ole Tange
  0 siblings, 1 reply; 7+ messages in thread
From: NeilBrown @ 2013-05-07 12:14 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On Tue, 7 May 2013 13:56:55 +0200 Ole Tange <tange@binf.ku.dk> wrote:

> On Tue, May 7, 2013 at 1:36 PM, Ole Tange <tange@binf.ku.dk> wrote:
> 
> > I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:
> :
> > It is, however, hanging the system.
> 
> I can mdadm -E:
> 
> /dev/sdi:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x4
>      Array UUID : 242d6530:e2562ecb:1dcd2a97:15a1a868
>            Name : lemaitre:1  (local to host lemaitre)
>   Creation Time : Mon Nov  5 16:27:45 2012
>      Raid Level : raid6
>    Raid Devices : 10
> 
>  Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
>      Array Size : 31256138752 (29808.18 GiB 32006.29 GB)
>   Used Dev Size : 7814034688 (3726.02 GiB 4000.79 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : active
>     Device UUID : 4b8de95b:90a2aed7:c0ae092b:a056dd95
> 
>   Reshape pos'n : 8192 (8.00 MiB 8.39 MB)
>   Delta Devices : 1 (9->10)
> 
>     Update Time : Tue May  7 13:12:19 2013
>        Checksum : a4f483fd - correct
>          Events : 298792
> 
>          Layout : left-symmetric
>      Chunk Size : 128K
> 
>    Device Role : Active device 9
>    Array State : AAAAAAAAAA ('A' == active, '.' == missing)
> 
> So it seems stuck on the first 8 MB. Is it safe to reboot?
> 
> This hangs:
> 
>   grep . /sys/block/md1/md/*
> 
> $ mdadm --version
> mdadm - v3.2.5 - 18th May 2012
> 
> $ uname -r
> 3.2.0-0.bpo.1-amd64
> 

It should be safe to reboot, though until we know why it is hanging I cannot
promise it won't hang straight away again.

You didn't answer my question about dmesg output:  did you leave anything out?

NeilBrown


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 12:14   ` NeilBrown
@ 2013-05-07 12:16     ` Ole Tange
  0 siblings, 0 replies; 7+ messages in thread
From: Ole Tange @ 2013-05-07 12:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

On Tue, May 7, 2013 at 2:14 PM, NeilBrown <neilb@suse.de> wrote:

> You didn't answer my question about dmesg output:  did you leave anything out?

Nothing left out on:

https://gist.github.com/anonymous/5532093


/Ole


* Re: --grow RAID6 gives: md: md_do_sync() got signal ... exiting + hang
  2013-05-07 12:08   ` Ole Tange
@ 2013-05-07 12:40     ` NeilBrown
  0 siblings, 0 replies; 7+ messages in thread
From: NeilBrown @ 2013-05-07 12:40 UTC (permalink / raw)
  To: Ole Tange; +Cc: linux-raid

On Tue, 7 May 2013 14:08:14 +0200 Ole Tange <tange@binf.ku.dk> wrote:

> On Tue, May 7, 2013 at 1:54 PM, NeilBrown <neilb@suse.de> wrote:
> > On Tue, 7 May 2013 13:36:56 +0200 Ole Tange <tange@binf.ku.dk> wrote:
> >
> >> I am expanding my 9 harddisk RAID6 to 10 harddisk RAID6:
> :
> >> It is, however, hanging the system.
> :
> >> # Do the reshape
> >> mdadm -v --grow /dev/md1 --raid-devices=10
> >> --backup-file=/root/back-md1
> >> mdadm: Need to backup 7168K of critical section..
> 
> This completed - did not hang.
> 
> > What does
> >   grep . /sys/block/md1/md/*
> > show? Or does it hang?
> 
> Hangs (ctrl-c works).
> 
> > What about "mdadm --examine /dev/sd*"
> 
> https://gist.github.com/anonymous/5532063
> 
> The disk box contains more drives than just the array in question. The
> interesting array is: 242d6530:e2562ecb:1dcd2a97:15a1a868
> 
> > Did the "mdadm --grow" appear to complete, and return to the shell prompt?
> 
> Yes.
> 
> > What kernel version?  What mdadm version?
> 
> $ mdadm --version
> mdadm - v3.2.5 - 18th May 2012
> 
> $ uname -r
> 3.2.0-0.bpo.1-amd64
> 
> > A hanging /proc/mdstat is definitely not a good sign.  The "got signal ...
> > exiting" isn't good either.  I would expect more messages with that.
> > You didn't just "grep md" in dmesg did you?  That is a complete dmesg output
> > for the entire time period that could possibly be relevant?
> 
> dmesg of controller upgrade (after which everything worked fine)
> followed by --grow at 4328065.432267
> 
> https://gist.github.com/anonymous/5532093
> 
> /Ole

Thanks for the extra info.  I can't find any smoking gun unfortunately.

What does "ps axgu" show?  I'm particularly looking for processes in 'D'
state.
If there are any, particularly if they are md related, try
  cat /proc/$PID/stack
for appropriate values of $PID

Maybe also try
   echo t > /proc/sysrq-trigger

and see what gets into 'dmesg' - hopefully your dmesg buffer is big enough to
hold the important stack traces.
If you get anything from either of those, please post.
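
The first of those two steps could be scripted roughly as below. This is a sketch, not a tested recipe for this machine: it assumes a procps-style `ps` that accepts `-eo pid=,stat=,comm=`, a kernel that exposes /proc/PID/stack, and root privileges to read it. The function names are made up for illustration.

```shell
#!/bin/sh
# Print "PID COMMAND" for every process whose state starts with 'D'
# (uninterruptible sleep). Expects "PID STAT COMMAND" lines on stdin;
# only the first word of the command name is kept.
filter_d_state() {
    awk '$2 ~ /^D/ { print $1, $3 }'
}

# Show the kernel stack of each blocked process.
dump_blocked_stacks() {
    ps -eo pid=,stat=,comm= | filter_d_state | while read -r pid comm; do
        echo "== $pid ($comm)"
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack not readable)"
    done
}

# Example (commented out so sourcing this file has no side effects):
# dump_blocked_stacks > /root/blocked-stacks.txt
```

Writing the output to a file on a disk that is not part of md1 matters here, since I/O to the array itself is blocked.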

NeilBrown

