Reshape Shrink Hung Again

Linux RAID subsystem development

* Reshape Shrink Hung Again
@ 2013-04-19  8:29 Sam Bingner
  2013-04-21  8:26 ` Sam Bingner
  2013-04-21 21:24 ` NeilBrown
  0 siblings, 2 replies; 9+ messages in thread
From: Sam Bingner @ 2013-04-19  8:29 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org

I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  When this happened to me last time, I thought it must have been due to an incorrect backup-file size, with the array shrunk to smaller than the final size, but this time that was not the case.

I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5.  The shrink had no problems except that a drive failed right at the end of the reshape; it then hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure whether the drive failed at the very end or after the reshape had already reached 99.9%, because it ran overnight and I didn't see this until the next morning.

Sam

root@fs:/var/log# uname -a
Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux

Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.

md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
      487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
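
A rough reading of the numbers above, as a sketch rather than an authoritative analysis: /proc/mdstat reports progress in 1 KiB blocks per device, so the stalled reshape has 1024 KiB per device left.  Assuming the usual RAID5 layout, in which a 3-drive array carries data on two drives per stripe, that is about 2 MiB of array data, which agrees with the "only 2MB to go" estimate later in the thread.  The variable names are illustrative only.

```shell
#!/bin/sh
# Values copied from the /proc/mdstat line and the kernel log above.
done_blocks=243853824      # per-device 1 KiB blocks already reshaped
total_blocks=243854848     # per-device 1 KiB blocks to reshape
data_disks=2               # assumption: 3-drive RAID5 = 2 data drives per stripe

remaining_per_dev=$((total_blocks - done_blocks))
remaining_array=$((remaining_per_dev * data_disks))

echo "remaining per device: ${remaining_per_dev} KiB"   # 1024 KiB
echo "remaining array data: ${remaining_array} KiB"     # 2048 KiB

# Cross-check: the new capacity in the kernel log (bytes) should equal the
# array size that mdstat prints in 1 KiB blocks (487709696).
new_capacity_bytes=499414728704
new_capacity_kib=$((new_capacity_bytes / 1024))
echo "new capacity: ${new_capacity_kib} KiB"             # 487709696 KiB
```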

root@fs:/# mdadm --remove /dev/md1 /dev/sda2
mdadm: hot remove failed for /dev/sda2: Device or resource busy

root@fs:/# mdadm --manage /dev/md1 --force --remove /dev/sda2
mdadm: hot remove failed for /dev/sda2: Device or resource busy

root@fs:/var/log# ls -l /boot/backup.md 
-rw------- 1 root root 3146240 Apr 17 22:38 /boot/backup.md

root@fs:/var/log# hexdump /boot/backup.md 
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0300200
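
The hexdump above collapses the whole file into one repeated line, i.e. the backup file contains nothing but zero bytes (the final offset 0300200 is hexadecimal for the 3146240 bytes shown by ls).  A minimal sketch of the same check without reading a hex dump follows; the helper name is_all_zero is hypothetical, not part of mdadm.

```shell
#!/bin/sh
# is_all_zero FILE: succeed if FILE contains only NUL bytes (or is empty).
is_all_zero() {
    # Delete every NUL byte and count what is left over.
    [ "$(tr -d '\000' < "$1" | wc -c)" -eq 0 ]
}

# Illustrative usage; the default path is the one from the listing above.
backup=${1:-/boot/backup.md}
if [ -r "$backup" ]; then
    if is_all_zero "$backup"; then
        echo "$backup: only zero bytes"
    else
        echo "$backup: contains non-zero data"
    fi
fi
```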

root@fs:/# mdadm --detail /dev/md1
/dev/md1:
        Version : 1.2
  Creation Time : Fri Feb 10 21:45:46 2012
     Raid Level : raid5
     Array Size : 487709696 (465.12 GiB 499.41 GB)
  Used Dev Size : 243854848 (232.56 GiB 249.71 GB)
   Raid Devices : 3
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Thu Apr 18 21:37:48 2013
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

 Reshape Status : 99% complete
  Delta Devices : -1, (3->2)

           Name : fs:1  (local to host fs)
           UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
         Events : 33773764

    Number   Major   Minor   RaidDevice State
       0       8        2        0      faulty spare rebuilding   /dev/sda2
       4       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2

       3       8       50        3      active sync   /dev/sdd2

/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
           Name : fs:1  (local to host fs)
  Creation Time : Fri Feb 10 21:45:46 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
     Array Size : 975419392 (465.12 GiB 499.41 GB)
  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 13cefd7d:7bb42450:c229d326:a41b9ba7

  Reshape pos'n : 2048
  Delta Devices : -1 (4->3)

    Update Time : Fri Apr 19 04:22:40 2013
       Checksum : 2f033b35 - correct
         Events : 33786736

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : .AAA ('A' == active, '.' == missing)


* Re: Reshape Shrink Hung Again
  2013-04-19  8:29 Reshape Shrink Hung Again Sam Bingner
@ 2013-04-21  8:26 ` Sam Bingner
  2013-04-21 17:38   ` Phil Turmel
  2013-04-21 21:24 ` NeilBrown
  1 sibling, 1 reply; 9+ messages in thread
From: Sam Bingner @ 2013-04-21  8:26 UTC (permalink / raw)
  To: linux-raid@vger.kernel.org


On Apr 18, 2013, at 10:29 PM, Sam Bingner <sam@bingner.com> wrote:

> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
> 
> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
> 
> Sam
> 
> root@fs:/var/log# uname -a
> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
> 
> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
> 
> 
> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
>      487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
>      [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
> 
> 
> root@fs:/# mdadm --remove /dev/md1 /dev/sda2
> mdadm: hot remove failed for /dev/sda2: Device or resource busy
> 
> root@fs:/# mdadm --manage /dev/md1 --force --remove /dev/sda2
> mdadm: hot remove failed for /dev/sda2: Device or resource busy
> 
> root@fs:/var/log# ls -l /boot/backup.md 
> -rw------- 1 root root 3146240 Apr 17 22:38 /boot/backup.md
> 
> root@fs:/var/log# hexdump /boot/backup.md 
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0300200
> 
> 
> root@fs:/# mdadm --detail /dev/md1
> /dev/md1:
>        Version : 1.2
>  Creation Time : Fri Feb 10 21:45:46 2012
>     Raid Level : raid5
>     Array Size : 487709696 (465.12 GiB 499.41 GB)
>  Used Dev Size : 243854848 (232.56 GiB 249.71 GB)
>   Raid Devices : 3
>  Total Devices : 4
>    Persistence : Superblock is persistent
> 
>    Update Time : Thu Apr 18 21:37:48 2013
>          State : clean, degraded, recovering
> Active Devices : 3
> Working Devices : 3
> Failed Devices : 1
>  Spare Devices : 0
> 
>         Layout : left-symmetric
>     Chunk Size : 512K
> 
> Reshape Status : 99% complete
>  Delta Devices : -1, (3->2)
> 
>           Name : fs:1  (local to host fs)
>           UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
>         Events : 33773764
> 
>    Number   Major   Minor   RaidDevice State
>       0       8        2        0      faulty spare rebuilding   /dev/sda2
>       4       8       18        1      active sync   /dev/sdb2
>       2       8       34        2      active sync   /dev/sdc2
> 
>       3       8       50        3      active sync   /dev/sdd2
> 
> 
> /dev/sdd2:
>          Magic : a92b4efc
>        Version : 1.2
>    Feature Map : 0x4
>     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
>           Name : fs:1  (local to host fs)
>  Creation Time : Fri Feb 10 21:45:46 2012
>     Raid Level : raid5
>   Raid Devices : 3
> 
> Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
>     Array Size : 975419392 (465.12 GiB 499.41 GB)
>  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
>    Data Offset : 2048 sectors
>   Super Offset : 8 sectors
>          State : clean
>    Device UUID : 13cefd7d:7bb42450:c229d326:a41b9ba7
> 
>  Reshape pos'n : 2048
>  Delta Devices : -1 (4->3)
> 
>    Update Time : Fri Apr 19 04:22:40 2013
>       Checksum : 2f033b35 - correct
>         Events : 33786736
> 
>         Layout : left-symmetric
>     Chunk Size : 512K
> 
>   Device Role : Active device 3
>   Array State : .AAA ('A' == active, '.' == missing)
> 


Am I doing something wrong in these emails?  I've yet to have a reply from anybody related to this issue... should I submit it to a bugtracker somewhere instead?  Do I need to provide some different format for my email?  Is there a specific type of goat I should sacrifice?

v/r
Sam


* Re: Reshape Shrink Hung Again
  2013-04-21  8:26 ` Sam Bingner
@ 2013-04-21 17:38   ` Phil Turmel
  0 siblings, 0 replies; 9+ messages in thread
From: Phil Turmel @ 2013-04-21 17:38 UTC (permalink / raw)
  To: Sam Bingner; +Cc: linux-raid@vger.kernel.org

On 04/21/2013 04:26 AM, Sam Bingner wrote:
> Am I doing something wrong in these emails?  I've yet to have a reply
> from anybody related to this issue... should I submit it to a
> bugtracker somewhere instead?  Do I need to provide some different
> format for my email?  Is there a specific type of goat I should
> sacrifice?

I don't think you're doing anything wrong--you just have an uncommon
problem, and the options are non-obvious.  It sounds vaguely like a
known kernel bug, but I can't call it to mind.  You are running a rather
old kernel, so the pool of people who can help with it is certainly
shrinking.

Suggestions:

1) Try a recent kernel on a livecd that supports mdadm, something like
"SystemRescueCD" (my favorite).  Allow that newer kernel to assemble
what it can.  If that gives you accessible data, you could then re-add
that failed device.  If that works, you can go back to your runtime
environment, with a now-working array, and plan a more leisurely system
upgrade.

2) Abandon the 1/2MB of unrebuilt data on the array by recreating
the array with "mdadm --create --assume-clean", using "missing" in place
of sda2.  You have good dumps of your metadata, so that is relatively
low-risk.  (Do verify that the chunk and data offset of the
newly-created array match the originals before you allow anything to
write to the array!  "fsck -n" is your friend.)  Then you can add sda2
and let it rebuild.

Given that I've never had this problem, I'm not 100% confident in this
advice.  You might want to wait a bit for folk with more 2.6.32
experience to pipe up.

Phil
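
For readers who want to see what suggestion 2 might look like on the command line, here is a heavily hedged sketch.  The level, device count, chunk size, layout and metadata version are copied from the dumps earlier in the thread; the slot order (sda2 in slot 0, replaced by "missing") is an assumption read off the --detail table, and running fsck directly on /dev/md1 assumes the filesystem sits on the array itself.  The script only prints the commands unless DRY_RUN=0 is set, and the function names are hypothetical.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 only after checking every value against your own metadata.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

recreate_degraded() {
    # The hung array has to be stopped before it can be re-created.
    run mdadm --stop /dev/md1
    # Assumption: slot 0 was sda2 (now "missing"), slots 1 and 2 are sdb2 and sdc2.
    run mdadm --create /dev/md1 --assume-clean \
        --level=5 --raid-devices=3 --metadata=1.2 \
        --chunk=512 --layout=left-symmetric \
        missing /dev/sdb2 /dev/sdc2
    # Compare chunk size and data offset with the saved dumps before any write.
    run mdadm --examine /dev/sdb2
    # Read-only filesystem check, as suggested above.
    run fsck -n /dev/md1
}

recreate_degraded
```

If the data offset reported by --examine differs from the original 2048 sectors, the re-created array does not line up with the old data and nothing should be allowed to write to it.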


* Re: Reshape Shrink Hung Again
  2013-04-19  8:29 Reshape Shrink Hung Again Sam Bingner
  2013-04-21  8:26 ` Sam Bingner
@ 2013-04-21 21:24 ` NeilBrown
  2013-05-01  2:00   ` Sam Bingner
  1 sibling, 1 reply; 9+ messages in thread
From: NeilBrown @ 2013-04-21 21:24 UTC (permalink / raw)
  To: Sam Bingner; +Cc: linux-raid@vger.kernel.org


On Fri, 19 Apr 2013 08:29:37 +0000 Sam Bingner <sam@bingner.com> wrote:

> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
> 
> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
> 
> Sam
> 
> root@fs:/var/log# uname -a
> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
> 
> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
> 
> 
> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
>       487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
>       [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
> 

Looks like a bug - probably in mdadm.
mdadm needs to help the reshape over the last little bit, and md is probably
waiting for it to do that.  This will be the only time in the whole process
when the backup file is used.

I would try stopping the array and re-assembling it.  That might require a
reboot.  If that doesn't fix it, let me know and I'll prioritise this.
Otherwise - I've put it on my to-do list.  I'll try to reproduce and fix it
in due course.

Thanks for the report,
NeilBrown
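
A minimal sketch of the stop-and-reassemble step suggested above.  The message does not spell out the commands, so the array name and backup-file path are taken from the original report and should be treated as assumptions; the script prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

restart_array() {
    # Stop the hung array; if this reports "busy", a reboot may be needed.
    run mdadm --stop /dev/md1
    # Re-assemble, pointing mdadm at the backup file used for the reshape.
    run mdadm --assemble /dev/md1 --backup-file=/boot/backup.md
    # Check whether the reshape resumed or finished.
    run cat /proc/mdstat
}

restart_array
```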



* Re: Reshape Shrink Hung Again
  2013-04-21 21:24 ` NeilBrown
@ 2013-05-01  2:00   ` Sam Bingner
  2013-05-06  5:29     ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Sam Bingner @ 2013-05-01  2:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid@vger.kernel.org

On Apr 21, 2013, at 11:24 AM, NeilBrown <neilb@suse.de> wrote:

> On Fri, 19 Apr 2013 08:29:37 +0000 Sam Bingner <sam@bingner.com> wrote:
> 
>> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
>> 
>> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
>> 
>> Sam
>> 
>> root@fs:/var/log# uname -a
>> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
>> 
>> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
>> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
>> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
>> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
>> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
>> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
>> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
>> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
>> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
>> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
>> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
>> 
>> 
>> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
>>      487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
>>      [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
>> 
> 
> Looks like a bug - probably in mdadm.
> mdadm needs to help the reshape over the last little bit, and md is probably
> waiting for it to do that.  This will be the only time in the whole process
> when the backup file is used.
> 
> I would try stopping the array and re-assembling it.  That might require a
> reboot.  If that doesn't fix it, let me know and I'll prioritise this.
> Otherwise - I've put it on my to-do list.  I'll try to reproduce and fix it
> in due course.
> 
> Thanks for the report,
> NeilBrown

Sorry for the delay in responding; the server was at a remote location and didn't have a remote console.  My attempt to make an initrd that provided me SSH failed for unknown reasons (it works now that I've got physical access to the server).  Based on the results below, it looks like the drive that did drop out was pretty much at the very end and I really don't think it was related to the error.  I can leave the system in this state and give you access to it if you would like to take a look.  This system was in the process of being decommissioned and soon after the failure the replacement came in.  This same error happened to me twice, but I also did another reshape where it didn't happen.  I can play with this system and try to duplicate it again also.  As I said, I'll be happy to do anything to help find the source of this.

In any case, here is what happened from initramfs:

 # /sbin/mdadm --assemble /dev/md1
mdadm: Failed to restore critical section for reshape, sorry.
      Possibly you needed to specify the --backup-file

# /sbin/mdadm --assemble /dev/md1 --backup-file=/boot/backup.md
mdadm: Failed to restore critical section for reshape, sorry.

# /sbin/mdadm -V
mdadm - v3.1.4 - 31st August 2010

I saw that the mdadm version was out of date, so I got the newest one and compiled it:

# ./mdadm.static -V
mdadm - v3.2.6 - 25th October 2012

# ./mdadm.static --assemble /dev/md1 --backup-file=/boot/backup.md
mdadm: Failed to restore critical section for reshape, sorry.

/boot # ./mdadm.static  -E  /dev/sda2
/dev/sda2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
           Name : fs:1
  Creation Time : Sat Feb 11 02:45:46 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
     Array Size : 487709696 (465.12 GiB 499.41 GB)
  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : bc2c6c48:d81125bf:f767cb14:14ce323e

  Reshape pos'n : 13312 (13.00 MiB 13.63 MB)
  Delta Devices : -1 (4->3)

    Update Time : Thu Apr 18 11:49:51 2013
       Checksum : ecff7119 - correct
         Events : 33742236

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
/boot # ./mdadm.static  -E  /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
           Name : fs:1
  Creation Time : Sat Feb 11 02:45:46 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
     Array Size : 487709696 (465.12 GiB 499.41 GB)
  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e2f23785:e2cc299e:d03ee428:ced00761

  Reshape pos'n : 2048
  Delta Devices : -1 (4->3)

    Update Time : Mon Apr 22 03:01:24 2013
       Checksum : 3fc5d100 - correct
         Events : 33910936

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : .AAA ('A' == active, '.' == missing)
/boot # ./mdadm.static  -E  /dev/sdc2
/dev/sdc2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
           Name : fs:1
  Creation Time : Sat Feb 11 02:45:46 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
     Array Size : 487709696 (465.12 GiB 499.41 GB)
  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f43b602c:1f8e0fe1:37778958:fff328e8

  Reshape pos'n : 2048
  Delta Devices : -1 (4->3)

    Update Time : Mon Apr 22 03:01:24 2013
       Checksum : e09a36e5 - correct
         Events : 33910936

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : .AAA ('A' == active, '.' == missing)
/boot # ./mdadm.static  -E  /dev/sdd2
/dev/sdd2:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x4
     Array UUID : 9d7e8a08:030af4f8:e653c46c:af2c84fe
           Name : fs:1
  Creation Time : Sat Feb 11 02:45:46 2012
     Raid Level : raid5
   Raid Devices : 3

 Avail Dev Size : 487710720 (232.56 GiB 249.71 GB)
     Array Size : 487709696 (465.12 GiB 499.41 GB)
  Used Dev Size : 487709696 (232.56 GiB 249.71 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 13cefd7d:7bb42450:c229d326:a41b9ba7

  Reshape pos'n : 2048
  Delta Devices : -1 (4->3)

    Update Time : Mon Apr 22 03:01:24 2013
       Checksum : 2f08c991 - correct
         Events : 33910936

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : .AAA ('A' == active, '.' == missing)
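
The four dumps above differ mainly in three fields: the event count, the reshape position and the array state.  The following sketch pulls those fields out of `mdadm --examine` output so the devices can be compared on one line each; summarize_examine is a hypothetical helper name, and the sample input is copied from the sda2 dump.

```shell
#!/bin/sh
# summarize_examine: read `mdadm --examine` output on stdin and print
# "<events> <reshape position> <array state>" on a single line.
summarize_examine() {
    awk -F' : ' '
        /^ *Events :/        { events = $2 }
        /^ *Reshape pos.n :/ { split($2, a, " "); pos = a[1] }
        /^ *Array State :/   { split($2, b, " "); state = b[1] }
        END { print events, pos, state }
    '
}

# Sample input: the three relevant lines from the sda2 dump above.
summarize_examine <<'EOF'
  Reshape pos'n : 13312 (13.00 MiB 13.63 MB)
         Events : 33742236
   Array State : AAAA ('A' == active, '.' == missing)
EOF
# prints: 33742236 13312 AAAA
```

On a live system the same function could be fed from `mdadm --examine /dev/sdX2` for each member.  In this thread, sda2 stands out with the older event count 33742236 and reshape position 13312, against 33910936 and 2048 on the other three devices.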
	




* Re: Reshape Shrink Hung Again
  2013-05-01  2:00   ` Sam Bingner
@ 2013-05-06  5:29     ` NeilBrown
  2013-05-06  6:36       ` Sam Bingner
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2013-05-06  5:29 UTC (permalink / raw)
  To: Sam Bingner; +Cc: linux-raid@vger.kernel.org


On Wed, 1 May 2013 02:00:30 +0000 Sam Bingner <sam@bingner.com> wrote:

> On Apr 21, 2013, at 11:24 AM, NeilBrown <neilb@suse.de> wrote:
> 
> > On Fri, 19 Apr 2013 08:29:37 +0000 Sam Bingner <sam@bingner.com> wrote:
> > 
> >> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
> >> 
> >> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
> >> 
> >> Sam
> >> 
> >> root@fs:/var/log# uname -a
> >> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
> >> 
> >> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
> >> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
> >> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> >> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
> >> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
> >> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
> >> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
> >> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
> >> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
> >> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
> >> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
> >> 
> >> 
> >> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
> >>      487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
> >>      [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
> >> 
> > 
> > Looks like a bug - probably in mdadm.
> > mdadm needs to help the reshape over the last little bit, and md is probably
> > waiting for it to do that.  This will be the only time in the whole process
> > when the backup file is used.
> > 
> > I would try stopping the array and re-assembling it.  That might require a
> > reboot.  If that doesn't fix it, let me know and I'll prioritise this.
> > Otherwise - I've put it on my to-do list.  I'll try to reproduce and fix it
> > in due course.
> > 
> > Thanks for the report,
> > NeilBrown
> 
> Sorry for the delay in responding, the server was at a remote location and didn't have a remote console.  My attempt to make an initrd that provided me SSH failed for unknown reasons (it works now that I've got physical access to the server).  Based on the results below, it looks like the drive that did drop out was pretty much at the very end and I really don't think it was related to the error.  I can leave the system in this state and get you access to it to see if you desire.  This system was in the process of being decommissioned and soon after the failure the replacement came in.  This same error happened to me twice, but I also did another reshape where it didn't happen.  I can play with this system and try to duplicate it again also.  As I said, I'll be happy to do anything to help find the source of this. 
> 
> In any case, here is what happened from initramfs:

Thanks.
It looks like sda2 (first device in array) failed shortly after Thu Apr 18
11:49:51 2013 when there was still 13MB to be reshaped.
Then the reshape froze with only 2MB to go.  Don't know why yet.

Could you retry the assemble command with --verbose added?
i.e.
  mdadm.static --assemble /dev/md1 --backup-file=/boot/backup.md --verbose

Then
   export MDADM_GROW_ALLOW_OLD=1
and try again.
If that doesn't start the array, try adding
   --invalid-backup

and report the results.

Thanks.

NeilBrown
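
The three attempts requested above can be sketched as one escalating sequence.  This is an illustration, not a tested recovery procedure: it assumes the array is /dev/md1 as in the rest of the thread, that the mdadm binary is on the PATH (override MDADM to use ./mdadm.static), and the function names are hypothetical.  In dry-run mode each attempt is printed and treated as "not started", so all three rungs are shown.

```shell
#!/bin/sh
DRY_RUN=${DRY_RUN:-1}
MDADM=${MDADM:-mdadm}
ARRAY=/dev/md1
BACKUP=/boot/backup.md

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
        return 1    # dry run: report "not started" so every rung is printed
    fi
    "$@"
}

# One verbose assemble attempt; extra options are passed through.
try_assemble() {
    run "$MDADM" --assemble "$ARRAY" --backup-file="$BACKUP" --verbose "$@"
}

assemble_ladder() {
    # 1) Plain verbose assemble.
    try_assemble && return 0
    # 2) Same command with MDADM_GROW_ALLOW_OLD=1 in the environment.
    printf '+ export MDADM_GROW_ALLOW_OLD=1\n'
    export MDADM_GROW_ALLOW_OLD=1
    try_assemble && return 0
    # 3) Last resort: additionally tell mdadm the backup may be invalid.
    try_assemble --invalid-backup
}

assemble_ladder || echo "array not started (dry run, or every attempt failed)"
```

Keeping the verbose output of each attempt is the point of the exercise, since that output is what was asked for in the report.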




* Re: Reshape Shrink Hung Again
  2013-05-06  5:29     ` NeilBrown
@ 2013-05-06  6:36       ` Sam Bingner
  2013-05-09  6:16         ` NeilBrown
  0 siblings, 1 reply; 9+ messages in thread
From: Sam Bingner @ 2013-05-06  6:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid@vger.kernel.org

On May 5, 2013, at 7:29 PM, NeilBrown <neilb@suse.de> wrote:

> On Wed, 1 May 2013 02:00:30 +0000 Sam Bingner <sam@bingner.com> wrote:
> 
>> On Apr 21, 2013, at 11:24 AM, NeilBrown <neilb@suse.de> wrote:
>> 
>>> On Fri, 19 Apr 2013 08:29:37 +0000 Sam Bingner <sam@bingner.com> wrote:
>>> 
>>>> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
>>>> 
>>>> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
>>>> 
>>>> Sam
>>>> 
>>>> root@fs:/var/log# uname -a
>>>> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
>>>> 
>>>> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
>>>> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
>>>> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
>>>> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>>>> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
>>>> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
>>>> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
>>>> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
>>>> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
>>>> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
>>>> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
>>>> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>>>> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
>>>> 
>>>> 
>>>> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
>>>>     487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
>>>>     [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
>>>> 
>>> 
>>> Looks like a bug - probably in mdadm.
>>> mdadm needs to help the reshape over the last little bit, and md is probably
>>> waiting for it to do that.  This will be the only time in the whole process
>>> when the backup file is used.
>>> 
>>> I would try stopping the array and re-assembling it.  That might require a
>>> reboot.  If that doesn't fix it, let me know and I'll prioritise this.
>>> Otherwise - I've put it on my to-do list.  I'll try to reproduce and fix it
>>> in due course.
>>> 
>>> Thanks for the report,
>>> NeilBrown
>> 
>> Sorry for the delay in responding, the server was at a remote location and didn't have a remote console.  My attempt to make an initrd that provided me SSH failed for unknown reasons (it works now that I've got physical access to the server).  Based on the results below, it looks like the drive that did drop out was pretty much at the very end and I really don't think it was related to the error.  I can leave the system in this state and get you access to it to see if you desire.  This system was in the process of being decommissioned and soon after the failure the replacement came in.  This same error happened to me twice, but I also did another reshape where it didn't happen.  I can play with this system and try to duplicate it again also.  As I said, I'll be happy to do anything to help find the source of this. 
>> 
>> In any case, here is what happened from initramfs:
> 
> Thanks.
> It looks like sda2 (first device in array) failed shortly after Thu Apr 18
> 11:49:51 2013 when there was still 13MB to be reshaped.
> Then the reshape froze with only 2MB to go.  Don't know why yet.
> 
> Could you retry the assemble command with --verbose added?
> i.e.
>  mdadm.static --assemble /dev/md0 --backup-file=/boot/backup.md --verbose
> 
> Then 
>   export MDADM_GROW_ALLOW_OLD=1
> and try again.
> If that doesn't start the array, try adding
>   --invalid-backup
> 
> and report the results.
> 
> Thanks.
> 
> NeilBrown
> 

The backup.md file is all zeroes (and seems to have always been, since mdadm only uses it at the end?).  The --invalid-backup option seemed to work, but then the question is why mdadm was expecting valid data if it never put anything there to start with.  It seems to have happily backed up the 3MB after being told to accept an invalid backup...

I'm thinking it somehow froze originally trying to restore a backup that it never made?

Sam
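
A quick way to confirm that a backup file such as /boot/backup.md really holds nothing but zero bytes is sketched below. This is an illustrative helper and not part of mdadm; the function name `is_all_zeroes` and the chunk size are assumptions made for the example.

```python
def is_all_zeroes(path, chunk_size=1 << 20):
    """Return True if every byte in the file is 0x00.

    Hypothetical helper for illustration. An empty file also
    counts as "all zeroes" here, since it has no non-zero byte.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            if any(chunk):
                # Iterating bytes yields ints; any() is True on the
                # first non-zero byte.
                return False
```

Reading in chunks keeps memory use bounded even if the backup file is large. A result of True only says the file carries no data; it does not say why mdadm never wrote to it.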

# ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /boot/backup.md
mdadm: Failed to find backup of critical section
mdadm: Failed to restore critical section for reshape, sorry.

# export MDADM_GROW_ALLOW_OLD=1
# ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md 
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /boot/backup.md
mdadm: Failed to find backup of critical section
mdadm: Failed to restore critical section for reshape, sorry.

# ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md --invalid-backup
mdadm: looking for devices for /dev/md1
mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
mdadm: No backup metadata on /boot/backup.md
mdadm: Failed to find backup of critical section
mdadm: continuing without restoring backup
mdadm: added /dev/sda2 to /dev/md1 as 0 (possibly out of date)
mdadm: added /dev/sdc2 to /dev/md1 as 2
mdadm: added /dev/sdd2 to /dev/md1 as 3
mdadm: added /dev/sdb2 to /dev/md1 as 1
mdadm: Need to backup 3072K of critical section..
mdadm: /dev/md1 has been started with 3 drives (out of 4).
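
The three attempts above escalate in a fixed order: a plain assemble, the same command with MDADM_GROW_ALLOW_OLD=1 exported, and finally the same command with --invalid-backup added (the variable was still exported in that shell). A minimal sketch of that order follows. The flags and the environment variable are copied from the transcript; the function names, the `run` hook, and the assumption that mdadm exits non-zero when assembly fails are illustrative only.

```python
import os
import subprocess

def assembly_attempts(array, backup_file, mdadm="./mdadm.static"):
    """Yield (argv, extra_env) pairs in the order tried in this thread."""
    base = [mdadm, "--assemble", "--force", "--verbose", array,
            f"--backup-file={backup_file}"]
    # 1. Plain assemble with the backup file.
    yield base, {}
    # 2. Same command, with MDADM_GROW_ALLOW_OLD=1 in the environment.
    yield base, {"MDADM_GROW_ALLOW_OLD": "1"}
    # 3. Last resort: continue without restoring the backup.
    yield base + ["--invalid-backup"], {"MDADM_GROW_ALLOW_OLD": "1"}

def assemble(array, backup_file, run=None):
    """Try each attempt in turn; return the argv that succeeded, or None."""
    if run is None:
        def run(argv, extra_env):
            # Assumption: a non-zero exit status means assembly failed.
            env = dict(os.environ, **extra_env)
            return subprocess.run(argv, env=env).returncode
    for argv, extra_env in assembly_attempts(array, backup_file):
        if run(argv, extra_env) == 0:
            return argv
    return None
```

Passing a custom `run` callable makes it possible to log or dry-run the sequence without touching a real array. Since --invalid-backup skips restoring the critical section, it stays last in the list.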

# cat /proc/mdstat 
Personalities : [raid1] [raid6] [raid5] [raid4] 
md1 : active raid5 sdb2[4] sdd2[3] sdc2[2]
      487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
      [>....................]  recovery =  0.0% (167432/243854848) finish=121.2min speed=33486K/sec
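
The progress lines in /proc/mdstat carry enough information to work out how much was left when the reshape hung. The sketch below parses the `(done/total)` pair and checks it against the array size shown above. It assumes the counts are per-device 1K blocks and that a 3-device RAID5 has 2 data devices; the regular expression and function name are illustrative, not an mdadm interface.

```python
import re

# Matches e.g. "reshape = 99.9% (243853824/243854848)".
PROGRESS = re.compile(
    r"(reshape|recovery|resync)\s*=\s*([\d.]+)%\s*\((\d+)/(\d+)\)")

def parse_progress(line):
    """Extract action, done, total and remaining blocks from an mdstat line."""
    m = PROGRESS.search(line)
    if m is None:
        return None
    done, total = int(m.group(3)), int(m.group(4))
    return {
        "action": m.group(1),
        "done": done,
        "total": total,
        "remaining": total - done,
        "percent": 100.0 * done / total,
    }

# The hung reshape line reported earlier in this thread.
hung = parse_progress(
    "[===================>.]  reshape = 99.9% "
    "(243853824/243854848) finish=343.6min speed=0K/sec")

DATA_DEVICES = 2  # assumption: 3-device RAID5, one device's worth of parity

# 1024 blocks per device were left, i.e. 2048K of array data.
remaining_array_kib = hung["remaining"] * DATA_DEVICES

# Cross-check: per-device size times data devices gives the
# "487709696 blocks" array size shown in /proc/mdstat.
array_blocks = hung["total"] * DATA_DEVICES
```

Under these assumptions `remaining_array_kib` comes to 2048, which is consistent with the "only 2MB to go" estimate quoted earlier, and `array_blocks` equals 487709696.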

* Re: Reshape Shrink Hung Again
  2013-05-06  6:36       ` Sam Bingner
@ 2013-05-09  6:16         ` NeilBrown
  2013-05-09  6:58           ` Sam Bingner
  0 siblings, 1 reply; 9+ messages in thread
From: NeilBrown @ 2013-05-09  6:16 UTC (permalink / raw)
  To: Sam Bingner; +Cc: linux-raid@vger.kernel.org

On Mon, 6 May 2013 06:36:29 +0000 Sam Bingner <sam@bingner.com> wrote:

> On May 5, 2013, at 7:29 PM, NeilBrown <neilb@suse.de> wrote:
> 
> > On Wed, 1 May 2013 02:00:30 +0000 Sam Bingner <sam@bingner.com> wrote:
> > 
> >> On Apr 21, 2013, at 11:24 AM, NeilBrown <neilb@suse.de> wrote:
> >> 
> >>> On Fri, 19 Apr 2013 08:29:37 +0000 Sam Bingner <sam@bingner.com> wrote:
> >>> 
> >>>> I'll start this off by saying that no data is in jeopardy, but I would like to track down the cause of this problem and fix it.  I originally thought it must have been due to the incorrect backup-file size with a raid array shrunk to smaller than the final size when it happened to me last time but this time this was not the case.
> >>>> 
> >>>> I initiated a shrink from a 4-drive RAID5 to a 3-drive RAID5, this shrink had no problems except that a drive failed right at the end of the reshape... then it hung at 99.9% and does not allow me to remove the failed drive from the array because it is "rebuilding".  I am not sure if the drive failed at the end, or if it was after it had gotten to 99.9% because I didn't see this until the next morning as it ran overnight.
> >>>> 
> >>>> Sam
> >>>> 
> >>>> root@fs:/var/log# uname -a
> >>>> Linux fs 2.6.32-5-686 #1 SMP Mon Jan 16 16:04:25 UTC 2012 i686 GNU/Linux
> >>>> 
> >>>> Apr 17 22:37:41 fs kernel: [25860779.639762] md1: detected capacity change from 749122093056 to 499414728704
> >>>> Apr 17 22:38:40 fs kernel: [25860837.912441] md: reshape of RAID array md1
> >>>> Apr 17 22:38:40 fs kernel: [25860837.912447] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
> >>>> Apr 17 22:38:40 fs kernel: [25860837.912452] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >>>> Apr 17 22:38:40 fs kernel: [25860837.912459] md: using 128k window, over a total of 243854848 blocks.
> >>>> Apr 18 07:51:09 fs kernel: [25893987.273813] raid5: Disk failure on sda2, disabling device.
> >>>> Apr 18 07:51:09 fs kernel: [25893987.273815] raid5: Operation continuing on 2 devices.
> >>>> Apr 18 07:51:09 fs kernel: [25893987.287168] md: super_written gets error=-5, uptodate=0
> >>>> Apr 18 07:51:10 fs kernel: [25893987.657039] md: md1: reshape done.
> >>>> Apr 18 07:51:10 fs kernel: [25893987.781599] md: reshape of RAID array md1
> >>>> Apr 18 07:51:10 fs kernel: [25893987.781607] md: minimum _guaranteed_  speed: 100 KB/sec/disk.
> >>>> Apr 18 07:51:10 fs kernel: [25893987.781613] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
> >>>> Apr 18 07:51:10 fs kernel: [25893987.781620] md: using 128k window, over a total of 243854848 blocks.
> >>>> 
> >>>> 
> >>>> md1 : active raid5 sdd2[3] sda2[0](F) sdc2[2] sdb2[4]
> >>>>     487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
> >>>>     [===================>.]  reshape = 99.9% (243853824/243854848) finish=343.6min speed=0K/sec
> >>>> 
> >>> 
> >>> Looks like a bug - probably in mdadm.
> >>> mdadm needs to help the reshape over the last little bit, and md is probably
> >>> waiting for it to do that.  This will be the only time in the whole process
> >>> when the backup file is used.
> >>> 
> >>> I would try stopping the array and re-assembling it.  That might require a
> >>> reboot.  If that doesn't fix it, let me know and I'll prioritise this.
> >>> Otherwise - I've put it on my to-do list.  I'll try to reproduce and fix it
> >>> in due course.
> >>> 
> >>> Thanks for the report,
> >>> NeilBrown
> >> 
> >> Sorry for the delay in responding, the server was at a remote location and didn't have a remote console.  My attempt to make an initrd that provided me SSH failed for unknown reasons (it works now that I've got physical access to the server).  Based on the results below, it looks like the drive that did drop out was pretty much at the very end and I really don't think it was related to the error.  I can leave the system in this state and get you access to it to see if you desire.  This system was in the process of being decommissioned and soon after the failure the replacement came in.  This same error happened to me twice, but I also did another reshape where it didn't happen.  I can play with this system and try to duplicate it again also.  As I said, I'll be happy to do anything to help find the source of this. 
> >> 
> >> In any case, here is what happened from initramfs:
> > 
> > Thanks.
> > It looks like sda2 (first device in array) failed shortly after Thu Apr 18
> > 11:49:51 2013 when there was still 13MB to be reshaped.
> > Then the reshape froze with only 2MB to go.  Don't know why yet.
> > 
> > Could you retry the assemble command with --verbose added?
> > i.e.
> >  mdadm.static --assemble /dev/md0 --backup-file=/boot/backup.md --verbose
> > 
> > Then 
> >   export MDADM_GROW_ALLOW_OLD=1
> > and try again.
> > If that doesn't start the array, try adding
> >   --invalid-backup
> > 
> > and report the results.
> > 
> > Thanks.
> > 
> > NeilBrown
> > 
> 
> The backup.md file is all zeroes (and seems to have always been, since mdadm only uses it at the end?).  The --invalid-backup option seemed to work, but then the question is why mdadm was expecting valid data if it never put anything there to start with.  It seems to have happily backed up the 3MB after being told to accept an invalid backup...
> 
> I'm thinking it somehow froze originally trying to restore a backup that it never made?
> 
> Sam
> 
> # ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md
> mdadm: looking for devices for /dev/md1
> mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
> mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
> mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
> mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
> mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
> mdadm: No backup metadata on /boot/backup.md
> mdadm: Failed to find backup of critical section
> mdadm: Failed to restore critical section for reshape, sorry.
> 
> # export MDADM_GROW_ALLOW_OLD=1
> # ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md 
> mdadm: looking for devices for /dev/md1
> mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
> mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
> mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
> mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
> mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
> mdadm: No backup metadata on /boot/backup.md
> mdadm: Failed to find backup of critical section
> mdadm: Failed to restore critical section for reshape, sorry.
> 
> # ./mdadm.static --assemble --force --verbose /dev/md1 --backup-file=/boot/backup.md --invalid-backup
> mdadm: looking for devices for /dev/md1
> mdadm: /dev/sdd2 is identified as a member of /dev/md1, slot 3.
> mdadm: /dev/sdc2 is identified as a member of /dev/md1, slot 2.
> mdadm: /dev/sdb2 is identified as a member of /dev/md1, slot 1.
> mdadm: /dev/sda2 is identified as a member of /dev/md1, slot 0.
> mdadm:/dev/md1 has an active reshape - checking if critical section needs to be restored
> mdadm: No backup metadata on /boot/backup.md
> mdadm: Failed to find backup of critical section
> mdadm: continuing without restoring backup
> mdadm: added /dev/sda2 to /dev/md1 as 0 (possibly out of date)
> mdadm: added /dev/sdc2 to /dev/md1 as 2
> mdadm: added /dev/sdd2 to /dev/md1 as 3
> mdadm: added /dev/sdb2 to /dev/md1 as 1
> mdadm: Need to backup 3072K of critical section..
> mdadm: /dev/md1 has been started with 3 drives (out of 4).
> 
> # cat /proc/mdstat 
> Personalities : [raid1] [raid6] [raid5] [raid4] 
> md1 : active raid5 sdb2[4] sdd2[3] sdc2[2]
>       487709696 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [_UU]
>       [>....................]  recovery =  0.0% (167432/243854848) finish=121.2min speed=33486K/sec


The backup.md file really should not have been empty.
Assuming you are sure it was the right file, something must have gone wrong.
That is not overly surprising, as it is hard to test that particular stage of reshape:
an interruption to the reshape very late in a shrink could get messy.

I've made a note to try testing it when I next spend some time on mdadm.

Thanks,
NeilBrown


* Re: Reshape Shrink Hung Again
  2013-05-09  6:16         ` NeilBrown
@ 2013-05-09  6:58           ` Sam Bingner
  0 siblings, 0 replies; 9+ messages in thread
From: Sam Bingner @ 2013-05-09  6:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid@vger.kernel.org

On May 8, 2013, at 8:16 PM, "NeilBrown" <neilb@suse.de> wrote:

> 
> The backup.md file really should not have been empty.
> Assuming you are sure it was the right file, something must have gone wrong.
> That is not overly surprising, as it is hard to test that particular stage of reshape:
> an interruption to the reshape very late in a shrink could get messy.
> 
> I've made a note to try testing it when I next spend some time on mdadm.
> 
> Thanks,
> NeilBrown
> 

That makes sense, then: both times I had this problem the backup file was empty, and there was nothing that could have caused it to have a problem.  I'll play with a couple of things and see if I can find where the issue was, too.

Thank you...
Sam


end of thread, other threads:[~2013-05-09  6:58 UTC | newest]

Thread overview: 9+ messages
2013-04-19  8:29 Reshape Shrink Hung Again Sam Bingner
2013-04-21  8:26 ` Sam Bingner
2013-04-21 17:38   ` Phil Turmel
2013-04-21 21:24 ` NeilBrown
2013-05-01  2:00   ` Sam Bingner
2013-05-06  5:29     ` NeilBrown
2013-05-06  6:36       ` Sam Bingner
2013-05-09  6:16         ` NeilBrown
2013-05-09  6:58           ` Sam Bingner
