* RAID5 - Disk failed during re-shape
From: Sam Clark @ 2012-08-09  8:38 UTC
To: linux-raid

Hi All,

Hoping you can help recover my data!

I have (had?) a software RAID 5 volume, created on Ubuntu 10.04 a few years
back consisting of 4 x 1500GB drives.  Was running great until the
motherboard died last week.  Purchased new motherboard, CPU & RAM,
installed Ubuntu 12.04, and got everything assembled fine, and working for
around 48 hours.

After that I added a 2000GB drive to increase capacity, and ran
mdadm --add /dev/md0 /dev/sdf.  The reshape started to run, and then at
around 11.4% of the reshape I saw that the server had some errors:

Aug  8 22:17:41 nas kernel: [ 5927.453434] Buffer I/O error on device md0, logical block 715013760
Aug  8 22:17:41 nas kernel: [ 5927.453439] EXT4-fs warning (device md0): ext4_end_bio:251: I/O error writing to inode 224003641 (offset 157810688 size 4096 starting block 715013760)
Aug  8 22:17:41 nas kernel: [ 5927.453448] JBD2: Detected IO errors while flushing file data on md0-8
Aug  8 22:17:41 nas kernel: [ 5927.453467] Aborting journal on device md0-8.
Aug  8 22:17:41 nas kernel: [ 5927.453642] Buffer I/O error on device md0, logical block 548962304
Aug  8 22:17:41 nas kernel: [ 5927.453643] lost page write due to I/O error on md0
Aug  8 22:17:41 nas kernel: [ 5927.453656] JBD2: I/O error detected when updating journal superblock for md0-8.
Aug  8 22:17:41 nas kernel: [ 5927.453688] Buffer I/O error on device md0, logical block 0
Aug  8 22:17:41 nas kernel: [ 5927.453690] lost page write due to I/O error on md0
Aug  8 22:17:41 nas kernel: [ 5927.453697] EXT4-fs error (device md0): ext4_journal_start_sb:327: Detected aborted journal
Aug  8 22:17:41 nas kernel: [ 5927.453700] EXT4-fs (md0): Remounting filesystem read-only
Aug  8 22:17:41 nas kernel: [ 5927.453703] EXT4-fs (md0): previous I/O error to superblock detected
Aug  8 22:17:41 nas kernel: [ 5927.453826] Buffer I/O error on device md0, logical block 715013760
Aug  8 22:17:41 nas kernel: [ 5927.453828] lost page write due to I/O error on md0
Aug  8 22:17:41 nas kernel: [ 5927.453842] JBD2: Detected IO errors while flushing file data on md0-8
Aug  8 22:17:41 nas kernel: [ 5927.453848] Buffer I/O error on device md0, logical block 0
Aug  8 22:17:41 nas kernel: [ 5927.453850] lost page write due to I/O error on md0
Aug  8 22:20:54 nas kernel: [ 6120.964129] INFO: task md0_reshape:297 blocked for more than 120 seconds.

On checking /proc/mdstat, I found that 2 drives were listed as failed
(__UUU), and the finish time was simply growing by hundreds of minutes at a
time.

I was able to browse some data on the Raid set (incl my Home folder), but
couldn't browse some other sections - shell simply hung when I tried to
issue "ls /raidmount".  I tried to add one of the failed disks back in, but
got the response that there was no superblock on it.  I rebooted it at that
time.

During boot I was given the option to manually recover, or skip mounting - I
chose the latter.

Now that the system is running, I tried to assemble, but it keeps failing.
I have tried:

mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

I am able to see all the drives, but can see the UUID is incorrect and the
Raid Level states -unknown-, as below... does this mean the data can't be
recovered?
root@nas:/var/log$ mdadm --examine /dev/sd[b-f]
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
 Active Devices : 0
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e863 - correct
         Events : 1

      Number   Major   Minor   RaidDevice State
this     0       8       16        0      spare   /dev/sdb

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdc:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
 Active Devices : 0
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e87b - correct
         Events : 1

      Number   Major   Minor   RaidDevice State
this     4       8       32        4      spare   /dev/sdc

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdd:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
 Active Devices : 0
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e885 - correct
         Events : 1

      Number   Major   Minor   RaidDevice State
this     1       8       48        1      spare   /dev/sdd

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sde:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
 Active Devices : 0
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e899 - correct
         Events : 1

      Number   Major   Minor   RaidDevice State
this     3       8       64        3      spare   /dev/sde

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc
/dev/sdf:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 00000000:00000000:00000000:00000000
  Creation Time : Thu Aug  9 07:44:48 2012
     Raid Level : -unknown-
   Raid Devices : 0
  Total Devices : 5
Preferred Minor : 0

    Update Time : Thu Aug  9 07:45:10 2012
          State : active
 Active Devices : 0
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 5
       Checksum : a0b6e8a7 - correct
         Events : 1

      Number   Major   Minor   RaidDevice State
this     2       8       80        2      spare   /dev/sdf

   0     0       8       16        0      spare   /dev/sdb
   1     1       8       48        1      spare   /dev/sdd
   2     2       8       80        2      spare   /dev/sdf
   3     3       8       64        3      spare   /dev/sde
   4     4       8       32        4      spare   /dev/sdc

According to syslog, the only drive failure that I had was /dev/sde, but I
guess that the re-shape has caused this to go awry.
syslog.1:Aug  8 22:17:41 nas mdadm[1029]: Fail event detected on md device /dev/md0, component device /dev/sde

I tried removing the /etc/mdadm/mdadm.conf file and re-running the scan,
where I got:

root@nas:/var/log$ sudo mdadm --assemble --scan -f -vv
mdadm: looking for devices for further assembly
mdadm: cannot open device /dev/sda5: Device or resource busy
mdadm: no RAID superblock on /dev/sda2
mdadm: cannot open device /dev/sda1: Device or resource busy
mdadm: cannot open device /dev/sda: Device or resource busy
mdadm: /dev/sdf is identified as a member of /dev/md/0_0, slot 2.
mdadm: /dev/sde is identified as a member of /dev/md/0_0, slot 3.
mdadm: /dev/sdd is identified as a member of /dev/md/0_0, slot 1.
mdadm: /dev/sdc is identified as a member of /dev/md/0_0, slot 4.
mdadm: /dev/sdb is identified as a member of /dev/md/0_0, slot 0.
mdadm: failed to add /dev/sdd to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdf to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sde to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdc to /dev/md/0_0: Invalid argument
mdadm: failed to add /dev/sdb to /dev/md/0_0: Invalid argument
mdadm: /dev/md/0_0 assembled from -1 drives and 1 spare - not enough to start the array.
mdadm: looking for devices for further assembly
mdadm: No arrays found in config file or automatically

I guess the 'invalid argument' is the -unknown- in the raid level.. but it's
only a guess.

I'm at the extent of my knowledge - I would appreciate some expert
assistance in recovering this array, if it's possible!

Many thanks,
Sam
* Re: RAID5 - Disk failed during re-shape
From: Phil Turmel @ 2012-08-10 22:36 UTC
To: Sam Clark; +Cc: linux-raid, NeilBrown

Hi Sam,

On 08/09/2012 04:38 AM, Sam Clark wrote:
> Hi All,
>
> Hoping you can help recover my data!
>
> I have (had?) a software RAID 5 volume, created on Ubuntu 10.04 a few years
> back consisting of 4 x 1500GB drives.  Was running great until the
> motherboard died last week.  Purchased new motherboard, CPU & RAM,
> installed Ubuntu 12.04, and got everything assembled fine, and working for
> around 48 hours.

Uh-oh.  Stock 12.04 has a buggy kernel.  See here:

http://neil.brown.name/blog/20120615073245

> After that I added a 2000GB drive to increase capacity, and ran
> mdadm --add /dev/md0 /dev/sdf.  The reshape started to run, and then at
> around 11.4% of the reshape I saw that the server had some errors:

And you reshaped and got media errors ...

[trim /]

> On checking /proc/mdstat, I found that 2 drives were listed as failed
> (__UUU), and the finish time was simply growing by hundreds of minutes at a
> time.
>
> I was able to browse some data on the Raid set (incl my Home folder), but
> couldn't browse some other sections - shell simply hung when I tried to
> issue "ls /raidmount".  I tried to add one of the failed disks back in, but
> got the response that there was no superblock on it.  I rebooted it at that
> time.

Poof.  The bug wiped your active device's metadata.

> During boot I was given the option to manually recover, or skip mounting - I
> chose the latter.

Good instincts, but probably not any help.

> Now that the system is running, I tried to assemble, but it keeps failing.
> I have tried:
> mdadm --assemble --force /dev/md0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
> /dev/sdf
>
> I am able to see all the drives, but can see the UUID is incorrect and the
> Raid Level states -unknown-, as below... does this mean the data can't be
> recovered?

If you weren't in the middle of a reshape, you could recover using the
instructions in the blog entry above.

[trim /]

> I guess the 'invalid argument' is the -unknown- in the raid level.. but it's
> only a guess.
>
> I'm at the extent of my knowledge - I would appreciate some expert
> assistance in recovering this array, if it's possible!

I think you are toast, as I saw nothing in the metadata that would give
you a precise reshape restart position, even if you got Neil to work up
a custom mdadm that could use it.  The 11.4% could be converted into an
approximate restart position, perhaps.

Neil, is there any way to do some combination of "create --assume-clean",
start a reshape held at zero, then skip to 11.4%?

Phil
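Phil's "create --assume-clean" refers to re-creating the array's metadata in
place while leaving the data untouched.  For a non-reshaping 4-drive array,
the general shape of such a command is sketched below; the device order,
chunk size and metadata version shown are assumptions that would have to be
confirmed first, and rewriting superblocks this way is exactly the step that
is unsafe while a reshape is in flight.

# Illustrative only - re-creating 0.90 metadata in place for the original
# 4-drive array.  Device order and chunk size here are guesses, not values
# confirmed anywhere in this thread.  Do not run against a reshaping array.
mdadm --create /dev/md0 --assume-clean \
      --metadata=0.90 --level=5 --raid-devices=4 --chunk=64 \
      /dev/sdb /dev/sdd /dev/sde /dev/sdc

# Check the result read-only before trusting it:
fsck -n /dev/md0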
* Re: RAID5 - Disk failed during re-shape
From: Dmitrijs Ledkovs @ 2012-08-11  1:21 UTC
To: Phil Turmel; +Cc: Sam Clark, linux-raid, NeilBrown

On 10 August 2012 23:36, Phil Turmel <philip@turmel.org> wrote:

[trim /]

> Uh-oh.  Stock 12.04 has a buggy kernel.  See here:
>
> http://neil.brown.name/blog/20120615073245

Stock 12.04 does not have a buggy kernel.

The 12.04 -updates pocket had a buggy kernel at one point in time, but it
now has a fixed kernel.

Regards,
Dmitrijs.
* Re: RAID5 - Disk failed during re-shape
From: Sam Clark @ 2012-08-11  8:42 UTC
To: Phil Turmel; +Cc: linux-raid@vger.kernel.org, NeilBrown

Thanks for the response Phil.

I was thinking that 'toast' was the case, and have been looking into my
backups (not so great, though the critical data is fine).

Regards
Sam

On 11.08.2012, at 00:36, "Phil Turmel" <philip@turmel.org> wrote:

[trim /]
* Re: RAID5 - Disk failed during re-shape
From: NeilBrown @ 2012-08-12 23:35 UTC
To: Sam Clark; +Cc: Phil Turmel, linux-raid@vger.kernel.org

On Sat, 11 Aug 2012 10:42:01 +0200 Sam Clark <sclark_77@hotmail.com> wrote:

> Thanks for the response Phil.
>
> I was thinking that 'toast' was the case, and have been looking into my
> backups (not so great, though the critical data is fine).

If you've got backups, then that is likely the most reliable solution -
sorry.

> I think you are toast, as I saw nothing in the metadata that would give
> you a precise reshape restart position, even if you got Neil to work up
> a custom mdadm that could use it.  The 11.4% could be converted into an
> approximate restart position, perhaps.
>
> Neil, is there any way to do some combination of "create --assume-clean",
> start a reshape held at zero, then skip to 11.4%?

The metadata might contain the precise reshape position - but
"mdadm --examine" won't be displaying it.  If you can grab the last 128K
of the device (probably using 'dd' with 'skip=xxx') I would be able to
check and see.

If you can tell me:
   exactly how big the devices are (sectors)
   what the chunk size of the array was (probably 64K)
and get me that last 128K of a few devices, then I can provide you a
shell script that will poke around in sysfs and activate the array in
read-only mode, which might allow you to mount it and copy out any
important data.

After that you would need to re-create the array.  I don't think it is
really practical to get the array fully working again without re-creating
it from scratch once you have all the important data.

NeilBrown
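One way to produce the dump Neil asks for is sketched below; it assumes
512-byte sectors, and the output filenames are only examples (Sam's later
message shows he did essentially this with bs=1 and a hand-computed skip).

# Dump the last 128K of each RAID member, where the 0.90 superblock lives
# near the end of the device.  Assumes 512-byte sectors; filenames are
# examples only.
for dev in /dev/sd[b-f]; do
    sectors=$(blockdev --getsz "$dev")          # size in 512-byte sectors
    dd if="$dev" of="$(basename "$dev").last128k" \
       bs=512 skip=$((sectors - 256)) count=256
done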
* Re: RAID5 - Disk failed during re-shape
From: NeilBrown @ 2012-08-14  2:38 UTC
To: Sam Clark; +Cc: Phil Turmel, linux-raid

On Mon, 13 Aug 2012 18:14:30 +0200 "Sam Clark" <sclark_77@hotmail.com> wrote:

> Thanks Neil, really appreciate the assistance.
>
> Would love to give that a try - at least to catch the data that has changed
> since last backup, however I don't know the chunk size.  I created the array
> so long ago, and of course didn't document anything.  I would guess they are
> 64K, but not sure.  Is there any way to check from the disks themselves?
>
> I've captured the 128K chunks as follows - hope it's correct:
>
> I got the disk size in bytes from fdisk -l, and subtracted 131072.. then
> ran:
> sam@nas:~$ sudo dd if=/dev/sd[b-f] of=test.128k bs=1 skip=xxx count=128k
> The 5 files are attached.
>
> The disk sizes are as follows:
> sam@nas:~$ sudo blockdev --getsize /dev/sd[b-f]
> sdb: 2930277168
> sdc: 2930277168
> sdd: 2930277168
> sde: 2930277168
> sdf: 3907029168

Unfortunately the metadata doesn't contain any trace of the reshape
position, so we'll make do with 11.4%.

The following script will assemble the array read-only.  You can then try
"fsck -n /dev/md_restore" to see if it is credible.  Then try to mount it.

Most of the details I'm confident of.

'chunk' is probably right, but there is no way to know for sure until you
have access to your data.  If you try changing it you'll need to also
change 'reshape' to be an appropriate multiple of it.

'reshape' is approximately 11.4% of the array.  Maybe try other suitable
multiples.

'devs' is probably wrong.  I chose that order because the metadata seems
to suggest that order - yes, with sdf in the middle.  Maybe you know
better.  You can try different orders until it seems to work.

Everything else should be correct.  component_size is definitely correct,
I found that in the metadata.  'layout' is the default and is hardly ever
changed.

As it assembles read-only, there is no risk in getting it wrong, changing
some values and trying again.

The script disassembles any old array before creating the new one.

good luck.

NeilBrown

# Script to try to assemble a RAID5 which got its metadata corrupted
# in the middle of a reshape (ouch).
# We assemble as externally-managed-metadata in read-only mode
# by writing magic values to sysfs.

# devices in correct order.
devs='sdb sdd sdf sde sdc'

# number of devices, both before and after reshape
before=4
after=5

# reshape position as sectors per array.  It must be a multiple
# of one stripe, so chunk*old_data_disks*new_data_disks.
# This number is 0.114 * 2930276992 * 3, rounded up to
# a multiple of 128*3*4.  Other multiples could be tried.
reshape=1002155520

# array parameters
level=raid5
chunk=65536
layout=2
component_size=2930276992

# always creates /dev/md_restore
mdadm -S /dev/md_restore
echo md_restore > /sys/module/md_mod/parameters/new_array
cd /sys/class/block/md_restore/md

echo external:readonly > metadata_version
echo $level > level
echo $chunk > chunk_size
echo $component_size > component_size
echo $layout > layout
echo $before > raid_disks
echo $reshape > reshape_position
echo $after > raid_disks

slot=0
for i in $devs
do
  cat /sys/class/block/$i/dev > new_dev
  echo 0 > dev-$i/offset
  echo $component_size > dev-$i/size
  echo insync > dev-$i/state
  echo $slot > dev-$i/slot
  slot=$[slot+1]
done

echo readonly > array_state
grep md_restore /proc/partitions
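The 'reshape' value in the script comes from straightforward arithmetic; a
quick sketch of how 1002155520 falls out of the numbers above (the 0.114
factor is only the approximate progress reported earlier, so this is an
estimate, not an exact restart position):

# Back-of-the-envelope check of reshape=1002155520, using values from the
# script above.  0.114 is the approximate reshape progress Sam reported.
component_size=2930276992          # per-device sectors used by the array
old_data_disks=3                   # 4 devices before the grow, minus parity
new_data_disks=4                   # 5 devices after the grow, minus parity
chunk_sectors=128                  # 64K chunk = 128 x 512-byte sectors

stripe=$((chunk_sectors * old_data_disks * new_data_disks))    # 1536 sectors
approx=$(awk "BEGIN { printf \"%d\", 0.114 * $component_size * $old_data_disks }")
# round up to the next multiple of one stripe
reshape=$(( ( (approx + stripe - 1) / stripe ) * stripe ))
echo $reshape                      # prints 1002155520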
* RE: RAID5 - Disk failed during re-shape
From: Sam Clark @ 2012-08-14 13:40 UTC
To: 'NeilBrown'; +Cc: 'Phil Turmel', linux-raid

Thanks Neil,

Tried that and failed on the first attempt, so I tried shuffling around the
dev order.. unfortunately I don't know what they were previously, but I do
recall being surprised that sdd was first on the list when I was looking at
it previously, so perhaps a starting point.  Since there are some 120
different permutations of device order (assuming all 5 could be anywhere),
I modified the script to accept parameters and automated it a little
further.

I ended up with a few 'possible successes' but none that would mount (i.e.
fsck actually ran and found problems with the superblocks, group descriptor
checksums and inode details, instead of failing with errorlevel 8).  The
most successful so far were the ones with sdd as device 1 and sde as device
2.. one particular combination (sdd sde sdb sdc sdf) seems to report every
time "/dev/md_restore has been mounted 35 times without being checked,
check forced."..  does this mean we're on the right combination?

In any case, that one produces a lot of output (some 54MB when fsck is
piped to a file) that looks bad and still fails to mount.  (I assume that
"mount -r /dev/md_restore /mnt/restore" is all I need to mount with?  I
also tried with "-t ext4", but that didn't seem to help either.)

This is a summary of the errors that appear:

Pass 1: Checking inodes, blocks, and sizes
(51 of these)
Inode 198574650 has an invalid extent node (blk 38369280, lblk 0)
Clear? no

(47 of these)
Inode 223871986, i_blocks is 2737216, should be 0.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
/lost+found not found.  Create? no

Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  +(36700161--36700162) +36700164 +36700166 +(36700168--36700170)
(this goes on like this for many pages.. in fact, most of the 54 MB is here)

(and 492 of these)
Free blocks count wrong for group #3760 (24544, counted=16439).
Fix? no

Free blocks count wrong for group #3761 (0, counted=16584).
Fix? no

/dev/md_restore: ********** WARNING: Filesystem still has errors **********
/dev/md_restore: 107033/274718720 files (5.6% non-contiguous), 976413581/1098853872 blocks

I also tried setting the reshape number to 1002152448, 1002153984,
1002157056, 1002158592 and 1002160128 (+/- a couple of multiples) but the
output didn't seem to change much in any case.. Not sure if there are many
different values worth testing there.

So, unless there's something else worth trying based on the above, it looks
to me that it's time to raise the white flag and start again... it's not too
bad, I'll recover most of the data.

Many thanks for your help so far, but if I may... 1 more question...
Hopefully I won't lose a disk during re-shape in the future, but just in
case I do, or for other unforeseen issues, what are good things to back up
on a system?  Is it enough to back up /etc/mdadm/mdadm.conf and /proc/mdstat
on a regular basis?  Or should I also back up the device superblocks?  Or
something else?

Ok, so that's actually 4 questions ... sorry :-)

Thanks again for all your efforts.
Sam

-----Original Message-----
From: NeilBrown
Sent: 14 August 2012 04:38
Subject: Re: RAID5 - Disk failed during re-shape

[trim /]
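Sam describes parameterising Neil's script and cycling through candidate
device orders.  A minimal sketch of that kind of wrapper follows, assuming
Neil's script has been saved as ./assemble_restore.sh and modified to take
the device order as command-line arguments; the script name, the candidate
orders beyond those named in the thread, and the log path are all
assumptions, not something actually posted.

#!/bin/bash
# Try a set of candidate device orders, assemble read-only with the
# (assumed) parameterised copy of Neil's script, and keep the "fsck -n"
# output of each attempt for comparison.
orders=(
    "sdb sdd sdf sde sdc"     # order suggested by the metadata
    "sdd sde sdb sdc sdf"     # the order Sam found most promising
    "sdd sdb sdf sde sdc"     # ...add further permutations as needed
)

for order in "${orders[@]}"; do
    ./assemble_restore.sh $order     # assumed wrapper; it stops any old md_restore itself
    log="/tmp/fsck-$(echo "$order" | tr ' ' '-').log"
    fsck -n /dev/md_restore > "$log" 2>&1
    echo "order [$order] -> fsck exit $?, log in $log"
done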
* Re: RAID5 - Disk failed during re-shape
From: NeilBrown @ 2012-08-14 21:05 UTC
To: Sam Clark; +Cc: 'Phil Turmel', linux-raid

On Tue, 14 Aug 2012 15:40:50 +0200 Sam Clark <sclark_77@hotmail.com> wrote:

> Thanks Neil,
>
> Tried that and failed on the first attempt, so I tried shuffling around the
> dev order.. unfortunately I don't know what they were previously, but I do
> recall being surprised that sdd was first on the list when I was looking at
> it previously, so perhaps a starting point.  Since there are some 120
> different permutations of device order (assuming all 5 could be anywhere),
> I modified the script to accept parameters and automated it a little
> further.
>
> I ended up with a few 'possible successes' but none that would mount (i.e.
> fsck actually ran and found problems with the superblocks, group descriptor
> checksums and inode details, instead of failing with errorlevel 8).  The
> most successful so far were the ones with sdd as device 1 and sde as device
> 2.. one particular combination (sdd sde sdb sdc sdf) seems to report every
> time "/dev/md_restore has been mounted 35 times without being checked,
> check forced."..  does this mean we're on the right combination?

Certainly encouraging.  However it might just mean that the first device is
correct.  I think you only need to find the filesystem superblock to be
able to report that.

> In any case, that one produces a lot of output (some 54MB when fsck is
> piped to a file) that looks bad and still fails to mount.  (I assume that
> "mount -r /dev/md_restore /mnt/restore" is all I need to mount with?  I
> also tried with "-t ext4", but that didn't seem to help either.)

54MB certainly seems like more than we were hoping for.

Yes, that mount command should be sufficient.  You could try adding
"-o noload".  I'm not sure what it does, but from the code it looks like it
tries to be more forgiving of some stuff.

[trim /]

> I also tried setting the reshape number to 1002152448, 1002153984,
> 1002157056, 1002158592 and 1002160128 (+/- a couple of multiples) but the
> output didn't seem to change much in any case.. Not sure if there are many
> different values worth testing there.

Probably not.

> So, unless there's something else worth trying based on the above, it looks
> to me that it's time to raise the white flag and start again... it's not too
> bad, I'll recover most of the data.
>
> Many thanks for your help so far, but if I may... 1 more question...
> Hopefully I won't lose a disk during re-shape in the future, but just in
> case I do, or for other unforeseen issues, what are good things to back up
> on a system?  Is it enough to back up /etc/mdadm/mdadm.conf and /proc/mdstat
> on a regular basis?  Or should I also back up the device superblocks?  Or
> something else?

There isn't really any need to back up anything.  Just don't use a buggy
kernel (which unfortunately I let out into the wild and got into Ubuntu).
The most useful thing if things do go wrong is the "mdadm --examine"
output of all devices.

> Ok, so that's actually 4 questions ... sorry :-)
>
> Thanks again for all your efforts.
> Sam

Sorry we couldn't get your data back.

NeilBrown
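Neil's suggestion to keep the "mdadm --examine" output of all devices on
hand is easy to automate; a minimal sketch that could be run from cron
(array name, member devices and output directory are examples, not from the
thread):

#!/bin/sh
# Periodically snapshot md metadata so it can be consulted after a failure.
# Device names and the output directory are examples only.
outdir=/var/backups/mdadm
mkdir -p "$outdir"
stamp=$(date +%Y%m%d)

mdadm --detail /dev/md0     >  "$outdir/md0-detail-$stamp.txt" 2>&1
cat /proc/mdstat            >> "$outdir/md0-detail-$stamp.txt"
for dev in /dev/sd[b-f]; do
    mdadm --examine "$dev"  >> "$outdir/md0-examine-$stamp.txt" 2>&1
done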
* RE: RAID5 - Disk failed during re-shape
From: Sam Clark @ 2012-08-15 16:32 UTC
To: 'NeilBrown'; +Cc: 'Phil Turmel', linux-raid

Unbelievable!  It mounted!

With the -o noload, my array is mounted, and files are readable!

I've tested a few, and they look fine, but it's obviously hard to be sure
on a larger scale.

In any case, I'll certainly be able to recover more data now!

Thanks again Neil!
Sam

-----Original Message-----
From: NeilBrown
Sent: 14 August 2012 23:06
Subject: Re: RAID5 - Disk failed during re-shape

[trim /]
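For completeness, the read-only mount and copy-off step the thread converges
on might look like the sketch below; "-o noload" is the option Neil
suggested, while the mount point and destination paths are examples only.

# Mount the read-only assembled array and copy the data somewhere safe
# before attempting any re-create.  /mnt/restore and /mnt/backup are
# example paths, not from the thread.
mkdir -p /mnt/restore
mount -r -o noload /dev/md_restore /mnt/restore

rsync -a --progress /mnt/restore/ /mnt/backup/

umount /mnt/restore
mdadm -S /dev/md_restore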
* Re: RAID5 - Disk failed during re-shape
From: NeilBrown @ 2012-08-15 22:38 UTC
To: Sam Clark; +Cc: 'Phil Turmel', linux-raid

On Wed, 15 Aug 2012 18:32:43 +0200 Sam Clark <sclark_77@hotmail.com> wrote:

> Unbelievable!  It mounted!
>
> With the -o noload, my array is mounted, and files are readable!
>
> I've tested a few, and they look fine, but it's obviously hard to be sure
> on a larger scale.
>
> In any case, I'll certainly be able to recover more data now!

.. and that's what makes it all worthwhile!

Thanks for hanging in there and letting us know the result.

NeilBrown

[trim /]