linux-raid.vger.kernel.org archive mirror
* RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
From: Robin H. Johnson @ 2011-03-04 23:27 UTC
  To: linux-raid

(Please CC, not subscribed to linux-raid).

Problem summary:
-------------------
After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
appears to have shrunk by 10880KiB. The loss is presumed to be at the start
of the device, but this is unconfirmed.

Background:
-----------
I got called in to help a friend with a data loss problem after a catastrophic
UPS failure that killed at least one motherboard and several disks. Almost
all of this led to no data loss, except on one system...

For the system in question, one disk died (cciss/c1d12) and was promptly
replaced; this problem started when the rebuild kicked in.

Prior to calling me, my friend had already tried a few things from a rescue
environment, almost certainly making the problem worse, and he doesn't have
good logs of what he did.

The MD array held portions of two very large LVM LVs (15TiB and ~20TiB
respectively).  Specifically, the PV on the MD array was a chunk in the middle
of each of the two LVs.

The kernel version (2.6.35.4) did not change across the power outage.

Problem identification:
-----------------------
When bringing the system back online, LVM refused to make one LV accessible,
complaining of a shrunk device. The other LV exhibited corruption.

The entry in /proc/partitions shows an array size of 14651023360KiB, while
older LVM backups show the usable size of the array was previously
14651034240KiB, a difference of 10880KiB.

The first LV has inaccessible data for all files at or after the missing chunk.
All files prior to that point are accessible.

LVM refused to bring the second LV online, complaining that the physical
device was now too small to hold all the extents.

Prior to the outage, 800KiB across the collected devices was used for
metadata; after the outage, 11680KiB is used (a difference of 10880KiB).
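
As a quick sanity check, the per-device arithmetic behind these totals
works out as follows (a sketch only, assuming the 10 data-bearing
members of the 12-disk RAID6 carry the capacity; sizes are taken from
the figures collected below):

  partition size           1465103504 KiB
  used dev size (now)      1465102336 KiB -> 1168 KiB consumed per device
  used dev size (before)   1465103424 KiB ->   80 KiB consumed per device

echo $(( (1168 - 80) * 10 ))   # prints 10880, matching the shrink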

Questions:
----------
Why did the array shrink? How can I get it back to the original size, or
accurately identify the missing chunk's size and offset, so that I can adjust
the LVM definitions and recover the other data?

Collected information:
----------------------

Relevant lines from /proc/partitions:
=====================================
   9        3 14651023360 md3
 105      209 1465103504 cciss/c1d13p1
 ...

Line from mdstat right now:
===========================
md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
cciss/c1d22p1[9]
      14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
	  [12/12] [UUUUUUUUUUUU]

MDADM output:
=============
# mdadm --detail /dev/md3
/dev/md3:
        Version : 1.2
  Creation Time : Wed Feb 16 19:53:05 2011
     Raid Level : raid6
     Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
  Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
   Raid Devices : 12
  Total Devices : 12
    Persistence : Superblock is persistent

    Update Time : Fri Mar  4 17:19:43 2011
          State : clean
 Active Devices : 12
Working Devices : 12
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : CENSORED:3  (local to host CENSORED)
           UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
         Events : 25

    Number   Major   Minor   RaidDevice State
       0     105      209        0      active sync   /dev/cciss/c1d13p1
       1     105      225        1      active sync   /dev/cciss/c1d14p1
       2     105      241        2      active sync   /dev/cciss/c1d15p1
       3     105      257        3      active sync   /dev/cciss/c1d16p1
       4     105      273        4      active sync   /dev/cciss/c1d17p1
       5     105      289        5      active sync   /dev/cciss/c1d18p1
       6     105      305        6      active sync   /dev/cciss/c1d19p1
       7     105      321        7      active sync   /dev/cciss/c1d20p1
       8     105      337        8      active sync   /dev/cciss/c1d21p1
       9     105      353        9      active sync   /dev/cciss/c1d22p1
      10     105      369       10      active sync   /dev/cciss/c1d23p1
      12     105      193       11      active sync   /dev/cciss/c1d12p1

LVM PV definition:
==================
  pv1 {
      id = "CENSORED"
      device = "/dev/md3" # Hint only
      status = ["ALLOCATABLE"]
      flags = []
      dev_size = 29302068480  # 13.6448 Terabytes
      pe_start = 384 
      pe_count = 3576912  # 13.6448 Terabytes
  }   
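
Note that dev_size above is in 512-byte sectors, which ties out with the
pre-outage array size quoted earlier:

echo $(( 29302068480 / 2 ))   # 14651034240 KiB, the old usable size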

LVM segments output:
====================

# lvs --units 1m --segments \
  -o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
  vg/LV1 vg/LV2
  LV    LSize     Start     Start   SSize    PE Ranges               
  LV1   15728640m        0m       0 1048576m /dev/md2:1048576-1310719
  LV1   15728640m  1048576m  262144 1048576m /dev/md2:2008320-2270463
  LV1   15728640m  2097152m  524288 7936132m /dev/md3:1592879-3576911
  LV1   15728640m 10033284m 2508321  452476m /dev/md4:2560-115678    
  LV1   15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
  LV2   20969720m        0m       0 4194304m /dev/md2:0-1048575      
  LV2   20969720m  4194304m 1048576 1048576m /dev/md2:1746176-2008319
  LV2   20969720m  5242880m 1310720  456516m /dev/md2:2270464-2384592
  LV2   20969720m  5699396m 1424849  511996m /dev/md2:1566721-1694719
  LV2   20969720m  6211392m 1552848       4m /dev/md2:1566720-1566720
  LV2   20969720m  6211396m 1552849 6371516m /dev/md3:0-1592878      
  LV2   20969720m 12582912m 3145728  512000m /dev/md2:1438720-1566719
  LV2   20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380 

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85


* Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
From: Stan Hoeppner @ 2011-03-05  8:32 UTC
  To: linux-raid, robbat2

Robin H. Johnson put forth on 3/4/2011 5:27 PM:

> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB.
----------------------------------------------------------------------------------
> The entry in /proc/partitions shows an array size of 14651023360KiB, while
> older LVM backups show the usable size of the array was previously
> 14651034240KiB, a difference of 10880KiB.
----------------------------------------------------------------------------------
> Prior to the outage, 800KiB across the collected devices was used for
> metadata; after the outage, 11680KiB is used (a difference of 10880KiB).
----------------------------------------------------------------------------------
> Why did the array shrink?

It appears the array shrunk by exactly the size of the new metadata,
10880KiB, if this is actually considered an array shrink.  It seems you
need to identify why the metadata size increased, and figure out a way
to revert it to its previous size.

Your current metadata version is 1.2.  What was it prior to the
catastrophic UPS event?

-- 
Stan


* Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
From: Phil Turmel @ 2011-03-05 16:57 UTC
  To: linux-raid, robbat2

Hi Robin,

On 03/04/2011 06:27 PM, Robin H. Johnson wrote:
> (Please CC, not subscribed to linux-raid).
> 
> Problem summary:
> -------------------
> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB. The loss is presumed to be at the start
> of the device, but this is unconfirmed.

Sounds similar to a problem recently encountered by Simon McNeil...

> Background:
> -----------
> I got called in to help a friend with a data loss problem after a catastrophic
> UPS failure that killed at least one motherboard and several disks. Almost
> all of this led to no data loss, except on one system...
> 
> For the system in question, one disk died (cciss/c1d12) and was promptly
> replaced; this problem started when the rebuild kicked in.
> 
> Prior to calling me, my friend had already tried a few things from a rescue
> environment, almost certainly making the problem worse, and he doesn't have
> good logs of what he did.

I have a suspicion that 'mdadm --create --assume-clean' or some variant was one of those, and that the rescue environment has a version of mdadm >= 3.1.2.  The default metadata alignment changed in that version.

> The MD array held portions of two very large LVM LVs (15TiB and ~20TiB
> respectively).  Specifically, the PV on the MD array was a chunk in the middle
> of each of the two LVs.
> 
> The kernel version (2.6.35.4) did not change across the power outage.
> 
> Problem identification:
> -----------------------
> When bringing the system back online, LVM refused to make one LV accessible,
> complaining of a shrunk device. The other LV exhibited corruption.
> 
> The entry in /proc/partitions shows an array size of 14651023360KiB, while
> older LVM backups show the usable size of the array was previously
> 14651034240KiB, a difference of 10880KiB.
> 
> The first LV has inaccessible data for all files at or after the missing chunk.
> All files prior to that point are accessible.
> 
> LVM refused to bring the second LV online, complaining that the physical
> device was now too small to hold all the extents.
> 
> Prior to the outage, 800KiB across the collected devices was used for
> metadata; after the outage, 11680KiB is used (a difference of 10880KiB).
> 
> Questions:
> ----------
> Why did the array shrink? How can I get it back to the original size, or
> accurately identify the missing chunk's size and offset, so that I can adjust
> the LVM definitions and recover the other data?

Please share mdadm -E for all of the devices in the problem array, and a sample of mdadm -E for some of the devices in the working arrays.  I think you'll find differences in the data offset.  Newer mdadm aligns to 1MB.  Older mdadm aligns to "superblock size + bitmap size".

"mdadm -E /dev/cciss/c1d{12..23}p1" should show us individual device details for the problem array.

> Collected information:
> ----------------------
> 
> Relevant lines from /proc/partitions:
> =====================================
>    9        3 14651023360 md3
>  105      209 1465103504 cciss/c1d13p1
>  ...
> 
> Line from mdstat right now:
> ===========================
> md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
> cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
> cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
> cciss/c1d22p1[9]
>       14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
> 	  [12/12] [UUUUUUUUUUUU]
> 
> MDADM output:
> =============
> # mdadm --detail /dev/md3
> /dev/md3:
>         Version : 1.2
>   Creation Time : Wed Feb 16 19:53:05 2011
>      Raid Level : raid6
>      Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
>   Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
>    Raid Devices : 12
>   Total Devices : 12
>     Persistence : Superblock is persistent
> 
>     Update Time : Fri Mar  4 17:19:43 2011
>           State : clean
>  Active Devices : 12
> Working Devices : 12
>  Failed Devices : 0
>   Spare Devices : 0
> 
>          Layout : left-symmetric
>      Chunk Size : 64K
> 
>            Name : CENSORED:3  (local to host CENSORED)
>            UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
>          Events : 25
> 
>     Number   Major   Minor   RaidDevice State
>        0     105      209        0      active sync   /dev/cciss/c1d13p1
>        1     105      225        1      active sync   /dev/cciss/c1d14p1
>        2     105      241        2      active sync   /dev/cciss/c1d15p1
>        3     105      257        3      active sync   /dev/cciss/c1d16p1
>        4     105      273        4      active sync   /dev/cciss/c1d17p1
>        5     105      289        5      active sync   /dev/cciss/c1d18p1
>        6     105      305        6      active sync   /dev/cciss/c1d19p1
>        7     105      321        7      active sync   /dev/cciss/c1d20p1
>        8     105      337        8      active sync   /dev/cciss/c1d21p1
>        9     105      353        9      active sync   /dev/cciss/c1d22p1
>       10     105      369       10      active sync   /dev/cciss/c1d23p1
>       12     105      193       11      active sync   /dev/cciss/c1d12p1

The lowest device node is the last device role?  Any chance these are also out of order?

> LVM PV definition:
> ==================
>   pv1 {
>       id = "CENSORED"
>       device = "/dev/md3" # Hint only
>       status = ["ALLOCATABLE"]
>       flags = []
>       dev_size = 29302068480  # 13.6448 Terabytes
>       pe_start = 384 
>       pe_count = 3576912  # 13.6448 Terabytes
>   }   

It would be good to know where the LVM PV signature is on the problem array's devices, and which one has it.  LVM stores a text copy of the VG's configuration in its metadata blocks at the beginning of a PV, so you should find it on the true "Raid device 0", at the original MD data offset from the beginning of the device.

I suggest scripting a loop through each device, piping the first 1MB (with dd) to "strings -t x" to grep, looking for the PV uuid in clear text.
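
Something along these lines (a sketch; PV_UUID is a placeholder for the
real id from the pv1 section of the LVM backup, and it is worth grepping
both with and without the hyphens, since the binary label and the text
metadata may store it differently):

PV_UUID="CENSORED"    # placeholder: substitute the real PV id
for dev in /dev/cciss/c1d{12..23}p1; do
    echo "=== $dev ==="
    dd if="$dev" bs=1M count=1 2>/dev/null | strings -t x | grep "$PV_UUID"
done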

> LVM segments output:
> ====================
> 
> # lvs --units 1m --segments \
>   -o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
>   vg/LV1 vg/LV2
>   LV    LSize     Start     Start   SSize    PE Ranges               
>   LV1   15728640m        0m       0 1048576m /dev/md2:1048576-1310719
>   LV1   15728640m  1048576m  262144 1048576m /dev/md2:2008320-2270463
>   LV1   15728640m  2097152m  524288 7936132m /dev/md3:1592879-3576911
>   LV1   15728640m 10033284m 2508321  452476m /dev/md4:2560-115678    
>   LV1   15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
>   LV2   20969720m        0m       0 4194304m /dev/md2:0-1048575      
>   LV2   20969720m  4194304m 1048576 1048576m /dev/md2:1746176-2008319
>   LV2   20969720m  5242880m 1310720  456516m /dev/md2:2270464-2384592
>   LV2   20969720m  5699396m 1424849  511996m /dev/md2:1566721-1694719
>   LV2   20969720m  6211392m 1552848       4m /dev/md2:1566720-1566720
>   LV2   20969720m  6211396m 1552849 6371516m /dev/md3:0-1592878      
>   LV2   20969720m 12582912m 3145728  512000m /dev/md2:1438720-1566719
>   LV2   20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380 
> 

If my suspicions are right, you'll have to use an old version of mdadm to redo an 'mdadm --create --assume-clean'.
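
Purely as a hypothetical sketch, to be run with a pre-3.1.2 mdadm
(level, chunk, layout and device order are taken from the --detail
output above, so verify the true order first; choose the metadata
version with -e to match whatever the original array used, and note
--assume-clean is essential so nothing is resynced over the data):

mdadm --create /dev/md3 --assume-clean --level=6 --raid-devices=12 \
      --chunk=64 --layout=left-symmetric \
      /dev/cciss/c1d13p1 /dev/cciss/c1d14p1 /dev/cciss/c1d15p1 \
      /dev/cciss/c1d16p1 /dev/cciss/c1d17p1 /dev/cciss/c1d18p1 \
      /dev/cciss/c1d19p1 /dev/cciss/c1d20p1 /dev/cciss/c1d21p1 \
      /dev/cciss/c1d22p1 /dev/cciss/c1d23p1 /dev/cciss/c1d12p1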

HTH,

Phil


* Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
From: Phil Turmel @ 2011-03-05 17:09 UTC
  To: linux-raid, robbat2

On 03/05/2011 11:57 AM, Phil Turmel wrote:
> Hi Robin,
[trim /]
> 
> Sounds similar to a problem recently encountered by Simon McNeil...

Whoops!  That was "Simon Mcnair".

Phil


* Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
From: Robin H. Johnson @ 2011-03-06 19:22 UTC
  To: Phil Turmel; +Cc: linux-raid, robbat2

On Sat, Mar 05, 2011 at 11:57:58AM -0500, Phil Turmel wrote:
> I have a suspicion that 'mdadm --create --assume-clean' or some
> variant was one of those.  And that the rescue environment has a
> version of mdadm >= 3.1.2.  The default metadata alignment changed in
> that version.
Confirmed.

> The lowest device node is the last device role?  Any chance these are also out of order?
Yes, the device order was later confirmed to be shuffled as well.

> If my suspicions are right, you'll have to use an old version of mdadm
> to redo an 'mdadm --create --assume-clean'.
I passed -e 0 with that, corrected the order of the devices, and then it
looked much better. There is some minor data corruption where the new
metadata overwrote data, but recovering those files is much easier than
recovering the entire 35TiB.

Thanks everybody for the help.

-- 
Robin Hugh Johnson
Gentoo Linux: Developer, Trustee & Infrastructure Lead
E-Mail     : robbat2@gentoo.org
GnuPG FP   : 11AC BA4F 4778 E3F6 E4ED  F38E B27B 944E 3488 4E85
