From: Phil Turmel <philip@turmel.org>
To: linux-raid@vger.kernel.org, robbat2@gentoo.org
Subject: Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
Date: Sat, 05 Mar 2011 11:57:58 -0500
Message-ID: <4D726B96.3010501@turmel.org>
In-Reply-To: <robbat2-20110304T221807-988505787Z@orbis-terrarum.net>
Hi Robin,
On 03/04/2011 06:27 PM, Robin H. Johnson wrote:
> (Please CC, not subscribed to linux-raid).
>
> Problem summary:
> -------------------
> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB. Presumed at the start of the device, but no
> confirmation.
Sounds similar to a problem recently encountered by Simon McNeil...
> Background:
> -----------
> I got called in to help a friend with a data loss problem after a catastrophic
> UPS failure which killed at least one motherboard and several disks. Almost
> all of these failures led to no data loss, except on one system...
>
> For the system in question, one disk died (cciss/c1d12), and was
> promptly replaced, and this problem started when the rebuild kicked in.
>
> Prior to calling me, my friend had already tried a few things from a rescue
> env, and almost certainly contributed to making the problem worse, and doesn't
> have good logs of what he did.
I suspect that 'mdadm --create --assume-clean' or some variant was one of those, and that the rescue environment has a version of mdadm >= 3.1.2. The default metadata alignment changed in that version.
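As a rough sanity check on that theory (this is only my arithmetic on the figures in your report, not anything read from the devices): a 12-device RAID6 has 10 data devices, so the 10880KiB loss can be split per member. The variable names are just illustrative.

```shell
# Figures taken from the problem summary.
lost_kib=10880                       # how much the array shrank
raid_devices=12                      # RAID6 members
data_devices=$((raid_devices - 2))   # RAID6 spends two devices' worth on parity

per_device_kib=$((lost_kib / data_devices))
remainder_kib=$((lost_kib % data_devices))
echo "per-device loss: ${per_device_kib} KiB (remainder ${remainder_kib})"
```

The loss divides evenly into 1088KiB per member (1024KiB plus one 64k chunk), which is what I would expect if every member's data area shrank by the same amount.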
> The MD array was portions of two very large LVM LVs (15TiB and ~20TiB
> respectively). Specifically, the PV of the MD array was a chunk in the middle of
> each of the two LVs.
>
> The kernel version 2.6.35.4 did not change during the power outage.
>
> Problem identification:
> -----------------------
> When bringing the system back online, LVM refused to make one LV accessible as
> it complained of a shrunk device. One other LV exhibited corruption.
>
> The entry in /proc/partitions noted the array size of 14651023360KiB, while
> older LVM backups showed the usable size of the array to previously be
> 14651034240KiB, a difference of 10880KiB.
>
> The first LV has inaccessible data for all files at or after the missing chunk.
> All files prior to that point are accessible.
>
> LVM refused to bring the second LV online as it complained the physical device
> was now too small for all the extents.
>
> Prior to the outage, 800KiB of the collected devices was used for metadata;
> after the outage, 11680KiB is used (a difference of 10880KiB).
>
> Questions:
> ----------
> Why did the array shrink? How can I get it back to the original size, or
> accurately identify the missing chunk size and offset, so that I can adjust the
> LVM definitions and recover the other data?
Please share mdadm -E for all of the devices in the problem array, and a sample of mdadm -E for some of the devices in the working arrays. I think you'll find differences in the data offset. Newer mdadm aligns to 1MB. Older mdadm aligns to "superblock size + bitmap size".
"mdadm -E /dev/cciss/c1d{12..23}p1" should show us individual device details for the problem array.
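If the full output is too long to eyeball, a sketch like the following pulls out just the "Data Offset" line per device so the problem array and a working array can be compared side by side. I am assuming version 1.x superblocks (where "mdadm -E" reports "Data Offset : N sectors") and the device names from your mdstat; the helper name is made up.

```shell
# Print the "Data Offset" value (in 512-byte sectors) from "mdadm -E" output on stdin.
data_offset_sectors() {
    sed -n 's/^ *Data Offset *: *\([0-9][0-9]*\) sectors.*$/\1/p'
}

# Members of the problem array; change the range for one of the working arrays.
for n in $(seq 12 23); do
    dev="/dev/cciss/c1d${n}p1"
    [ -b "$dev" ] || continue        # skip anything that is not a block device
    printf '%s\t%s\n' "$dev" "$(mdadm -E "$dev" | data_offset_sectors)"
done
```

This only reads the superblocks, so it should be safe to run (as root) on the assembled array's members.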
> Collected information:
> ----------------------
>
> Relevant lines from /proc/partitions:
> =====================================
> 9 3 14651023360 md3
> 105 209 1465103504 cciss/c1d13p1
> ...
>
> Line from mdstat right now:
> ===========================
> md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
> cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
> cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
> cciss/c1d22p1[9]
> 14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [12/12] [UUUUUUUUUUUU]
>
> MDADM output:
> =============
> # mdadm --detail /dev/md3
> /dev/md3:
> Version : 1.2
> Creation Time : Wed Feb 16 19:53:05 2011
> Raid Level : raid6
> Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
> Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
> Raid Devices : 12
> Total Devices : 12
> Persistence : Superblock is persistent
>
> Update Time : Fri Mar 4 17:19:43 2011
> State : clean
> Active Devices : 12
> Working Devices : 12
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : CENSORED:3 (local to host CENSORED)
> UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
> Events : 25
>
> Number Major Minor RaidDevice State
> 0 105 209 0 active sync /dev/cciss/c1d13p1
> 1 105 225 1 active sync /dev/cciss/c1d14p1
> 2 105 241 2 active sync /dev/cciss/c1d15p1
> 3 105 257 3 active sync /dev/cciss/c1d16p1
> 4 105 273 4 active sync /dev/cciss/c1d17p1
> 5 105 289 5 active sync /dev/cciss/c1d18p1
> 6 105 305 6 active sync /dev/cciss/c1d19p1
> 7 105 321 7 active sync /dev/cciss/c1d20p1
> 8 105 337 8 active sync /dev/cciss/c1d21p1
> 9 105 353 9 active sync /dev/cciss/c1d22p1
> 10 105 369 10 active sync /dev/cciss/c1d23p1
> 12 105 193 11 active sync /dev/cciss/c1d12p1
The lowest device node is the last device role? Any chance these are also out of order?
> LVM PV definition:
> ==================
> pv1 {
> id = "CENSORED"
> device = "/dev/md3" # Hint only
> status = ["ALLOCATABLE"]
> flags = []
> dev_size = 29302068480 # 13.6448 Terabytes
> pe_start = 384
> pe_count = 3576912 # 13.6448 Terabytes
> }
It would be good to know where the LVM PV signature is on the problem array's devices, and which one has it. LVM stores a text copy of the VG's configuration in its metadata blocks at the beginning of a PV, so you should find it on the true "Raid device 0", at the original MD data offset from the beginning of the device.
I suggest scripting a loop over the member devices: read the first 1MB of each with dd, pipe it through "strings -t x", and grep for the PV uuid in clear text.
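Something along these lines, as an untested sketch. The pv_id value is a placeholder for the real PV id from your LVM backup (it is censored above), the device range is taken from your mdstat, and scan_head is just a name I made up.

```shell
# Scan the first 1MiB arriving on stdin for a fixed string; print the matching
# strings with their hex offsets (offsets are relative to the start of the input).
scan_head() {
    dd bs=1024 count=1024 2>/dev/null | strings -t x | grep -F -- "$1"
}

pv_id='CENSORED'    # placeholder: the PV id from the LVM backup (pv1 { id = "..." })

for n in $(seq 12 23); do
    dev="/dev/cciss/c1d${n}p1"
    [ -b "$dev" ] || continue        # skip anything that is not a block device
    if hits=$(scan_head "$pv_id" < "$dev"); then
        printf '== %s\n%s\n' "$dev" "$hits"
    fi
done
```

Whichever device reports hits is the candidate for the true "Raid device 0", and the hex offsets show how far into the partition the LVM metadata sits. Everything here is read-only.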
> LVM segments output:
> ====================
>
> # lvs --units 1m --segments \
> -o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
> vg/LV1 vg/LV2
> LV  LSize         Start   Start    SSize PE Ranges
> LV1 15728640m        0m       0 1048576m /dev/md2:1048576-1310719
> LV1 15728640m  1048576m  262144 1048576m /dev/md2:2008320-2270463
> LV1 15728640m  2097152m  524288 7936132m /dev/md3:1592879-3576911
> LV1 15728640m 10033284m 2508321  452476m /dev/md4:2560-115678
> LV1 15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
> LV2 20969720m        0m       0 4194304m /dev/md2:0-1048575
> LV2 20969720m  4194304m 1048576 1048576m /dev/md2:1746176-2008319
> LV2 20969720m  5242880m 1310720  456516m /dev/md2:2270464-2384592
> LV2 20969720m  5699396m 1424849  511996m /dev/md2:1566721-1694719
> LV2 20969720m  6211392m 1552848       4m /dev/md2:1566720-1566720
> LV2 20969720m  6211396m 1552849 6371516m /dev/md3:0-1592878
> LV2 20969720m 12582912m 3145728  512000m /dev/md2:1438720-1566719
> LV2 20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380
>
If my suspicions are right, you'll have to use an old version of mdadm to redo an 'mdadm --create --assume-clean'.
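Whatever route you take, you can check the result against what LVM expects before activating anything. The sketch below only does arithmetic on numbers from your report; the 4MiB extent size is my inference from the lvs output (a segment starting at 1048576m starts at PE 262144), so treat it as an assumption.

```shell
# Figures from the LVM backup, lvs output and /proc/partitions in the report.
dev_size_sectors=29302068480   # pv1 dev_size, in 512-byte sectors
pe_start_sectors=384           # pv1 pe_start
pe_count=3576912               # pv1 pe_count
extent_sectors=8192            # ASSUMED 4MiB extents, inferred from the lvs output
current_kib=14651023360        # md3 size in /proc/partitions right now

original_kib=$((dev_size_sectors / 2))
needed_kib=$(( (pe_start_sectors + pe_count * extent_sectors) / 2 ))
echo "original array size : ${original_kib} KiB"
echo "minimum for all PEs : ${needed_kib} KiB"
echo "current shortfall   : $((needed_kib - current_kib)) KiB"
```

The first figure matches the 14651034240KiB from your older LVM backups, so that is the size /proc/partitions should report for md3 once the array is back in its original layout.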
HTH,
Phil