From: Phil Turmel <philip@turmel.org>
To: linux-raid@vger.kernel.org, robbat2@gentoo.org
Subject: Re: RAID6 seemingly shrunk itself after hard power outage and rebuild with replacement disk
Date: Sat, 05 Mar 2011 11:57:58 -0500
Message-ID: <4D726B96.3010501@turmel.org>
In-Reply-To: <robbat2-20110304T221807-988505787Z@orbis-terrarum.net>
Hi Robin,
On 03/04/2011 06:27 PM, Robin H. Johnson wrote:
> (Please CC, not subscribed to linux-raid).
>
> Problem summary:
> -------------------
> After a rebuild following disk replacement, the MD array (RAID6, 12 devices)
> appears to have shrunk by 10880KiB. Presumed to be missing from the start of
> the device, but not confirmed.
Sounds similar to a problem recently encountered by Simon McNeil...
> Background:
> -----------
> I got called in to help a friend with a data loss problem after a catastrophic
> UPS failure which killed at least one motherboard and several disks. Almost
> all of that damage led to no data loss, except on one system...
>
> For the system in question, one disk (cciss/c1d12) died and was promptly
> replaced; this problem started when the rebuild kicked in.
>
> Prior to calling me, my friend had already tried a few things from a rescue
> env, which almost certainly made the problem worse, and he doesn't have good
> logs of what he did.
I have a suspicion that 'mdadm --create --assume-clean' or some variant was one of those. And that the rescue environment has a version of mdadm >= 3.1.2. The default metadata alignment changed in that version.
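If that rescue environment is still available, it would be worth confirming: check its mdadm version, and (on a scratch machine, not this host) have it create a throwaway 1.2-metadata array on loop devices, then compare the reported Data Offset against what the surviving arrays show. A rough sketch; the image files and loop device names are just placeholders:
  # which mdadm does the rescue env carry?
  mdadm --version
  # throwaway members (sizes and paths are arbitrary)
  dd if=/dev/zero of=/tmp/t0.img bs=1M count=64
  dd if=/dev/zero of=/tmp/t1.img bs=1M count=64
  losetup /dev/loop0 /tmp/t0.img
  losetup /dev/loop1 /tmp/t1.img
  mdadm --create /dev/md9 --run --metadata=1.2 --level=1 \
      --raid-devices=2 /dev/loop0 /dev/loop1
  mdadm -E /dev/loop0 | grep 'Data Offset'
  mdadm --stop /dev/md9
  losetup -d /dev/loop0
  losetup -d /dev/loop1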
> The MD array held portions of two very large LVM LVs (15TiB and ~20TiB
> respectively). Specifically, the PV on the MD array supplied a chunk in the
> middle of each of the two LVs.
>
> The kernel version 2.6.35.4 did not change during the power outage.
>
> Problem identification:
> -----------------------
> When bringing the system back online, LVM refused to make one LV accessible as
> it complained of a shrunk device. One other LV exhibited corruption.
>
> The entry in /proc/partitions noted the array size of 14651023360KiB, while
> older LVM backups showed the usable size of the array to previously be
> 14651034240KiB, a difference of 10880KiB.
>
> The first LV has inaccessible data for all files at or after the missing chunk.
> All files prior to that point are accessible.
>
> LVM refused to bring the second LV online as it complained the physical device
> was now too small for all the extents.
>
> Prior to the outage, 800KiB of the collected devices was used for metadata;
> after the outage, 11680KiB is used (a difference of 10880KiB).
>
> Questions:
> ----------
> Why did the array shrink? How can I get it back to the original size, or
> accurately identify the missing chunk size and offset, so that I can adjust the
> LVM definitions and recover the other data?
Please share mdadm -E for all of the devices in the problem array, and a sample of mdadm -E for some of the devices in the working arrays. I think you'll find differences in the data offset. Newer mdadm aligns to 1MB. Older mdadm aligns to "superblock size + bitmap size".
"mdadm -E /dev/cciss/c1d{12..23}p1" should show us individual device details for the problem array.
> Collected information:
> ----------------------
>
> Relevant lines from /proc/partitions:
> =====================================
> 9 3 14651023360 md3
> 105 209 1465103504 cciss/c1d13p1
> ...
>
> Line from mdstat right now:
> ===========================
> md3 : active raid6 cciss/c1d18p1[5] cciss/c1d17p1[4] cciss/c1d13p1[0]
> cciss/c1d21p1[8] cciss/c1d20p1[7] cciss/c1d19p1[6] cciss/c1d15p1[2]
> cciss/c1d12p1[12] cciss/c1d14p1[1] cciss/c1d23p1[10] cciss/c1d16p1[3]
> cciss/c1d22p1[9]
> 14651023360 blocks super 1.2 level 6, 64k chunk, algorithm 2
> [12/12] [UUUUUUUUUUUU]
>
> MDADM output:
> =============
> # mdadm --detail /dev/md3
> /dev/md3:
> Version : 1.2
> Creation Time : Wed Feb 16 19:53:05 2011
> Raid Level : raid6
> Array Size : 14651023360 (13972.30 GiB 15002.65 GB)
> Used Dev Size : 1465102336 (1397.23 GiB 1500.26 GB)
> Raid Devices : 12
> Total Devices : 12
> Persistence : Superblock is persistent
>
> Update Time : Fri Mar 4 17:19:43 2011
> State : clean
> Active Devices : 12
> Working Devices : 12
> Failed Devices : 0
> Spare Devices : 0
>
> Layout : left-symmetric
> Chunk Size : 64K
>
> Name : CENSORED:3 (local to host CENSORED)
> UUID : efa04ecf:4dbd0bfa:820a5942:de8a234f
> Events : 25
>
> Number Major Minor RaidDevice State
> 0 105 209 0 active sync /dev/cciss/c1d13p1
> 1 105 225 1 active sync /dev/cciss/c1d14p1
> 2 105 241 2 active sync /dev/cciss/c1d15p1
> 3 105 257 3 active sync /dev/cciss/c1d16p1
> 4 105 273 4 active sync /dev/cciss/c1d17p1
> 5 105 289 5 active sync /dev/cciss/c1d18p1
> 6 105 305 6 active sync /dev/cciss/c1d19p1
> 7 105 321 7 active sync /dev/cciss/c1d20p1
> 8 105 337 8 active sync /dev/cciss/c1d21p1
> 9 105 353 9 active sync /dev/cciss/c1d22p1
> 10 105 369 10 active sync /dev/cciss/c1d23p1
> 12 105 193 11 active sync /dev/cciss/c1d12p1
The lowest device node is the last device role? Any chance these are also out of order?
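The superblocks record which role each member thinks it holds, so a quick check (same member list as above):
  for d in /dev/cciss/c1d{12..23}p1; do
      printf '%s: ' "$d"
      mdadm -E "$d" | grep 'Device Role'
  done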
> LVM PV definition:
> ==================
> pv1 {
> id = "CENSORED"
> device = "/dev/md3" # Hint only
> status = ["ALLOCATABLE"]
> flags = []
> dev_size = 29302068480 # 13.6448 Terabytes
> pe_start = 384
> pe_count = 3576912 # 13.6448 Terabytes
> }
It would be good to know where the LVM PV signature is on the problem array's devices, and which one has it. LVM stores a text copy of the VG's configuration in its metadata blocks at the beginning of a PV, so you should find it on the true "Raid device 0", at the original MD data offset from the beginning of the device.
I suggest scripting a loop through each device, piping the first MB or two (with dd) through "strings -t x" into grep, looking for the PV uuid in clear text.
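Something like this is what I have in mind; substitute the real PV uuid (censored in the config above) for the PVUUID placeholder. Reading 2MB per member should cover both the old and the new possible data offsets:
  PVUUID='xxxxxxxx'   # the real PV uuid from the LVM backup
  for d in /dev/cciss/c1d{12..23}p1; do
      echo "== $d"
      # LABELONE is the LVM2 label signature at the start of a PV
      dd if="$d" bs=1M count=2 2>/dev/null | strings -t x |
          grep -e "$PVUUID" -e LABELONE
  done
The member that gets a hit is your true raid device 0, and the hex offsets from "strings -t x" show approximately where the PV data actually starts on it.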
> LVM segments output:
> ====================
>
> # lvs --units 1m --segments \
> -o lv_name,lv_size,seg_start,seg_start_pe,seg_size,seg_pe_ranges \
> vg/LV1 vg/LV2
> LV LSize Start Start SSize PE Ranges
> LV1 15728640m 0m 0 1048576m /dev/md2:1048576-1310719
> LV1 15728640m 1048576m 262144 1048576m /dev/md2:2008320-2270463
> LV1 15728640m 2097152m 524288 7936132m /dev/md3:1592879-3576911
> LV1 15728640m 10033284m 2508321 452476m /dev/md4:2560-115678
> LV1 15728640m 10485760m 2621440 5242880m /dev/md4:2084381-3395100
> LV2 20969720m 0m 0 4194304m /dev/md2:0-1048575
> LV2 20969720m 4194304m 1048576 1048576m /dev/md2:1746176-2008319
> LV2 20969720m 5242880m 1310720 456516m /dev/md2:2270464-2384592
> LV2 20969720m 5699396m 1424849 511996m /dev/md2:1566721-1694719
> LV2 20969720m 6211392m 1552848 4m /dev/md2:1566720-1566720
> LV2 20969720m 6211396m 1552849 6371516m /dev/md3:0-1592878
> LV2 20969720m 12582912m 3145728 512000m /dev/md2:1438720-1566719
> LV2 20969720m 13094912m 3273728 7874808m /dev/md4:115679-2084380
>
If my suspicions are right, you'll have to use an old version of mdadm to redo an 'mdadm --create --assume-clean'.
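For reference only, and only after the -E reports confirm the member order and the original data offset, the re-create would look something like this (parameters and member order taken from your --detail output above; run it with the old mdadm so it reproduces the old offset, and treat it as a sketch, not a recipe, since a wrong order or chunk size makes things worse):
  mdadm --stop /dev/md3
  mdadm --create /dev/md3 --assume-clean --metadata=1.2 \
      --level=6 --chunk=64 --layout=left-symmetric --raid-devices=12 \
      /dev/cciss/c1d13p1 /dev/cciss/c1d14p1 /dev/cciss/c1d15p1 \
      /dev/cciss/c1d16p1 /dev/cciss/c1d17p1 /dev/cciss/c1d18p1 \
      /dev/cciss/c1d19p1 /dev/cciss/c1d20p1 /dev/cciss/c1d21p1 \
      /dev/cciss/c1d22p1 /dev/cciss/c1d23p1 /dev/cciss/c1d12p1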
HTH,
Phil