From: NeilBrown <neilb@suse.de>
To: Ross Boylan <ross@biostat.ucsf.edu>
Cc: linux-raid@vger.kernel.org
Subject: Re: System runs with RAID but fails to reboot [explanation?]
Date: Tue, 27 Nov 2012 13:15:27 +1100
Message-ID: <20121127131527.4d0d81d5@notabene.brown>
In-Reply-To: <1353973722.7078.51.camel@markov.biostat.ucsf.edu>
On Mon, 26 Nov 2012 15:48:42 -0800 Ross Boylan <ross@biostat.ucsf.edu> wrote:
> I may have an explanation for what happened, including why md0 and md1
> were treated differently.
> On Fri, 2012-11-23 at 16:15 -0800, Ross Boylan wrote:
> > On Thu, 2012-11-22 at 15:52 +1100, NeilBrown wrote:
> > > On Wed, 21 Nov 2012 08:58:57 -0800 Ross Boylan <ross@biostat.ucsf.edu> wrote:
> > >
> > > > I spent most of yesterday dealing with the failure of my (md) RAID
> > > > arrays to come up on reboot. If anyone can explain what happened or
> > > > what I can do to avoid it, I'd appreciate it. Also, I'd like to know if
> > > > the failure of one device in a RAID 1 can contaminate the other with bad
> > > > data (I think the answer must be yes, in general, but I can hope).
> > > >
> > > > In particular, I'll need to reinsert the disks I removed (described
> > > > below) without getting everything screwed up.
> > > >
> > > > Linux 2.6.32 amd64 kernel.
> > > >
> > > > I'll describe what I did for md1 first:
> > > >
> > > > 1. At the start, system has 3 physically identical disks. sda and sdc
> > > > are twins and sdb is unused, though partitioned. md1 is a raid1 of sda3
> > > > and sdc3. Disks have DOS partitions.
> > > > 2. Add 2 larger drives to the system. They become sdd and sde. These 2
> > > > are physically identical to each other, and bigger than the first batch
> > > > of drives.
> > > > 3. GPT format the drives with larger partitions than sda.
> > > > 4. mdadm --fail /dev/md1 /dev/sdc3
> > > > 5. mdadm --add /dev/md1 /dev/sdd4. Wait for sync.
> > > > 6. mdadm --add /dev/md1 /dev/sde4.
> > > > 7. mdadm --grow /dev/md1 -n 3. Wait for sync.
> > > >
> > > > md0 was the same story except I only added sdd (and I used partitions sda1
> > > > and sdd2).
> > > >
> > > > This all seemed to be working fine.
> > > >
> > > > Reboot.
> > > >
> > > > System came up with md0 as sda1 and sdd2, as expected.
> > > > But md1 was the failed sdc3 only. Note I did not remove the partition
> > > > from md1; maybe I needed to?
> First, the Debian initrd I'm using does recognize GPT partitions, and so
> unrecognized partitions did not cause the problem.
>
> Second, the initrd executes mdadm --assemble --scan --run --auto=yes.
> This uses conf/conf.d/md and etc/mdadm/mdadm.conf. The latter includes
> --num-devices for each array.
Yes, having an out-of-date "devices=" in mdadm.conf would cause the problems
you are having. You don't really want that at all.
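For illustration (the UUID below is made up, not one of your arrays), the
difference is between an identity line that also pins the component count,

    ARRAY /dev/md1 UUID=00000000:00000000:00000000:00000000 num-devices=2

which stops matching the moment the array is grown to 3 raid devices, and a
line that identifies the array by UUID alone,

    ARRAY /dev/md1 UUID=00000000:00000000:00000000:00000000

which keeps matching however the array is later reshaped.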
> Since I did not regenerate this after
> changing the array sizes, it was 2 for both arrays. man mdadm.conf says
> ARRAY The ARRAY lines identify actual arrays. The second word on the
> line should be the name of the device where the array is
> normally assembled, such as /dev/md1. Subsequent words identify
> the array, or identify the array as a member of a group. If
> multiple identities are given, then a component device must
> match ALL identities to be considered a match. [ num-devices is
> one of the identity keywords].
>
> This was fine for md0 (unless it should have been 3 because of the
> failed device),
It should be the number of "raid devices" i.e. the number of active devices
when the array is optimal. It ignores spares.
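(If in doubt, the count the array itself believes in can be read back with
something like the following; a sketch only, the exact output wording varies a
little between mdadm versions:

    mdadm --detail /dev/md1 | grep 'Raid Devices'
    #   Raid Devices : 3

Spares and failed devices are reported separately and do not count here.)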
> and at least consistent with the metadata on sdc3,
> formerly part of md1. It was inconsistent with the metadata for md1 on
> its current components, sda3, sdd4, and sde4, all of which indicate a
> size of 3 (or 4 if failed devices count).
>
> I do not know if the "must match" logic applies to --num-devices (since
> the manual says the option is mainly for compatibility with the output
> of --examine --scan), nor do I know if the --run option overrides the
> matching requirement. But md0's components might match the num-devices
> in mdadm.conf, while md1's current components do not match. md1's old
> component does match.
Yes, "must match" means "must match".
And this is exactly why md1's old component was made into an array while the
new components were ignored.
>
> I don't know if, before all that, udev triggers attempts to assemble
> arrays incrementally. Nor do I know how such incremental assembly works
> when some of the candidate devices are out of date.
"mdadm -I" (run from udev) pays more attention to the uuid than "mdadm -A"
does - it can only assemble one array with a given uuid. (mdadm -A will
sometimes assemble 2. That is the bug I mentioned in a previous email which
will be fixed in mdadm-3.3).
So it would see several devices with the same uuid, but some are inconsistent
with mdadm.conf so would be rejected (I think).
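Very roughly, the udev rule hands each block device to mdadm one at a time as
it appears, something like this (a sketch only; the real Debian rule and the
exact options differ):

    mdadm --incremental --run /dev/sdd4

and mdadm then decides, from the superblock UUID plus mdadm.conf, whether to
add the device to a partially assembled array, start a new one, or reject it.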
>
> So the mismatch between the array size for md0, but not md1, might
> explain why md0 came up as expected, but md1 came up as a single, old
> partition instead of the 3 current ones.
s/might/does/
>
> However, it is awkward for this account that after I set the array sizes
> to 1 for both md0 and md1 (using partitions from sda)--which would be
> inconsistent with the size in mdadm.conf--they both came up. There were
> fewer choices at that point, since I had removed all the other disks.
I guess that, as "all" the devices with a given UUID were consistent, mdadm -I
accepted them, treating them as an array "not present in mdadm.conf".
>
> Third, my recent experience suggests something more is going on, and
> perhaps the count considerations just mentioned are not that important.
> I'll put what happened at the end, since it happened after everything
> else described here.
> > > >
> > > > Shutdown, removed disk sdc from the computer. Reboot.
> > > > md0 is reassembled but md1 is not, and so the system cannot
> > > > come up (since root is on md1, via LVM). BTW, md1 is used as a PV for LVM;
> > > > md0 is /boot.
> > > >
> > > > In at least some kernels the GPT partitions were not recognized in the
> > > > initrd of the boot process (Knoppix 6--same version of the kernel,
> > > > 2.6.32, as my system, though I'm not sure the kernel modules are the same as
> > > > for Debian). I'm not sure if the GPT partitions were recognized under
> > > > Debian in the initrd, though they obviously were in the running system
> > > > at the start.
> > >
> > > Well if your initrd doesn't recognise GPT, then that would explain your
> > > problems.
> > I later found, using the Debian initrd, that arrays with fewer than the
> > expected number of devices (as in the n= parameter) do not get activated.
> > I think that's what you mean by "explain your problems." Or did you have
> > something else in mind?
> >
> > At least I think I found arrays with missing parts are not activated;
> > perhaps there was something else about my operations from knoppix 7
> > (described 2 paragraphs below this) that helped.
> >
> > The other problem with that discovery is that the first reboot activated
> > md1 with only 1 partition, even though md1 had never been configured
> > with <2.
> >
> > Most of my theories have the character of being consistent with some
> > behavior I saw and inconsistent with other observed behavior. Possibly
> > I misperceived or misremembered something.
> > >
> > > >
> > > > After much thrashing, I pulled all drives but sda and sdb. This was
> > > > still not sufficient to boot because the md's wouldn't come up. md0 was
> > > > reported as assembled, but was not readable. I'm pretty sure that was
> > > > because it wasn't activated (--run) since md was waiting for the
> > > > expected number of disks (2). md1, as before, wasn't assembled at all.
> > > >
> > > > From knoppix (v7, 32 bit) I activated both md's and shrunk them to size
> > > > 1 (--grow --force -n 1). In retrospect this probably could have been
> > > > done from the initrd.
> > > >
> > > > Then I was able to boot.
> > > >
> > > > I repartitioned sdb and added it to the RAID arrays. This led to hard
> > > > disk failures on sdb, though the arrays eventually were assembled. I
> > > > failed and removed the sdb partitions from the arrays and shrunk them.
> > > > I hope the bad sdb has not screwed up the good sda.
> > >
> > > It's not entirely impossible (I've seen it happen) but it is very unlikely
> > > that hardware errors on one device will "infect" the other.
> > Our local sysadmin also believes the errors in sdb were either
> > corrected, or resulted in an error code, rather than ever sending bad
> > data back. I'm proceeding on the assumption sda is OK.
> > >
> > > >
> > > > Thanks for any assistance you can offer.
> > >
> > > What sort of assistance are you after?
> > I'm trying to understand what happened and how to avoid having it happen
> > again.
> >
> > I'm also trying to understand under what conditions it is safe to insert
> > disks that have out of date versions of arrays in them.
> >
> > >
> > > first questions is: does the initrd handle GPT. If not, fix that first.
> > That is the first thing I'll check when I'm at the machine. The problem
> > with the "initrd didn't recognize GPT theory" was that in my very first
> > reboot md0 was assembled from two partitions, one of which was on a GPT
> > disk. (another example of "all my theories have contradictory evidence")
> >
> > Ross
> After running for a while with both RAIDs having size 1 and using sda
> exclusively, I shut down the system, removed the physically failing sdb,
> and added the 2 GPT disks, formerly known as sdd and sde. sdd has
> partitions that were part of md0 and md1; sde has a partition that was
> part of md1. For simplicity I'll continue to refer to them as sdd and
> sde, even though they were called sdb and sdc in the new configuration.
>
> This time, md0 came up with sdd2 (which is old) only and md1 came up
> correctly with sda3 only. Substantively sdd2 and sda1 are identical,
> since they hold /boot and there have been no recent changes to it.
>
> This happened across 2 consecutive boots. Once again, the older device
> (sdd2) was activated in preference to the newer one (sda1).
>
> In terms of counts for md0, mdadm.conf continued to indicate 2; sda1
> indicates 1 device; and sdd2 indicates 2 devices + 1 failed device.
That is why mdadm preferred sdd2 to sda1 - it matched mdadm.conf better.
I strongly suggest that you remove all "devices=" entries from mdadm.conf.
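One way to do that on Debian, as a rough sketch (the paths are the stock
Debian ones; check the generated lines before overwriting anything):

    # print ARRAY lines for the currently running arrays, keyed by UUID
    mdadm --detail --scan
    # make the ARRAY lines in /etc/mdadm/mdadm.conf match (dropping the
    # num-devices=/devices= identities), then rebuild the initramfs so the
    # copy of the file inside the initrd is refreshed
    update-initramfs -u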
NeilBrown
>
> BTW, by using break=bottom as a kernel parameter one can interrupt the
> initrd just after mdadm has run and see if the mappings are right. For
> the 2nd boot I did just that, and then manually shut down md0 and brought
> it back with sda1. The code appears to offer break=post-mdadm as an
> alternative, but that did not work for me (there was no break). These
> are Debian-specific tweaks, I believe.
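(For reference, a sketch of what that manual fix-up looks like at the
initramfs shell, using the device names from the paragraph above; this is
plain mdadm usage, nothing Debian-specific, and --run asks mdadm to start the
array even if it looks degraded:

    mdadm --stop /dev/md0
    mdadm --assemble --run /dev/md0 /dev/sda1

i.e. tear down the wrongly assembled array, then reassemble it from the
member you actually want.)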
>
> Ross