From: Mike Myers <mikesm559@yahoo.com>
To: Neil Brown <neilb@suse.de>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>,
linux-raid@vger.kernel.org, john lists <john4lists@gmail.com>
Subject: Re: Need urgent help in fixing raid5 array
Date: Tue, 6 Jan 2009 15:54:13 -0800 (PST)
Message-ID: <46247.99152.qm@web30807.mail.mud.yahoo.com>
In-Reply-To: 18787.59872.647206.934303@notabene.brown
Thanks for this and the previous explanation of how roles and slots work. I should be able to try a few combinations and see. At this point, I am not sure whether the issue was caused by a bad backplane, a bad controller, or a bad disk. I can't tell for sure that the backplane was bad, but I have a replacement sitting at my desk now, so I can go ahead and replace it just to be sure. The LSI MPT controller that failed was connected only to drives in md2, but that array is up and running fine, so I don't think it broke anything when it failed.
I had seen two SMART alerts indicating a drive was failing, which is what led me to replace the kicked drive with a new one and do a rebuild; that is what started this chain of events. I swapped the drive (part of md1), but the OS did not indicate that the SATA port went down and did not init the new drive. When I rebooted the system (suspecting a temporary problem with the controller), everything went to hell. I suspect this initial failure was due to the backplane problem, but there may have been some corruption on the disks as well.
I may have fat-fingered something after the reboot that caused a bad superblock to be written to sdf1. The device names may have changed on boot without my catching it (I may have done a hotswap a month ago when I had my first near-death experience with md2), leading me to use the wrong device in an mdadm command, but it's hard to tell that now.
With 15 hotswap drives in the system, I can tell you that device name changing is fraught with peril. I am unfamiliar with the /dev/disk/by-uuid functionality. Is that documented in a howto somewhere? How is that supposed to work?
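For context, a minimal sketch of how the persistent names behave, assuming the usual udev layout: each entry under /dev/disk/by-uuid is a symlink that udev points at whatever kernel name (sdf1, sdi1, ...) the device received on this boot, so resolving the links shows the current mapping. The sibling directory /dev/disk/by-id, whose names are derived from the drive model and serial number, can be read the same way. The helper name `persistent_names` is made up for this illustration.

```python
import os

def persistent_names(directory="/dev/disk/by-uuid"):
    """Map each persistent symlink name to the device node it points at right now."""
    mapping = {}
    if not os.path.isdir(directory):
        return mapping  # directory absent (for example, no udev); nothing to report
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            # realpath follows the link to the kernel name assigned on this boot
            mapping[name] = os.path.realpath(path)
    return mapping

if __name__ == "__main__":
    for name, node in persistent_names().items():
        print(f"{name} -> {node}")
```

Running it before and after a hotswap or reboot and comparing the two listings would show which kernel names moved.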
thx
mike
----- Original Message ----
From: Neil Brown <neilb@suse.de>
To: Mike Myers <mikesm559@yahoo.com>
Cc: Justin Piszcz <jpiszcz@lucidpixels.com>; linux-raid@vger.kernel.org; john lists <john4lists@gmail.com>
Sent: Tuesday, January 6, 2009 3:31:44 PM
Subject: Re: Need urgent help in fixing raid5 array
On Monday January 5, mikesm559@yahoo.com wrote:
> BTW, in the original email I sent that had the --examine info for
> each of these array members, three devices have the same device UUID
> and array slot, and two of them share an older event count, and one
> has a slightly newer event count. Which of these should be the real
> array slot 0? And I notice that one of the members in that email
> had a device UUID that I can't find anymore (I suspect it's the
> current sdf1 that thinks it's part of md2). In that email, it had
> array slot 4, which is one of the missing devices in the current
> family (that I assume --assemble would add as "3"). It also has
> 9663 hours on it, which makes it part of the original set of 4
> members for this raid5 array. The drive in slot 5 only has 7630
> hours on it, so it should have been added later as part of a --grow
> operation.
>
> Does all that make sense? If so, then sdb1, (which says it's slot
> 0), sdi1 (at 9671 hours) and also thinks it's slot 0, sdj1 (at 9194
> hours) which also says it's 0, and sdf1 (at 9663 hours) and used to
> apparently think it's slot 4 should be the original 4 drives of the
> array. How can I figure out which is the real slot 0, and who is
> slot 1 and 2 if sdi1 and sdj1 all have the same event count and
> array slot id (0) and same device UUID?
I had noticed the slot number was repeated. I hadn't noticed the
device uuid was the same, though I guess that makes sense. Somehow
the superblock for one device has been written to the other devices.
It is not really possible to be sure which is the original without
knowing how this happened, though I suspect that the one with the
higher event count is more likely to be the original.
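As a rough illustration of comparing event counts, the sketch below ranks devices by the number on the `Events :` line of saved `mdadm --examine` output. It assumes the count is printed as a plain integer on that line; the function names and the sample numbers are invented for the example and are not taken from this array.

```python
import re

def event_count(examine_text):
    """Return the integer from the 'Events : N' line, or None if there is none."""
    match = re.search(r"^\s*Events\s*:\s*(\d+)\s*$", examine_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def newest_first(reports):
    """Sort device names by event count, highest first; unknown counts go last."""
    def key(dev):
        count = event_count(reports[dev])
        return -1 if count is None else count
    return sorted(reports, key=key, reverse=True)

if __name__ == "__main__":
    # Hypothetical saved --examine output per device (numbers are made up).
    reports = {
        "sdb1": "     Array Slot : 0\n         Events : 100\n",
        "sdi1": "     Array Slot : 0\n         Events : 104\n",
        "sdj1": "     Array Slot : 0\n         Events : 100\n",
    }
    print(newest_first(reports))
```

A higher count only suggests which superblock was updated most recently; it does not prove which device the superblock originally belonged to.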
Being a software guy, I tend to like to blame hardware, and I wonder
if your problematic backplane managed to send write requests to the
wrong drive somehow. If it did, then my expectation of your success
just went down a few notches. :-(
The only option for you to try to find out which device is which is to
try various combinations and see what gives you access to the most
consistent data.
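To keep track of which combinations have been tried, the candidate orderings can be listed up front. This is only a sketch: it takes the devices whose slot is in doubt plus any slots treated as known, and yields every assignment. The function name, the four-slot layout, and the placement of sdf1 in slot 3 are assumptions for the example. It enumerates orderings and nothing more; assembling and checking each one is not shown.

```python
from itertools import permutations

def candidate_orders(ambiguous, fixed, total_slots):
    """Yield tuples of devices indexed by slot.

    `fixed` maps slot index -> device treated as known; every remaining
    slot is filled with each possible ordering of the `ambiguous` devices.
    """
    open_slots = [s for s in range(total_slots) if s not in fixed]
    for perm in permutations(ambiguous, len(open_slots)):
        order = [None] * total_slots
        for slot, dev in fixed.items():
            order[slot] = dev
        for slot, dev in zip(open_slots, perm):
            order[slot] = dev
        yield tuple(order)

if __name__ == "__main__":
    # Assumption: three devices all claim slot 0, and sdf1 is placed in slot 3.
    for order in candidate_orders(["sdb1", "sdi1", "sdj1"], {3: "sdf1"}, 4):
        print(order)
```

With three ambiguous devices there are six orderings, which is few enough to work through by hand.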
>
> This is way harder work than should be needed to fix a problem. :-)
> But I am sure glad you gurus know how this stuff is supposed to
> work!
I'm happy to help as much as I can... I just hope your hardware hasn't
done too much damage...
NeilBrown