* need a little help rebuilding a raid 10
@ 2011-12-06  2:05 Greg Freemyer
  2011-12-06 14:11 ` Greg Freemyer
  0 siblings, 1 reply; 5+ messages in thread
From: Greg Freemyer @ 2011-12-06  2:05 UTC (permalink / raw)
  To: Linux RAID

All,

I have a raid10 that failed recently due to a failed drive slot.  The
drive is good from what I can tell.  In theory it is rebuilding now.

1) Once the current recovery process finishes, are there any commands
I can (should) issue to make sure the array is consistent?  I'm afraid
my mirror halves won't really be in sync.

2) If I want to pause the recovery and do some real production work,
can I do that?  How?

== details

Not sure why but each of the members dropped one by one until the
raid10 went offline.

I likely did something wrong by now, but I currently have it in this state:

md127 : active raid10 sdb5[4] sda5[0] sdc3[5] sdd3[2]
      923517952 blocks super 1.2 512K chunks 2 near-copies [4/3] [UUU_]
      [>....................]  recovery =  0.8% (4117760/461758976)
finish=1373.3min speed=5553K/sec

(it used to be md2.  No idea where md127 came from.  There are only 4
md's on the machine.)

It's currently providing a usable volume I think.  I just rebooted the
machine and the filesystem looks good at first glance.

The recovery looks very slow to me, but maybe I still have hardware issues.
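
As a rough sanity check on the mdstat figures above (a sketch, not
something md itself runs): the remaining blocks divided by the reported
speed should land close to the reported finish time.  The function name
eta_minutes is made up for illustration; the three numbers are copied
from the recovery line.

```shell
#!/bin/sh
# Hypothetical helper: estimate the remaining recovery time in whole minutes.
# Arguments: blocks done, blocks total (1K blocks, as shown in /proc/mdstat),
# and the current speed in K/sec.
eta_minutes() {
    done_blocks=$1
    total_blocks=$2
    speed_k_per_sec=$3
    # Integer arithmetic: remaining K / (K per second) / 60 seconds.
    echo $(( (total_blocks - done_blocks) / speed_k_per_sec / 60 ))
}

# Values taken from the recovery line quoted above.
eta_minutes 4117760 461758976 5553
```

This prints 1373, in line with the finish=1373.3min that md reported, so
the long estimate is simply what a sustained 5553K/sec implies for a
member of this size.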

The first 2 members forming a raid 1 immediately after being told
makes sense to me.  I don't understand how the 3rd member got sync'ed
up so fast.  It seemed to be instantaneous and I don't think it was in
sync.

Originally it was a raid10 with
sda5 mirrored to sdb5
sdc3 mirrored to sdd3
(or so I believe)

Immediately after the failure I had nothing, so I did:
# mdadm --stop /dev/md2

# mdadm --create /dev/md2 -v --assume-clean --level=raid10
--raid-devices=4 /dev/sda5 missing /dev/sdd3 missing

(or similar, my sdX names have been changing as this event progresses.
These names are based on what I see in mdstat.)
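
For reference, a small sketch of why the order of devices and "missing"
placeholders on that command line matters.  It assumes the default
near=2 layout on 4 devices, where md keeps the two copies of each chunk
on adjacent slots (0 with 1, 2 with 3).  mirror_partner is a
hypothetical helper for illustration, not an mdadm command.

```shell
#!/bin/sh
# Assumption: raid10, 2 near-copies, 4 raid devices, so slots pair up as
# (0,1) and (2,3).  Given a slot number, print the slot holding its mirror.
mirror_partner() {
    s=$1
    if [ $((s % 2)) -eq 0 ]; then
        echo $((s + 1))
    else
        echo $((s - 1))
    fi
}

# Walk the device list in the same order as the --create command above.
slot=0
for dev in /dev/sda5 missing /dev/sdd3 missing; do
    echo "slot $slot: $dev (mirror is slot $(mirror_partner "$slot"))"
    slot=$((slot + 1))
done
```

Under that assumption, the command keeps one present member in each
mirror pair and leaves the other half of each pair empty.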

I ran that way for a day, which is why I really don't think either of
the missing mirror halves should have immediately sync'ed.

Anyway, I have a backup but I prefer not to use it if it can be
avoided.  (the machine is in sporadic production, for an hour or two
at a time, and going offline for a day to recreate it from scratch
does not sound like fun.)


Thanks
Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
CNN/TruTV Aired Forensic Imaging Demo -
   http://insession.blogs.cnn.com/2010/03/23/how-computer-evidence-gets-retrieved/

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: need a little help rebuilding a raid 10
  2011-12-06  2:05 need a little help rebuilding a raid 10 Greg Freemyer
@ 2011-12-06 14:11 ` Greg Freemyer
  2011-12-06 14:39   ` Robin Hill
  2011-12-06 14:52   ` Phil Turmel
  0 siblings, 2 replies; 5+ messages in thread
From: Greg Freemyer @ 2011-12-06 14:11 UTC (permalink / raw)
  To: Linux RAID

Hmm...

My rebuild failed.  At first glance I had both a failed drive and a failed slot?

What I don't understand is that I have I/O errors in /var/log/messages
from when the rebuild failed overnight.

But this morning, hdparm --read-sector is reading the "bad" sectors fine.

I already tried replacing the drive and the replacement drive also
reported media errors during the rebuild, that's why I came to believe
I had a bad slot.

Now I have non-repeatable media errors.

fyi: I have the problem drive connected via eSATA now, so it's on a
totally different controller than where it was when the failure first
occurred.

Any thoughts?

Thanks
Greg


* Re: need a little help rebuilding a raid 10
  2011-12-06 14:11 ` Greg Freemyer
@ 2011-12-06 14:39   ` Robin Hill
  2011-12-06 14:52   ` Phil Turmel
  1 sibling, 0 replies; 5+ messages in thread
From: Robin Hill @ 2011-12-06 14:39 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Linux RAID

On Tue Dec 06, 2011 at 09:11:24AM -0500, Greg Freemyer wrote:

> Hmm...
> 
> My rebuild failed.  At first glance I had both a failed drive and a failed slot?
> 
> What I don't understand is I have I/O errors in /var/log/messages from
> when the rebuild failed over night.
> 
> But this morning, hdparm --read-sector is reading the "bad" sectors fine.
> 
> I already tried replacing the drive and the replacement drive also
> reported media errors during the rebuild, that's why I came to believe
> I had a bad slot.
> 
> Now I have non-repeatable media errors.
> 
> fyi: I have the problem drive connected via eSata now, so it's a
> different controller totally than where it was when the failure first
> occurred.
> 
> Any thoughts?
> 
Last time I had this sort of issue, it was down to the motherboard.
Somewhere between the drives and the CPU, one or more of the chipsets
were causing issues (I actually had the same issue on multiple
motherboards, though I think using the same/similar onboard SATA
controllers). Single drive tests worked fine - it was only when
hammering the entire array that it would get a write error and fail a
random drive. I've since bought a proper SAS/SATA PCIe card (Intel
RS2WC080) and have had no issues since.

The other things I can think of that may cause this type of issue are a
flaky PSU, or physical shock to the server chassis (even relatively
small movements can cause read/write slowdowns/errors - there's a video
clip online of someone just shouting in front of a rack and causing the
transfer speed to drop off dramatically).

Cheers,
    Robin
-- 
     ___        
    ( ' }     |       Robin Hill        <robin@robinhill.me.uk> |
   / / )      | Little Jim says ....                            |
  // !!       |      "He fallen in de water !!"                 |


* Re: need a little help rebuilding a raid 10
  2011-12-06 14:11 ` Greg Freemyer
  2011-12-06 14:39   ` Robin Hill
@ 2011-12-06 14:52   ` Phil Turmel
  2011-12-07  1:35     ` Greg Freemyer
  1 sibling, 1 reply; 5+ messages in thread
From: Phil Turmel @ 2011-12-06 14:52 UTC (permalink / raw)
  To: Greg Freemyer; +Cc: Linux RAID

Hi Greg,

On 12/06/2011 09:11 AM, Greg Freemyer wrote:
> Hmm...
> 
> My rebuild failed.  At first glance I had both a failed drive and a failed slot?
> 
> What I don't understand is I have I/O errors in /var/log/messages from
> when the rebuild failed over night.

Something in your system is untrustworthy.

> But this morning, hdparm --read-sector is reading the "bad" sectors fine.

What does smartctl say about your drives (all of them)?

> I already tried replacing the drive and the replacement drive also
> reported media errors during the rebuild, that's why I came to believe
> I had a bad slot.
> 
> Now I have non-repeatable media errors.
> 
> fyi: I have the problem drive connected via eSata now, so it's a
> different controller totally than where it was when the failure first
> occurred.

Are the errors in /var/log/messages only from that drive?  If so, then that
drive is probably toast.

> Any thoughts?

Your prior e-mail said that you re-created the array.  I didn't see that you
had definitively nailed down the problem at that point, so it probably wasn't
a good idea.  In particular, it destroys all prior metadata on the array
members.  If you didn't keep the output of "mdadm -E" for each drive, that
information is now lost.

In general, "--create" is a last resort, and only to be used for recovery
when you have absolute confidence you understand the layout (mdadm -E
printouts of the original array).  "--assemble --force" is the proper step
after "--assemble" fails.
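
A minimal sketch of that order of operations, assuming the array and
member names used earlier in the thread.  It only prints the commands
rather than running them; print_recovery_plan and the examine-*.txt file
names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print (not execute) a cautious recovery sequence.
# Usage: print_recovery_plan /dev/mdX member [member ...]
print_recovery_plan() {
    md=$1
    shift
    # 1. Save each member's superblock details before changing anything.
    for dev in "$@"; do
        echo "mdadm --examine $dev > examine-$(basename "$dev").txt"
    done
    # 2. Try a normal assemble first.
    echo "mdadm --assemble $md $*"
    # 3. Escalate to a forced assemble only if step 2 fails.
    echo "# only if the plain assemble fails:"
    echo "mdadm --assemble --force $md $*"
}

print_recovery_plan /dev/md2 /dev/sda5 /dev/sdb5 /dev/sdc3 /dev/sdd3
```

Nothing in this sequence rewrites the superblocks the way "--create"
does, so the saved "mdadm -E" output stays meaningful if a re-create is
ever needed later.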

I would completely scrub the questionable drive with random data, run a long
smartctl test on it, and replace it if it reports any re-allocated sectors at
that point.

I would also run long smartctl tests on the other drives, looking for pending
sectors or re-allocated sectors.  If any, I would plan on replacements for
them as well, and would try to validate the content of your files.  You do
have a backup to compare against, after all.
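
One way to pick those counters out of the attribute table, as a sketch:
it assumes the common "smartctl -A" column layout with the raw value in
the 10th column, and the attribute names Reallocated_Sector_Ct and
Current_Pending_Sector, which can differ between drive vendors.
flag_bad_sectors is a hypothetical helper, and the sample rows are made
up.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print any
# reallocated/pending sector attribute whose raw value is non-zero.
flag_bad_sectors() {
    awk '($2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector") \
         && $10 + 0 > 0 { print $2 "=" $10 }'
}

# Made-up sample rows in the usual attribute-table layout.
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8'

printf '%s\n' "$sample" | flag_bad_sectors
```

On a real drive the long test would be started with
"smartctl -t long /dev/sdX" and, once it completes, the table fed in
with "smartctl -A /dev/sdX | flag_bad_sectors".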

If you are running a Debian-based distro, and the array contains your rootfs,
you might find "debsums" useful.

HTH,

Phil


* Re: need a little help rebuilding a raid 10
  2011-12-06 14:52   ` Phil Turmel
@ 2011-12-07  1:35     ` Greg Freemyer
  0 siblings, 0 replies; 5+ messages in thread
From: Greg Freemyer @ 2011-12-07  1:35 UTC (permalink / raw)
  Cc: Linux RAID

All,

I found a fan that wasn't working.

This is a 1U rack-mount unit, so that fan not working apparently caused
a lot of issues.

I replaced the fan about 10 hours ago and I've done a bunch of
different tests today.  No disk errors reported in that time.

I gave up on my previous array.  I just deleted it and recreated it.

I'm restoring from backup now.

Thanks
Greg

