Re: Trouble adding disk to degraded array

From: Phil Turmel <philip@turmel.org>
To: "Nicholas Ipsen(Sephiroth_VII)" <sephiroth7vii@gmail.com>
Cc: linux-raid <linux-raid@vger.kernel.org>
Subject: Re: Trouble adding disk to degraded array
Date: Wed, 09 Jan 2013 16:54:04 -0500
Message-ID: <50EDE6FC.1060403@turmel.org>
In-Reply-To: <CAJ=LqmbKR1BczeFyGeQooDsPMf8PqriiD3z7i1_LjupGij5ewQ@mail.gmail.com>

Hi Nicholas,

[Top-posting fixed.  Please don't do that.]

On 01/09/2013 04:18 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
> On 9 January 2013 18:55, Phil Turmel <philip@turmel.org> wrote:
>> On 01/09/2013 12:21 PM, Nicholas Ipsen(Sephiroth_VII) wrote:
>>> I recently had mdadm mark a disk in my RAID5-array as faulty. As it
>>> was within warranty, I returned it to the manufacturer, and have now
>>> installed a new drive. However, when I try to add it, recovery fails
>>> about halfway through, with the newly added drive being marked as a
>>> spare, and one of my other drives marked as faulty!
>>>
>>> I seem to have full access to my data when assembling the array
>>> without the new disk using --force, and e2fsck reports no problems
>>> with the filesystem.
>>>
>>> What is happening here?
>>
>> You haven't offered a great deal of information here, so I'll speculate:
>> an unused sector on one of your original drives has become unreadable (per
>> most drive specs, this occurs naturally about every 12TB read).  Since
>> rebuilding an array involves computing parity for every stripe, the
>> unused sector is read and triggers the unrecoverable read error (URE).
>> Since the rebuild is incomplete, mdadm has no way to generate this
>> sector from another source, and doesn't know it isn't used, so the drive
>> is kicked out of the array.  You now have a double-degraded raid5, which
>> cannot continue operating.
>>
>> If you post the output of dmesg, "mdadm -D /dev/mdX", and "mdadm -E
>> /dev/sd[a-z]" (the latter with the appropriate member devices), we can
>> be more specific.
>>
>> BTW, this exact scenario is why raid6 is so popular, and why weekly
>> scrubbing is vital.
>>
>> It's also possible that you are experiencing the side effects of an
>> error timeout mismatch between your drives (defaults vary) and the linux
>> driver stack (default 30s).  Drive timeout must be less than the driver
>> timeout, or good drives will eventually be kicked out of your array.
>> Enterprise drives default to 7 seconds.  Desktop drives all default to
>> more than 60 seconds, and it seems most will spend up to 120 seconds.
>>
>> Cheap desktop drives cannot change their timeout.  For those, you must
>> change the driver timeout with:
>>
>> echo 120 >/sys/block/sdX/device/timeout
>>
>> Better desktop drives will allow you to set a 7 second timeout with:
>>
>> smartctl -l scterc,70,70 /dev/sdX
>>
>> Either solution must be executed on each boot, or drive hot-swap.
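
A minimal sketch of how the two fixes quoted above could be combined at
boot, assuming smartctl exits non-zero when a drive rejects the scterc
setting.  The function name fix_timeout and the SMARTCTL and SYSFS_ROOT
variables are hypothetical, added only so the logic can be dry-run:

```shell
#!/bin/sh
# Sketch only.  SMARTCTL and SYSFS_ROOT are hypothetical overrides for
# dry runs; by default the real smartctl and /sys are used.
SMARTCTL="${SMARTCTL:-smartctl}"
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Try a 7.0 second drive-side error recovery limit first.  If the drive
# refuses it, raise the kernel driver timeout for that device to 120s.
fix_timeout() {
    dev="$1"    # kernel name without the /dev/ prefix, e.g. sdc
    if $SMARTCTL -l scterc,70,70 "/dev/$dev" >/dev/null 2>&1; then
        echo "$dev: drive timeout set to 7.0s"
    else
        echo 120 >"$SYSFS_ROOT/block/$dev/device/timeout"
        echo "$dev: driver timeout raised to 120s"
    fi
}
```

Run as root for each array member on every boot and after each
hot-swap, e.g.:  for d in sda sdb sdc; do fix_timeout "$d"; done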

> Hello Phil, thank you for your prompt reply. It's the first time I've
> done any serious debugging work on mdadm, so please excuse my
> inadequacies. I've attached the files you requested. If you could
> please look through them and offer your thoughts, it'd be most
> appreciated.

I've looked at your dmesg, and it confirms that you had an unrecoverable
read error on /dev/sdc1. The attachment that was supposed to be the
output of "mdadm -E /dev/sd[abcde]1" was something else, but no big
deal.  (Partition #1 is the array member, not the whole drive.)

(You can put such things directly in the email in the future--easier to
read.)

At this point, you could try to re-write the sectors on /dev/sdc that
are currently unreadable, to get the drive to relocate them.  But I'd
recommend using dd_rescue, with the array stopped, to copy everything
readable from /dev/sdc onto the spare.

Then you can zero the superblock on /dev/sdc1, leave the copy in place,
and restart the array with the copy.  Then add sdc1 to the array, and
let mdadm rebuild (*to* sdc, instead of *from* sdc).
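
The steps above can be sketched as a dry run that only prints the
commands for review.  The function name print_plan is hypothetical, the
device names passed to it are placeholders for your own, and copying
partition to partition (rather than whole drive to whole drive) is an
assumption.  Nothing is executed:

```shell
#!/bin/sh
# Dry run: prints the recovery commands in order, runs none of them.
# Arguments: array, failing member, copy target, then the other members.
print_plan() {
    md="$1"; failing="$2"; copy="$3"; shift 3
    echo "mdadm --stop $md"                  # array must be stopped first
    echo "dd_rescue $failing $copy"          # copy everything readable
    echo "mdadm --zero-superblock $failing"  # old member no longer claims its slot
    echo "mdadm --assemble $md $copy $*"     # restart with the copy (may need --force)
    echo "mdadm $md --add $failing"          # rebuild *to* the old drive
}
```

For example:  print_plan /dev/mdX /dev/sdc1 /dev/sdY1 /dev/sda1 /dev/sdb1
Check each printed line against your own layout before running anything.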

This plan does depend on the problem with sdc being transient.  Many
UREs are, and are fixed by writing over them.  Please show the output of:

smartctl -x /dev/sdc

Phil