From: Phil Turmel <philip@turmel.org>
To: Barrett Lewis <barrett.lewis.mitsi@gmail.com>
Cc: "linux-raid@vger.kernel.org" <linux-raid@vger.kernel.org>
Subject: Re: Mdadm server eating drives
Date: Thu, 13 Jun 2013 22:08:40 -0400
Message-ID: <51BA7B28.9030808@turmel.org>
In-Reply-To: <CAPSPcXhn8WKcZMVWhSDBVkRyqwo5XCyDc55f2sgr6davVsN5XA@mail.gmail.com>
Hi Barrett,
Please interleave your replies, and trim unnecessary quotes.
On 06/13/2013 08:19 PM, Barrett Lewis wrote:
> Sorry for the delay, I wanted to let the memtest run for 48 hours.
> It's at 49 hours now with zero errors, so memory is pretty much ruled
> out.
>
> As far as power, I would *think* I have enough power. The power
> supply is a 500w Thermaltake TR2. It's powering an Asrock z77 mobo
> with an i5-3570k, and the only card on it is a dinky little 2 port
> sata card my OS drive is on (the RAID components are plugged into the
> mobo). Eight 7200 drives and an SSD. Tell me if this sounds
> insufficient.
>
> Phil, when you say "what you are experiencing", what do you mean
> specifically? The dmesg errors and drives falling off? Or did you
> mean the beeping noises (since that's the part you trimmed)?
Drives dropping out when they shouldn't, and smartctl says "PASSED".
This is *unavoidable* when you have mismatched device and driver timeouts.
> Here is the data you requested
>
> 1) mdadm -E /dev/sd[a-f] http://pastie.org/8040826
/dev/sdd and /dev/sde have old event counts ...
> 2) mdadm -D /dev/md0 http://pastie.org/8040828
... matching the array report ...
> 3)
> smartctl -x /dev/sda http://pastie.org/8040847
Ok, but no error recovery support (typical of green drives).
> smartctl -x /dev/sdb http://pastie.org/8040848
Ok, green again. No ERC.
> smartctl -x /dev/sdc http://pastie.org/8040850
Ok, with ERC support, but disabled. Not a green drive.
> smartctl -x /dev/sdd http://pastie.org/8040851
Not Ok. A few relocations, a couple pending errors. ERC support
present but disabled.
> smartctl -x /dev/sde http://pastie.org/8040852
Not Ok. No relocations, but several pending errors. No ERC.
> smartctl -x /dev/sdf http://pastie.org/8040853
Ok, but no ERC.
> 4) cat /proc/mdstat http://pastie.org/8040859
>
> 5) for x in /sys/block/sd*/device/timeout ; do echo $x $(< $x) ; done
> http://pastie.org/8040870
All timeouts are still the default 30 seconds. Without enabled ERC
support, these values must be two to three minutes. I recommend 180
seconds. Your array *will not* complete a rebuild without dealing with
this problem.
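A minimal sketch for spotting drives that are still on a short driver timeout. It assumes the sysfs layout used in item 5 above. The function name short_timeouts and the SYSFS_BLOCK override are illustrative, not part of any tool. A drive listed here is only a problem if ERC is not enabled on it.

```shell
#!/bin/sh
# List drives whose kernel command timeout is below a minimum (default 180 s).
# SYSFS_BLOCK is overridable only so the function can be tried on a scratch
# directory; on a live system it is /sys/block.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

short_timeouts() {
    min=${1:-180}
    for f in "$SYSFS_BLOCK"/sd*/device/timeout; do
        [ -r "$f" ] || continue          # glob matched nothing, or unreadable
        secs=$(cat "$f")
        dev=${f#"$SYSFS_BLOCK"/}         # strip the leading directory
        dev=${dev%%/*}                   # keep only the drive name, e.g. sda
        if [ "$secs" -lt "$min" ]; then
            echo "$dev $secs"
        fi
    done
}
```

Calling short_timeouts with no argument prints one "name seconds" pair per drive that is below 180 seconds.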
> 6) dmesg | grep -e sd -e md http://pastie.org/8040871
> (note that I have rebooted since the last dmesg link I posted (where
> two drives failed) because I was running memtest, if I should do dmesg
> differently, let me know)
>
> 7) cat /etc/mdadm.conf http://pastie.org/8040876
I generally simplify the ARRAY line to just the device and the UUID, but
it is ok as is.
> Adam, I wouldn't be opposed to spending the money on a good sata card,
> but I'd like to get opinions from a few people first. Any suggestions
> on a good one for mdadm specifically?
No need. Just fix your timeouts. For the two devices that support ERC,
you need to turn it on:
    smartctl -l scterc,70,70 /dev/sdc
    smartctl -l scterc,70,70 /dev/sdd
For the others, you need long timeouts in the linux driver:
    for x in /sys/block/sd[abef]/device/timeout ; do echo 180 >$x ; done
This must be done now, and at every power cycle or reboot. rc.local or
similar distro config is the appropriate place. (Enterprise drives
power up with ERC enabled. As do raid-rated consumer drives like WD Red.)
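One way to make both settings stick across reboots is a small function called from rc.local, sketched below. The name apply_timeouts and the SMARTCTL and SYSFS_BLOCK overrides are illustrative assumptions. The drive lists are passed in explicitly, as in the commands above, because drive letters can change between boots. smartctl's exit status is treated as advisory only.

```shell
#!/bin/sh
# Boot-time sketch, e.g. sourced and called from rc.local. SMARTCTL and
# SYSFS_BLOCK are overridable only to allow a dry run against stand-ins.
SMARTCTL="${SMARTCTL:-smartctl}"
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

# apply_timeouts "ERC_DRIVES" "OTHER_DRIVES"
#   ERC_DRIVES:   names of drives that support ERC (enable a 7.0 s limit)
#   OTHER_DRIVES: names of drives without ERC (raise driver timeout to 180 s)
apply_timeouts() {
    for d in $1; do
        # The warning is advisory; check the drive by hand if it appears.
        "$SMARTCTL" -l scterc,70,70 "/dev/$d" >/dev/null ||
            echo "warning: could not enable ERC on $d" >&2
    done
    for d in $2; do
        echo 180 > "$SYSFS_BLOCK/$d/device/timeout"
    done
}

# With the drive letters from this thread:
#   apply_timeouts "sdc sdd" "sda sdb sde sdf"
```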
Then stop and re-assemble your array. Use --force to reintegrate your
problem drives. Fortunately, this is a raid6--with compatible timeouts,
your rebuild will succeed. A URE on /dev/sdd would have to fall in the
same place as a URE on /dev/sde to kill it.
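The stop and forced re-assemble step could look roughly like the sketch below. The wrapper name reassemble and the RUN switch are illustrative; by default it only prints the mdadm commands so they can be reviewed first. The array and member names are taken from the mdadm -D and mdadm -E invocations earlier in this message.

```shell
#!/bin/sh
# Print (default) or run the stop / forced-assemble sequence.
# Set RUN=1 to execute; otherwise the commands are only echoed for review.
reassemble() {
    md=$1; shift
    for cmd in "mdadm --stop $md" "mdadm --assemble --force $md $*"; do
        if [ "${RUN:-0}" = 1 ]; then
            $cmd || return 1     # stop at the first failing command
        else
            echo "$cmd"
        fi
    done
}

# Array and member names as used in this thread:
#   reassemble /dev/md0 /dev/sd[a-f]
```

Progress of the resulting rebuild can then be followed in /proc/mdstat.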
Upon completion, the UREs will either be fixed or relocated. If any
drive's relocations reach double digits, I'd replace it.
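A rough way to apply the double-digit rule of thumb is to read the relocation count out of smartctl -A. The sketch assumes the usual attribute table layout, with the attribute named Reallocated_Sector_Ct and the raw value in the tenth column; some drives report it under a different name, in which case the count comes back empty. The helper names are illustrative.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value of
# Reallocated_Sector_Ct (attribute 5), assumed to be the tenth column.
realloc_count() {
    awk '$2 == "Reallocated_Sector_Ct" { print $10 }'
}

# needs_replacing COUNT: succeed when the count has reached double digits.
needs_replacing() {
    [ "$1" -ge 10 ]
}

# Example on a live system:
#   n=$(smartctl -A /dev/sdd | realloc_count)
#   needs_replacing "$n" && echo "sdd: $n relocations, consider replacing"
```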
Finally, after your array is recovered, set up a cron job that'll
trigger a "check" scrub of your array on a regular basis. I use a
weekly scrub. The scrub keeps UREs that develop on idle parts of your
array from accumulating. Note, the scrub itself will crash your array
if your timeouts are mismatched and any UREs are lurking.
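A "check" scrub is started by writing to the array's sync_action file in sysfs. The sketch below wraps that in a function and shows one possible cron entry in a comment. The function name start_scrub, the SYSFS_BLOCK override, and the Sunday 03:00 schedule are illustrative assumptions; the array name md0 comes from this thread.

```shell
#!/bin/sh
# Start a "check" scrub by writing to the array's md sync_action file.
# SYSFS_BLOCK is overridable only for a dry run on a scratch directory.
SYSFS_BLOCK="${SYSFS_BLOCK:-/sys/block}"

start_scrub() {
    md=${1:-md0}
    echo check > "$SYSFS_BLOCK/$md/md/sync_action"
}

# Possible /etc/cron.d entry for a weekly run (Sunday 03:00, as root):
#   0 3 * * 0  root  echo check > /sys/block/md0/md/sync_action
```

Some distributions already ship a periodic scrub job with their mdadm package, so it is worth checking for one before adding another.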
I'll let you browse the archives for a more detailed explanation of
*why* this happens.
Phil