Re: recovering failed raid5

From: Andreas Klauer <Andreas.Klauer@metamorpher.de>
To: Alexander Shenkin <al@shenkin.org>
Cc: linux-raid@vger.kernel.org
Subject: Re: recovering failed raid5
Date: Thu, 27 Oct 2016 18:04:00 +0200
Message-ID: <20161027160400.GA21042@metamorpher.de>
In-Reply-To: <CAM97BgQLPUN=t7VKHaVG-=SrJmS_tvaxGDbo3yqMwHm8B-do_Q@mail.gmail.com>

On Thu, Oct 27, 2016 at 04:06:14PM +0100, Alexander Shenkin wrote:
> md2: raid5 mounted on /, via sd[abcd]3

Two failed disks...

> md0: raid1 mounted on /boot, via sd[abcd]1

Actually only two disks active in that one, the other two are spares.
It hardly matters for /boot, but you could grow it to a 4 disk raid1.
Spares are not useful.
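
A rough sketch of that grow, assuming the /boot mirror is /dev/md0 as in the mdstat output further down. The run wrapper and the DRY_RUN variable are made up for illustration; by default the command is only printed.

```sh
#!/bin/sh
# Print commands by default; set DRY_RUN=0 to really run them (as root).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Turn the two spares of the /boot mirror into active members.
grow_boot_mirror() {
    run mdadm --grow /dev/md0 --raid-devices=4
}

grow_boot_mirror
```

Once the resync has finished, /proc/mdstat should show four active members for md0 instead of two plus spares.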

> My sdb was recently reporting problems.  Instead of second guessing
> those problems, I just got a new disk, replaced it, and added it to
> the arrays.

Replacing right away is the right thing to do.
Unfortunately, it seems you have another disk that is broken, too.

> 2) smartctl (disabled on drives - can enable once back up.  should I?)
> note: SMART only enabled after problems started cropping up.

But... why? Why disable SMART? And if you do, is it a surprise that you
only notice disk failures when it's already too late?

You should enable SMART, and not only that: also run regular selftests,
have smartd running, and have it send you mail when something happens.
Run regular RAID checks as well. They are at least something, but they
won't tell you how many reallocated sectors your drive has.
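
A minimal sketch of the SMART part, assuming the four disks are still named sda to sdd. The function name is made up; it only prints the smartctl commands so they can be reviewed first.

```sh
#!/bin/sh
# Print the smartctl commands for one disk; pipe the output to "sh" (as root) to run them.
smart_setup_cmds() {
    dev=$1
    echo "smartctl --smart=on $dev"       # enable SMART on the drive
    echo "smartctl --test=long $dev"      # start an extended (full surface) selftest
    echo "smartctl --log=selftest $dev"   # read the result once the test has finished
}

for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    smart_setup_cmds "$d"
done
```

For smartd, a smartd.conf line along the lines of `DEVICESCAN -a -m root -s (S/../.././02|L/../../6/03)` would schedule a short test every night and a long test on Saturdays and mail root on problems. The schedule follows the pattern shown in the smartd.conf man page; the mail address is an assumption.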

> root@machinename:/home/username# smartctl --xall /dev/sda

Looks fine, but it has never run a selftest.

> root@machinename:/home/username# smartctl --xall /dev/sdb

Looks new. (New drives need selftests too.)

> root@machinename:/home/username# smartctl --xall /dev/sdc
> smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.19.0-39-generic] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.14 (AF)
> Device Model:     ST3000DM001-1CH166
> Serial Number:    W1F1N909
>
> 197 Current_Pending_Sector  -O--C-   100   100   000    -    8
> 198 Offline_Uncorrectable   ----C-   100   100   000    -    8

This one is faulty and probably the reason why your resync failed.
You have no redundancy left, so an option here would be to get a 
new drive and ddrescue it over.

That's exactly the kind of thing you should be notified about instantly
via mail. And it should be discovered when running selftests.
Without a full surface scan of the media, the disk itself won't know.
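
Until smartd is set up, something like the following could pull the two relevant counters out of the smartctl output. It is only a sketch: the function name is made up, and it assumes the raw value is the last column, as in the lines quoted above.

```sh
#!/bin/sh
# Read "smartctl --xall" (or -A) output on stdin and print attributes
# 197 and 198 together with their raw value (the last column).
bad_sector_counts() {
    awk '$1 == 197 || $1 == 198 { print $2 "=" $NF }'
}

# Sample input: the two attribute lines quoted above for /dev/sdc.
bad_sector_counts <<'EOF'
197 Current_Pending_Sector  -O--C-   100   100   000    -    8
198 Offline_Uncorrectable   ----C-   100   100   000    -    8
EOF
```

On a live system the input would come from `smartctl --xall /dev/sdc | bad_sector_counts`. Anything other than 0 deserves a closer look.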

> ==> WARNING: A firmware update for this drive may be available,
> see the following Seagate web pages:
> http://knowledge.seagate.com/articles/en_US/FAQ/207931en
> http://knowledge.seagate.com/articles/en_US/FAQ/223651en

About this, *shrug*
I don't have these drives, you might want to check that out.
But it probably won't fix bad sectors.

> root@machinename:/home/username# smartctl --xall /dev/sdd

Some strange things in the error log here, but old.
Still, same as for all others - selftest.

> ################### mdadm --examine ###########################
> 
> /dev/sda1:
>      Raid Level : raid1
>    Raid Devices : 2

A RAID 1 with two drives, could be four.

> /dev/sdb1:
> /dev/sdc1:

So these would also have data instead of being spare.

> /dev/sda3:
>      Raid Level : raid5
>    Raid Devices : 4
> 
>     Update Time : Mon Oct 24 09:02:52 2016
>          Events : 53547
> 
>    Device Role : Active device 0
>    Array State : A..A ('A' == active, '.' == missing)

RAID-5 with two failed disks.

> /dev/sdc3:
>      Raid Level : raid5
>    Raid Devices : 4
> 
>     Update Time : Mon Oct 24 08:53:57 2016
>          Events : 53539
> 
>    Device Role : Active device 2
>    Array State : AAAA ('A' == active, '.' == missing)

This one failed, 8:53.
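
The event counts are worth comparing side by side: sdc3 is only 8 events behind sda3, which is why a forced assemble has a chance at all. A small sketch for listing them, with a made-up function name and the values from the output above as sample input:

```sh
#!/bin/sh
# Read "mdadm --examine" output on stdin; print one "device events" pair per member.
member_events() {
    awk '/^\/dev\// { sub(/:$/, "", $1); dev = $1 }
         $1 == "Events" { print dev, $3 }'
}

# Sample input, reduced to the relevant lines of the output above.
member_events <<'EOF'
/dev/sda3:
         Events : 53547
/dev/sdc3:
         Events : 53539
EOF
```

On the real system the input would be `mdadm --examine /dev/sd[acd]3 | member_events`.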

> ############ /proc/mdstat ############################################
> 
> md2 : active raid5 sda3[0] sdc3[2](F) sdd3[3]
>       8760565248 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2]
> [U__U]

[U__U] refers to device roles as in [0123],
so device roles 0 and 3 are okay, 1 and 2 are missing.
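
Spelled out as a small sketch (the function name is made up), the status string maps to roles like this:

```sh
#!/bin/sh
# Decode an mdstat status string such as "U__U": position N is device role N.
decode_status() {
    s=$1
    role=0
    while [ -n "$s" ]; do
        rest=${s#?}          # the string without its first character
        c=${s%"$rest"}       # the first character
        case $c in
            U) echo "role $role: up" ;;
            _) echo "role $role: missing" ;;
        esac
        role=$((role + 1))
        s=$rest
    done
}

decode_status "U__U"
```

The role numbers match the "Device Role" lines in the mdadm --examine output, not the drive letters.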

> md0 : active raid1 sdb1[4](S) sdc1[2](S) sda1[0] sdd1[3]
>       1950656 blocks super 1.2 [2/2] [UU]

Those two spares again, could be [UUUU] instead.

tl;dr
stop it all,
ddrescue /dev/sdc to your new disk,
try your luck with --assemble --force (not using /dev/sdc!),
get yet another new disk, add, sync, cross fingers.
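
The first three steps might look roughly like the sketch below. Since md2 holds /, this presumably has to happen from a rescue or live system. NEW is a placeholder for whichever disk receives the copy of /dev/sdc, and sdc.map is an arbitrary name for the ddrescue map file. The script only prints the plan; nothing is executed.

```sh
#!/bin/sh
# Prints the plan only. NEW is a placeholder for the disk that
# receives the copy of /dev/sdc; sdc.map is the ddrescue map file.
NEW=${NEW:-/dev/sdX}

recovery_plan() {
    cat <<EOF
mdadm --stop /dev/md0
mdadm --stop /dev/md2
ddrescue -f -n /dev/sdc $NEW sdc.map
ddrescue -f -d -r3 /dev/sdc $NEW sdc.map
mdadm --assemble --force /dev/md2 /dev/sda3 ${NEW}3 /dev/sdd3
EOF
}

recovery_plan
```

The first ddrescue pass copies everything that reads easily, the second retries the bad areas with direct access. The assemble uses the copy in place of the failing disk.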

There's also mdadm --replace, as an alternative to --remove followed
by --add. It sometimes helps when there are only a few bad sectors
on each disk. If the disk you removed had not already been kicked
from the array by the time you replaced it, --replace might have
avoided this problem.
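
For next time, a sketch of the --replace route. The device names are placeholders and the function name is made up; it needs an mdadm and kernel recent enough to support --replace. Again the commands are only printed.

```sh
#!/bin/sh
# Print the commands for an in-place replacement of one array member.
replace_cmds() {
    md=$1
    old=$2
    new=$3
    echo "mdadm $md --add $new"                   # the new member joins as a spare
    echo "mdadm $md --replace $old --with $new"   # data is copied while the old one stays active
}

replace_cmds /dev/md2 /dev/sdX3 /dev/sdY3
```

The old member is only marked faulty once the copy has completed, so the array keeps its redundancy during the rebuild.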

But good disk monitoring and testing is even more important.

Regards
Andreas Klauer
