* RAID 5 : recovery after failure
From: Guillaume Betous @ 2013-10-08 19:19 UTC
To: linux-raid
Hi,
My RAID 5 has failed... After a first failure, the spare disk had
started its rebuild. During the rebuild process (at about 60%), I received a
2nd failure email :(
The RAID became laggy and finally unusable.
Now I can't recover the RAID array. Even though there is no particularly
precious data on it, I'm trying to recover it, if only to learn a little bit
:)
I've tried the procedures written in the wiki, and before trying the
last one (re-create), I'm writing this mail first, as the wiki advises :)
Trying to assemble with --force fails with the following message:
mdadm: /dev/sdf1 has no superblock - assembly aborted
Removing sdf1 from the RAID array results in the same error on sdd1, and so on...
You'll find some command output below:
mdadm --examine : http://pastie.org/8385891
timeouts : http://pastie.org/8385901
smartctl -x : http://pastebin.com/BXMHADZD
Thanks all for your help :)
Regards,
gUI
--
For the health of your computer, prefer free software.
Read your mail: http://www.mozilla-europe.org/fr/products/thunderbird/
Browse the web: http://www.mozilla-europe.org/fr/products/firefox/
Office suite: http://www.libreoffice.org/download/
* Re: RAID 5 : recovery after failure
From: Robin Hill @ 2013-10-08 21:06 UTC
To: Guillaume Betous; +Cc: linux-raid
On Tue Oct 08, 2013 at 09:19:44PM +0200, Guillaume Betous wrote:
> Hi,
>
> My RAID 5 has failed... After a first failure, the spare disk had
> started its rebuild. During the rebuild process (at about 60%), I received a
> 2nd failure email :(
>
> The RAID became laggy and finally unusable.
>
> Now I can't recover the RAID array. Even though there is no particularly
> precious data on it, I'm trying to recover it, if only to learn a little bit
> :)
>
> I've tried the procedures written in the wiki, and before trying the
> last one (re-create), I'm writing this mail first, as the wiki advises :)
>
> Trying to assemble with --force fails with the following message:
> mdadm: /dev/sdf1 has no superblock - assembly aborted
>
> Removing sdf1 from the RAID array results in the same error on sdd1, and so on...
>
> You'll find some command output below:
>
> mdadm --examine : http://pastie.org/8385891
> timeouts : http://pastie.org/8385901
> smartctl -x : http://pastebin.com/BXMHADZD
>
Looks like you've got some timeout mismatches, which is probably causing
some of the issues. Two of the drives are WD Reds, which have SCT ERC
enabled by default at 7 seconds (which is good). There's also a Seagate
which supports ERC but doesn't have it enabled (you'll need to set that at
each boot). Then there's a Seagate and a WD Green which don't support ERC at
all, so you'll need to set the kernel driver timeouts to 180+ seconds at
each boot for those.
I've no idea which disk is which though - I'd guess the smartctl output
is in order, but there's nothing to actually say which output
corresponds with which device.
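For reference, checking and setting ERC looks something like this (sdX is just
a placeholder - substitute the actual device, and note the scterc values are
in tenths of a second):
  smartctl -l scterc /dev/sdX                # show the current ERC setting
  smartctl -l scterc,70,70 /dev/sdX          # enable ERC at 7 seconds read/write
  echo 180 > /sys/block/sdX/device/timeout   # kernel timeout, for drives with no ERC
Neither setting survives a reboot, so they need reapplying at each boot
(rc.local or similar).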
The first WD Red (sda?) is reporting a couple of read errors - those are
the only obvious SMART errors though. There are a couple of command
timeouts in the SMART attributes for the non-green Seagate (sdb?) as
well, which might also be relevant (I'm not familiar with that attribute
though, so I'm not entirely sure).
As far as the array goes, it looks like you _should_ be able to force
assembly with sdc1, sde1 & sdf1. They all have array positions listed,
whereas the other two are just listed as spares. If that fails, retry
with --verbose and post the resulting error messages (and the
corresponding section of dmesg output).
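Roughly, that forced assembly would be something like this (md127 is just a
guess at the array name here - use whichever md device yours is, and list the
members explicitly):
  mdadm --stop /dev/md127
  mdadm --assemble --force --verbose /dev/md127 /dev/sdc1 /dev/sde1 /dev/sdf1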
Make sure you set the ERC/timeouts before attempting to re-add either of
the other disks though.
HTH,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |
* Re: RAID 5 : recovery after failure
From: Guillaume Betous @ 2013-10-09 6:22 UTC
To: Guillaume Betous, linux-raid
> As far as the array goes, it looks like you _should_ be able to force
> assembly with sdc1, sde1 & sdf1.
No luck :(
Still the same superblock error.
mdadm: looking for devices for /dev/md127
mdadm: cannot open device /dev/sdf1: Device or resource busy
mdadm: /dev/sdf1 has no superblock - assembly aborted
I don't know where this "busy" message comes from: I have nothing mounted
and no RAID service started.
> Make sure you set the ERC/timeouts before attempting to re-add either of
> the other disks though.
Sure, I will :)
gUI
* Re: RAID 5 : recovery after failure
From: Mikael Abrahamsson @ 2013-10-09 6:54 UTC
To: Guillaume Betous; +Cc: linux-raid
On Wed, 9 Oct 2013, Guillaume Betous wrote:
> I don't know where this "busy" message comes from : I have no mount nor
> RAID service started.
"cat /proc/mdstat" says no md volume is in any state at all? This is
usually the case when these busy messages are shown.
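In that case something like this usually clears it (md127 taken from your
earlier assembly output):
  cat /proc/mdstat
  mdadm --stop /dev/md127
and then retry the forced assembly.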
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: RAID 5 : recovery after failure
From: Guillaume Betous @ 2013-10-09 8:20 UTC
To: Mikael Abrahamsson; +Cc: linux-raid
Damn, you're right!
# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : inactive sde1[2](S) sdd1[6](S) sdb1[5](S) sdf1[4](S)
7814050144 blocks super 1.2
unused devices: <none>
=> I stopped my array (mdadm --stop), then restarted it:
sam / # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : active raid5 sdc1[0] sdf1[4] sde1[2]
5860535808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [U_UU]
unused devices: <none>
It works!!!
Now, should I add the other drives? How?
Thanks a lot :)
gUI
2013/10/9 Mikael Abrahamsson <swmike@swm.pp.se>:
> On Wed, 9 Oct 2013, Guillaume Betous wrote:
>
>> I don't know where this "busy" message comes from : I have no mount nor
>> RAID service started.
>
>
> "cat /proc/mdstat" says no md volume is in any state at all? This is usually
> the case when these busy messages are shown.
>
> --
> Mikael Abrahamsson email: swmike@swm.pp.se
* Re: RAID 5 : recovery after failure
From: Mikael Abrahamsson @ 2013-10-09 8:28 UTC
To: Guillaume Betous; +Cc: linux-raid
On Wed, 9 Oct 2013, Guillaume Betous wrote:
> Now, should I add the other drives ? How ?
Depends on what the problem is.
Is the initial drive that failed now totally unusable?
In that case, just do mdadm --manage /dev/md0 --add /dev/sd<whatever> with
a new drive.
But you said you received a second error - was this a read error on one of
the otherwise-good working drives (see other threads in the archive)? If the
remaining drives no longer give any read errors (the information on those bad
sectors will now be lost), then you can resync properly.
I strongly recommend going to RAID6 to solve this problem in the future.
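If you do go that way later, the conversion is roughly as follows (assuming a
4-disk RAID5 going to a 5-disk RAID6, that you've first added the extra disk
with --add, and that the array is md127; the backup file path is only an
example):
  mdadm --grow /dev/md127 --level=6 --raid-devices=5 --backup-file=/root/md127-grow.backup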
I recommend having
for x in /sys/block/sd[a-z] ; do echo 180 > $x/device/timeout ; done
in rc.local (or equivalent) to make sure you can handle timeouts properly
even for consumer drives.
--
Mikael Abrahamsson email: swmike@swm.pp.se
* Re: RAID 5 : recovery after failure
From: Guillaume Betous @ 2013-10-09 8:54 UTC
To: Mikael Abrahamsson; +Cc: linux-raid
Here is the 2nd message I received.
For now my RAID has restarted with only sdc, sde and sdf.
I don't know if /dev/sdb is still usable, or if it just fell out of sync.
How can I tell?
------------------------------------------------------------------------------------------
This is an automatically generated mail message from mdadm
running on sam
A FailSpare event had been detected on md device /dev/md127.
It could be related to component device /dev/sdb1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md127 : active raid5 sde1[2] sdb1[5](F) sdc1[0](F) sdd1[6] sdf1[4]
5860535808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [__UU]
[============>........] recovery = 62.0%
(1212358192/1953511936) finish=854.7min speed=14451K/sec
2013/10/9 Mikael Abrahamsson <swmike@swm.pp.se>:
> On Wed, 9 Oct 2013, Guillaume Betous wrote:
>
>> Now, should I add the other drives ? How ?
>
>
> Depends on what the problem is.
>
> Is the initial drive that failed now totally unusable?
>
> In that case, just do mdadm --manage /dev/md0 --add /dev/sd<whatever> with a
> new drive.
>
> But you said you received a second error - was this a read error on one of
> the otherwise-good working drives (see other threads in the archive)? If the
> remaining drives no longer give any read errors (the information on those bad
> sectors will now be lost), then you can resync properly.
>
> I strongly recommend going to RAID6 to solve this problem in the future.
>
> I recommend having
>
> for x in /sys/block/sd[a-z] ; do echo 180 > $x/device/timeout ; done
>
> in rc.local (or equivalent) to make sure you can handle timeouts properly
> even for consumer drives.
>
>
> --
> Mikael Abrahamsson email: swmike@swm.pp.se
* Re: RAID 5 : recovery after failure
From: Robin Hill @ 2013-10-09 9:14 UTC
To: Guillaume Betous; +Cc: Mikael Abrahamsson, linux-raid
On Wed Oct 09, 2013 at 10:54:09AM +0200, Guillaume Betous wrote:
> I don't know if /dev/sdb is still usable, or if it just fell out of sync.
> How can I tell?
>
As sdb1 has already been marked as a spare, it'll need rebuilding anyway, so
it doesn't really matter. If there's a real issue with it then it'll
fail during the recovery process. You can do a full read test on
it beforehand (either a long SMART test, a simple dd from it, or a read-only
badblocks test) if you want to check for issues though.
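e.g. any of these (assuming sdb is still the same device name):
  smartctl -t long /dev/sdb          # then check the result with smartctl -a once it finishes
  dd if=/dev/sdb of=/dev/null bs=1M
  badblocks -sv /dev/sdb             # read-only by default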
> P.S. The /proc/mdstat file currently contains the following:
>
> Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md127 : active raid5 sde1[2] sdb1[5](F) sdc1[0](F) sdd1[6] sdf1[4]
> 5860535808 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [__UU]
> [============>........] recovery = 62.0%
> (1212358192/1953511936) finish=854.7min speed=14451K/sec
>
It looks like recovery was kicked off onto sde1, but sdc has failed
again during the rebuild. This would suggest a read error on sdc1
somewhere - dmesg should show some indication of what's happened.
You'll need to stop the array and sort out sdc before you can get it
going again. Use GNU ddrescue to image it onto another disk (preferably
one that wasn't originally a member of the array) - it may be able to
get all the data read (it tries somewhat harder than normal processes),
or you'll at least see how much is unreadable.
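Something along these lines (sdX being the disk you're imaging onto, and the
log file path just an example - keep it somewhere safe so an interrupted run
can resume where it left off):
  ddrescue -f /dev/sdc /dev/sdX /root/sdc-rescue.log
  ddrescue -f -r3 /dev/sdc /dev/sdX /root/sdc-rescue.log   # optional extra retries over the bad areas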
If it's all read okay then you can just re-run the force assembly using
that disk instead of sdc (make sure you explicitly list the devices to
use in the assembly command). Then add one of the other disks and wait
for the rebuild to complete (there may be no real issue with sdc - you
do sometimes get read errors on disks which are solved by simply
rewriting the data).
If not, then you have to decide whether there are few enough
unreadable blocks to continue with assembly (as above) and possibly end
up with some corrupt files, or whether you want to risk re-creating the
array using the other original member (I'd suggest doing a full read
test on that disk first though, as it may be in the same state).
If you're wanting to do a re-create then we'll need to revisit your
original array details to see which parameters would be needed (and which
mdadm version you'll need to get the correct data offsets).
Once everything's back up and running, you really need to:
- make sure the timeouts/ERC are set correctly at every boot
- schedule array checks on a regular basis to pick up any read errors
while they can still be corrected
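For the regular checks, one option is a periodic cron job that triggers md's
own scrub (md127 and the schedule are just examples - many distros already
ship a checkarray script that does this):
  echo check > /sys/block/md127/md/sync_action
and keep an eye on /proc/mdstat and mismatch_cnt while it runs.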
Cheers,
Robin
--
___
( ' } | Robin Hill <robin@robinhill.me.uk> |
/ / ) | Little Jim says .... |
// !! | "He fallen in de water !!" |