Help with data recovery - RAID6 with 2 failed drives and another with broken sectors

* Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Michał Sawicz @ 2013-09-30 23:23 UTC
  To: linux-raid

Hey all,

So the nightmare came for me - I have a 7x2TB RAID6 setup, and one
of the drives started showing uncorrectable sectors a few days ago, but
I hadn't yet had time to address that. I had two-disk redundancy, after
all...

Soon thereafter the cables / controller spewed a slew of errors and the
array was stopped. A --force --assemble later it was back up, rebuilding
onto 2 spares - leaving me with no redundancy. If only the drive with bad
sectors were one of those two, everything would be fine. Unfortunately
that's not the case, so I'm now left with an array with read errors, and
the rebuild fails when it hits them.

What I'd like to do first is to make sure the array rebuilds onto the 6
healthy drives, regardless of the bad blocks. I can probably recover the
data (assuming I can find out which files were affected - any
pointers?), but if the array doesn't rebuild correctly, I'm afraid it's
gonna get worse, and soon.

I could probably use the data from the two spares to correct the few
broken blocks, but it's probably not worth it - I'd rather have a
working array with a few bad files than fight with an unprotected array.

Please find some details about my array below, and let me know if I can
supply more.

As a side note... I have a full array scrub scheduled to run every now
and again - and it did run after the disk started failing blocks, but
the bad sectors never got reallocated, they all remain pending /
uncorrectable. Is that expected?
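
For reference, a minimal sketch of triggering such a scrub by hand through
the md sysfs interface. The array name md126 matches the mdstat output in
this thread; the function name is made up, and SYSFS_ROOT is a hypothetical
override added only so the function can be tried against a scratch
directory. During a check, md is expected to rewrite sectors that fail to
read using data reconstructed from the other members, which normally gives
the drive a chance to reallocate them.

```shell
# Start a consistency check ("scrub") on an md array and report its state.
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
start_scrub() {
    md_dir="${SYSFS_ROOT:-/sys}/block/$1/md"
    # Writing "check" asks md to read every stripe and verify parity.
    echo check > "$md_dir/sync_action"
    printf '%s: sync_action=%s mismatch_cnt=%s\n' "$1" \
        "$(cat "$md_dir/sync_action")" "$(cat "$md_dir/mismatch_cnt")"
}

# Usage (as root): start_scrub md126
```

Progress of a running check shows up in /proc/mdstat, and mismatch_cnt
reflects the result once the check has finished.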

> # mdadm --examine /dev/sda
> /dev/sda:
>           Magic : a92b4efc
>         Version : 1.2
>     Feature Map : 0x0
>      Array UUID : ff9e032c:446ed0bd:fc9473f3:f8e090ed
>            Name : media:store  (local to host media)
>   Creation Time : Tue Sep 13 21:36:43 2011
>      Raid Level : raid6
>    Raid Devices : 7
> 
>  Avail Dev Size : 3907027120 (1863.02 GiB 2000.40 GB)
>      Array Size : 19535119360 (9315.07 GiB 10001.98 GB)
>   Used Dev Size : 3907023872 (1863.01 GiB 2000.40 GB)
>     Data Offset : 2048 sectors
>    Super Offset : 8 sectors
>           State : clean
>     Device UUID : 3b40e74a:c9b652ce:6810bdcd:d2648b69
> 
>     Update Time : Tue Oct  1 01:06:35 2013
>        Checksum : a0ddd145 - correct
>          Events : 753179
> 
>          Layout : left-symmetric
>      Chunk Size : 512K
> 
>    Device Role : Active device 3
>    Array State : AAAAAAA ('A' == active, '.' == missing)

> # cat /proc/mdstat 
> Personalities : [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
> md126 : active raid6 sdg[9] sdb[8] sdh1[6] sdc[7] sdf1[5] sda[10] sdi[11]
>       9767559680 blocks super 1.2 level 6, 512k chunk, algorithm 2 [7/5] [UU_UUU_]
>       [========>............]  recovery = 44.4% (867604356/1953511936) finish=605.9min speed=29866K/sec
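
The status line above can be read as follows: [7/5] means 7 slots of which
5 are in sync, and each "_" in [UU_UUU_] marks a slot without a working
member. A small sketch that extracts the missing slot numbers (0-based)
from such a string; the function name is made up for illustration.

```shell
# Print the 0-based slot numbers shown as "_" in an mdstat status
# string such as "[UU_UUU_]", one per line.
missing_slots() {
    status=${1#\[}          # strip the leading "["
    status=${status%\]}     # strip the trailing "]"
    i=0
    while [ -n "$status" ]; do
        rest=${status#?}            # everything after the first character
        c=${status%"$rest"}         # the first character itself
        if [ "$c" = "_" ]; then
            printf '%d\n' "$i"
        fi
        status=$rest
        i=$((i + 1))
    done
}

missing_slots '[UU_UUU_]'   # prints 2 and 6, one per line
```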

Thanks and best regards,
-- 
Michał (Saviq) Sawicz <michal@sawicz.net>

* Re: Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Michał Sawicz @ 2013-10-01 19:24 UTC
  To: linux-raid

On 01.10.2013 01:23, Michał Sawicz wrote:
> So the nightmare came for me - I've a 7x2TB setup under RAID6, and one
> of the drives started showing uncorrectable sectors a few days ago, but
> I didn't yet have time to address that. I had two-disk redundancy, after
> all...
> 
> Soon thereafter the cables / controller spew a slew of errors and the
> array was stopped. A --force --assemble later it was back up, rebuilding
> onto 2 spares - I was left with no redundancy. If only the bad sectors
> drive was one of those two, everything would be fine. Unfortunately
> that's not the case, so I'm now left with an array with read errors. So
> it fails during rebuild due to those.
> 
> What I'd like to do first is to make sure the array rebuilds onto the 6
> healthy drives, regardless of the bad blocks, I can probably recover the
> data (assuming I can find out which files were affected - any
> pointers?), but if the array doesn't rebuild correctly, I'm afraid it's
> gonna get worse, and soon.

OK, so a ddrescue and --zero-superblock later my array is rebuilding
onto one healthy spare. According to ddrescue I only lost some 8 kB of
data in more or less one chunk, so once the array is rebuilt my next
task will be finding which file(s) were affected.
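
The amount of lost data quoted above can be read straight out of the
ddrescue mapfile. A rough sketch, assuming the GNU ddrescue mapfile layout
(data lines of the form "pos size status", with "-" marking regions that
could not be read); the function name and the mapfile name are
hypothetical. The printed offsets are byte positions on the copied member
device, not yet offsets into the array or the filesystem, so mapping them
to files still needs the RAID geometry shown by --examine (data offset,
chunk size, layout).

```shell
# List the unreadable regions recorded in a GNU ddrescue mapfile and
# print the total number of bytes that could not be recovered.
bad_regions() {
    total=0
    while read -r pos size status rest; do
        # Only data lines whose third field is "-" describe bad sectors;
        # comment lines and the current-status line never match this.
        [ "$status" = "-" ] || continue
        printf 'bad region: offset %d, %d bytes\n' "$(($pos))" "$(($size))"
        total=$((total + $size))
    done < "$1"
    printf 'total lost: %d bytes\n' "$total"
}

# Usage: bad_regions rescue.map
```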

-- 
Michał (Saviq) Sawicz <michal@sawicz.net>


* Re: Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Phil Turmel @ 2013-10-06 21:44 UTC
  To: Michał Sawicz; +Cc: linux-raid

Hi Michał,

On 10/01/2013 03:24 PM, Michał Sawicz wrote:
> On 01.10.2013 01:23, Michał Sawicz wrote:

[trim /]

>> What I'd like to do first is to make sure the array rebuilds onto the 6
>> healthy drives, regardless of the bad blocks, I can probably recover the
>> data (assuming I can find out which files were affected - any
>> pointers?), but if the array doesn't rebuild correctly, I'm afraid it's
>> gonna get worse, and soon.
> 
> OK, so a ddrescue and --zero-superblock later my array is rebuilding
> onto one healthy spare. According to ddrescue I only lost some 8kB of
> data in more or less one chunk, so after the array is rebuilt my next
> task will be finding which file(s) that was.

I noticed that you never got any direct response, and I realized you
might still be at risk.  In particular, your OP said:

> As a side note... I've a full array scrub enabled on the array every now
> and again - and it did run after the disk started failing blocks, but
> they never got reallocated, they all remain pending / uncorrectable. Is
> that expected?

The answer is *NO*.  That is not expected.  But it does happen with
timeout mismatches, and the double failure you experienced is a common
result of error correction timeout mismatch.  Timeout mismatch is where
your drives are internally trying to retry reading a bad sector long
after the OS has given up.  It is always associated with consumer-grade
hard drives in raid arrays.

You might want to search the list archives for various combinations of
"error recovery", "scterc", "URE" and "timeout mismatch" for a full
description of the problem and the recommended ways to avoid it.
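
A sketch of the remedy usually recommended in those archive threads, under
stated assumptions: try to enable SCT ERC with a 7-second limit (smartctl
takes the value in tenths of a second), and for drives that refuse it,
raise the kernel's per-device command timeout well above the drive's
worst-case internal retry time. The 7 s and 180 s figures are the values
commonly quoted on the list, not something stated in this thread, and
SYSFS_ROOT is a hypothetical override for trying the function against a
scratch directory. Any non-zero exit from smartctl is treated here as
"ERC not available", which is a simplification.

```shell
# Try to cap each drive's internal error recovery at 7.0 s via SCT ERC;
# where that fails, raise the kernel's command timeout to 180 s instead.
# SYSFS_ROOT is a hypothetical override; it defaults to the real /sys.
fix_timeouts() {
    for dev in "$@"; do
        name=${dev##*/}    # /dev/sda -> sda
        if smartctl -l scterc,70,70 "$dev" >/dev/null 2>&1; then
            echo "$dev: SCT ERC set to 7.0 s"
        else
            echo 180 > "${SYSFS_ROOT:-/sys}/block/$name/device/timeout"
            echo "$dev: SCT ERC unavailable, kernel timeout raised to 180 s"
        fi
    done
}

# Usage (as root; device names are examples): fix_timeouts /dev/sda /dev/sdb
```

Neither setting is generally persistent, so something like this would
typically have to run at every boot, or be applied from a udev rule.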

HTH,

Phil

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Michał Sawicz @ 2013-10-06 22:11 UTC
  To: Phil Turmel; +Cc: linux-raid

On 06.10.2013 23:44, Phil Turmel wrote:
> The answer is *NO*.  That is not expected.  But it does happen with
> timeout mismatches, and the double failure you experienced is a common
> result of error correction timeout mismatch.  Timeout mismatch is where
> your drives are internally trying to retry reading a bad sector long
> after the OS has given up.  It is always associated with consumer-grade
> hard drives in raid arrays.

Right, I knew that consumer HDDs did that, but didn't expect this to
cause such mayhem. So the takeaway for me is: as soon as you
see bad blocks on a drive, fail it, otherwise the whole array will
probably get kicked out sooner or later. Or try to manually force the
drive to reallocate, and then do a scrub.

> You might want to search the list archives for various combinations of
> "error recovery", "scterc", "URE" and "timeout mismatch" for a full
> description of the problem and the recommended ways to avoid it.

Thanks, will do.
-- 
Michał (Saviq) Sawicz <michal@sawicz.net>

* Re: Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Phil Turmel @ 2013-10-06 22:15 UTC
  To: Michał Sawicz; +Cc: linux-raid

On 10/06/2013 06:11 PM, Michał Sawicz wrote:
> On 06.10.2013 23:44, Phil Turmel wrote:
>> The answer is *NO*.  That is not expected.  But it does happen with
>> timeout mismatches, and the double failure you experienced is a common
>> result of error correction timeout mismatch.  Timeout mismatch is where
>> your drives are internally trying to retry reading a bad sector long
>> after the OS has given up.  It is always associated with consumer-grade
>> hard drives in raid arrays.
> 
> Right, I knew that consumer HDDs did that, but didn't expect this to
> cause such mayhem. So the take out for me for this is: as soon as you
> see bad blocks on the drive, fail it, otherwise the whole array will
> probably get kicked out sooner or later. Or try and manually force the
> drive to reallocate, and then do a scrub.

No, just fix the timeouts.  Otherwise, you'll be kicking drives out
*way* more often than you think.

Do check your smartctl reports for actual reallocations, though.  In my
experience, once you pass single digits, further failures are rapid.
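
The counters in question can be pulled out of "smartctl -A" output with a
short filter. A sketch, assuming the usual ATA attribute table where the
attribute name is the second column and the raw value the tenth; attribute
names vary somewhat between vendors, so the three names below are an
assumption that fits many but not all drives, and the function name is
made up.

```shell
# Read "smartctl -A" output on stdin and print the raw values of the
# attributes that track reallocated and pending sectors.
bad_sector_counts() {
    awk '$2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable" { print $2 "=" $10 }'
}

# Usage (as root; device name is an example):
#   smartctl -A /dev/sda | bad_sector_counts
```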

>> You might want to search the list archives for various combinations of
>> "error recovery", "scterc", "URE" and "timeout mismatch" for a full
>> description of the problem and the recommended ways to avoid it.
> 
> Thanks, will do.

Phil

* Re: Help with data recovery - RAID6 with 2 failed drives and another with broken sectors
From: Michał Sawicz @ 2013-10-06 22:56 UTC
  To: Phil Turmel; +Cc: linux-raid

On 07.10.2013 00:15, Phil Turmel wrote:
> No, just fix the timeouts.  Otherwise, you'll be kicking drives out
> *way* more often than you think.

Ah! Now that I've actually read through some of the search results (and
found dozens of instances where you recommend the same - you *should*
have a dime for every time you mention that to people), I'm happy to
report that just one of my drives (at least from what they claim) does
not support scterc - the rest just had it disabled... This is fixed now,
and a brand new udev rule should take care of the timeout for the other
drive. This should make my array way more stable - thank you so much!

> Do check your smartctl reports for actual relocations, though.  In my
> experience, once you pass single digits, further failures are rapid.

Smartd notifies me of all such events - and since most of the drives are
still under warranty - I replace them as soon as possible when they
start to show bad blocks.

Again, thank you Phil for your patience.
-- 
Michał (Saviq) Sawicz <michal@sawicz.net>