md failing mechanism

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* md failing mechanism
@ 2016-01-22 17:59 Dark Penguin
  2016-01-22 19:29 ` Phil Turmel
  0 siblings, 1 reply; 21+ messages in thread
From: Dark Penguin @ 2016-01-22 17:59 UTC (permalink / raw)
  To: linux-raid

Greetings,

Recently, I've had my first drive failure in a software RAID1 on a file 
server. And I was really surprised about exactly what happened; I always 
thought that when md can't process a read request from one of the 
drives, it is supposed to mark that drive as faulty and read from 
another drive; but, for some reason, it was deliberately trying to read 
from a faulty drive no matter what, which apparently caused Samba to 
wait until it's finished, and so the whole server was rendered 
inaccessible (I mean, the whole Samba).

What I expected:
- A user tries to read a file via Samba.
- Samba issues a read request to md.
- md tries to read the file from one of the drives... the drive is 
struggling to read a bad sector...
- md thinks: okay, this is taking too long, production is not waiting; 
I'll just read from another drive instead.
- It reads from another drive successfully, and users continue their work.
- Finally, the "bad" drive gives up on trying to read the bad sector and 
returns an error. md marks the drive as faulty and sends an email 
telling me to replace the drive as soon as possible.

What happened instead:
- A user tries to read a file via Samba.
- Samba issues a read request to md.
- md tries to read the file from one of the drives... the drive is 
struggling to read a bad sector... Samba is waiting for md, md is 
waiting for the drive, and the drive is trying again and again to read 
this blasted sector like its life depends on it, while users see that 
the network folder doesn't respond anymore at all.

This goes on forever, until users call me, I come to investigate, see 
Samba down, see a lot of errors in dmesg, and then I manually mark this 
drive as faulty.

Now, that happened a while ago; I did not have the most recent kernel on 
that server (I think it was 3.2 from Debian Wheezy or something a little 
newer from the backports), but I can't try it again with a new server, 
because I can't make a functional RAID1, write data there, and then 
destroy some sectors and see what happens. I just want to ask, is that 
really how it works?.. Was that supposed to happen?.. I thought the main 
point of a RAID1 is to avoid any downtime, especially in such cases!.. 
Or is it maybe a known issue fixed in the more recent versions, so I 
should just update my kernels and expect different behaviour next time?..

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 17:59 md failing mechanism Dark Penguin
@ 2016-01-22 19:29 ` Phil Turmel
  2016-01-22 20:00   ` Wols Lists
  2016-01-22 21:44   ` Dark Penguin
  0 siblings, 2 replies; 21+ messages in thread
From: Phil Turmel @ 2016-01-22 19:29 UTC (permalink / raw)
  To: Dark Penguin, linux-raid

Hi,

On 01/22/2016 12:59 PM, Dark Penguin wrote:
> Greetings,
> 
> Recently, I've had my first drive failure in a software RAID1 on a file
> server. And I was really surprised about exactly what happened; I always
> thought that when md can't process a read request from one of the
> drives, it is supposed to mark that drive as faulty and read from
> another drive; but, for some reason, it was deliberately trying to read
> from a faulty drive no matter what, which apparently caused Samba to
> wait until it's finished, and so the whole server was rendered
> inaccessible (I mean, the whole Samba).

What you've described does sound like a bug, maybe.  It also sounds
similar to traditional timeout mismatch caused by cheap desktop drives
used in a raid array.

In a properly functioning array, the normal sequence of events for a
simple failing sector is:

1) read from sector X fails and is reported by the drive to the kernel
2) kernel tells MD "read failed"
3) MD reads from different mirror or from peers & parity to reconstruct
the failed sector
4a) MD supplies reconstructed sector to upper layer/user.
4b) MD writes reconstructed sector back to failed location to fix it or
relocate it.  If this write succeeds (either case), the device stays in
the array.

The above sequence of events is disturbed when a drive takes too long in
step 1.

It would be good to see your dmesg of this event to see what failure
mode is present.

Meanwhile, some reading material for you:

http://marc.info/?l=linux-raid&m=139050322510249&w=2
http://marc.info/?l=linux-raid&m=135863964624202&w=2
http://marc.info/?l=linux-raid&m=135811522817345&w=1
http://marc.info/?l=linux-raid&m=133761065622164&w=2
http://marc.info/?l=linux-raid&m=132477199207506
http://marc.info/?l=linux-raid&m=133665797115876&w=2
http://marc.info/?l=linux-raid&m=142487508806844&w=3
http://marc.info/?l=linux-raid&m=144535576302583&w=2

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 19:29 ` Phil Turmel
@ 2016-01-22 20:00   ` Wols Lists
  2016-01-22 21:44   ` Dark Penguin
  1 sibling, 0 replies; 21+ messages in thread
From: Wols Lists @ 2016-01-22 20:00 UTC (permalink / raw)
  To: Dark Penguin, linux-raid

On 22/01/16 19:29, Phil Turmel wrote:
> What you've described does sound like a bug, maybe.  It also sounds
> similar to traditional timeout mismatch caused by cheap desktop drives
> used in a raid array.

man smartctl

smartctl -i /dev/sd-dodgy-drive

If I run that on my drives I get

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.1.12-gentoo] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Serial Number:    W1F50K0F
LU WWN Device Id: 5 000c50 0737a98a4
Firmware Version: CC27
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Fri Jan 22 19:56:47 2016 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Disabled

NOTE THAT SMART SUPPORT IS DISABLED.

Yes I know I'm running with dodgy drives and a dodgy config, but smart
support should be enabled and things like ERC should be turned on. If
they're not ...

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 19:29 ` Phil Turmel
  2016-01-22 20:00   ` Wols Lists
@ 2016-01-22 21:44   ` Dark Penguin
  2016-01-22 22:18     ` Phil Turmel
                       ` (2 more replies)
  1 sibling, 3 replies; 21+ messages in thread
From: Dark Penguin @ 2016-01-22 21:44 UTC (permalink / raw)
  To: linux-raid

Oh! Thank you! I really wanted to see a reliable "what's supposed to 
happen" sequence!

As for my case, those were indeed, um, "cheap desktop drives" - to be 
precise, some 80-Gb IDE drives in a Pentium-4 machine; "it works well 
for a small file server", I thought, oblivious to the finer details 
about the process of failure handling... But, I also have "big" file 
servers, so that timeout mismatch issue is something worth paying attention!

And also, now I understand why I probably "should have been scrubbing". 
=/ Do I understand correctly that "scrubbing" means those "monthly 
redundancy checks" that mdadm suggests? And I suppose what it does is 
just the same - read every sector and attempt to write it back upon 
failure, otherwise kicking the device?

So, I understand a common problem now: the read timeout on the "desktop" 
drives is too long, which makes sense for the desktops, but not for 
RAIDs, because the "write back attempt" fails and leads to "BOOM" and 
kick. Enterprise-grade drives, however, offer an option to change their 
timeout, which is called "TL;DR technology" (yes, that's how I'm going 
to call it! Because I can't remember the acronym no matter how may times 
I read it, and the meaning kinda fits!). And what about drives that do 
not support it?.. Do they even have some kid of huge timeout or 
something?.. Yesterday I've been checking one drive for bad blocks 
(badblocks read-only test), and it took no more than two seconds per 
block to confirm its... badness!

As I understand, one way around this problem is to change the kernel 
timeout to exceed the drive timeout by changing 
/sys/block/sd?/device/timeout to something larger than the default 30, 
but I'd have to do that after every reboot, is all that correct?

Still, I don't think it has anything to do with what has happened to my 
"small file server"... It was the opposite; for some reason, it was not 
kicked from the array. But, it happened a while ago, and I've destroyed 
the array afterwards, so I can't get any more data about that incident. 
But, I've got what I wanted: I now I know what is supposed to happen 
when a drive in a RAID fails, and it's not what happened that time. And 
I know I should set up proper TL;DR timeouts and scrubbing...

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 21:44   ` Dark Penguin
@ 2016-01-22 22:18     ` Phil Turmel
  2016-01-22 22:50       ` Dark Penguin
  2016-01-22 23:34       ` Wols Lists
  2016-01-22 22:37     ` Edward Kuns
  2016-01-22 23:40     ` James J
  2 siblings, 2 replies; 21+ messages in thread
From: Phil Turmel @ 2016-01-22 22:18 UTC (permalink / raw)
  To: Dark Penguin, linux-raid

On 01/22/2016 04:44 PM, Dark Penguin wrote:
> Oh! Thank you! I really wanted to see a reliable "what's supposed to
> happen" sequence!

You're welcome.

> As for my case, those were indeed, um, "cheap desktop drives" - to be
> precise, some 80-Gb IDE drives in a Pentium-4 machine; "it works well
> for a small file server", I thought, oblivious to the finer details
> about the process of failure handling... But, I also have "big" file
> servers, so that timeout mismatch issue is something worth paying
> attention!
> 
> And also, now I understand why I probably "should have been scrubbing".
> =/ Do I understand correctly that "scrubbing" means those "monthly
> redundancy checks" that mdadm suggests? And I suppose what it does is
> just the same - read every sector and attempt to write it back upon
> failure, otherwise kicking the device?

A "check" scrub reads every sector every member device's data area.  If
any fail, the normal reconstruct and rewrite will fix it.  It also looks
for successfull reads where the data is inconsistent between mirrors or
between data blocks and parity blocks.  Those are counted for you to review.

A "repair" scrub reads forcibly ensures consistent redundancy by copying
mirror one to the others, and recomputing parity from data.  It will
also reconstruct if needed.

The "check" mode is your recommended regular scrub.  I do mine weekly,
but monthly is probably fine.  "Repair" is needed if "check" reports any
mismatches.

> ..... is all that correct?

From one of your reading assignments: (
http://marc.info/?l=linux-raid&m=135811522817345&w=1 )

> Options are:
> 
> A) Buy Enterprise drives. They have appropriate error timeouts and work
> properly with MD right out of the box.
> 
> B) Buy Desktop drives with SCTERC support. They have inappropriate
> default timeouts, but can be set to an appropriate value. Udev or boot
> script assistance is needed to call smartctl to set it. They do *not*
> work properly with MD out of the box.
> 
> C) Suffer with desktop drives without SCTERC support. They cannot be
> set to appropriate error timeouts. Udev or boot script assistance is
> needed to set a 120 second driver timeout in sysfs. They do *not* work
> properly with MD out of the box.
> 
> D) Lose your data during spare rebuild after your first URE. (Odds in
> proportion to array size.)
> 
> One last point bears repeating: MD is *not* a backup system, although
> some people leverage it's features for rotating off-site backup disks.
> Raid arrays are all about *uptime*. They will not save you from
> accidental deletion or other operator errors. They will not save you if
> your office burns down. You need a separate backup system for critical
> files.

Since that was written, 'A' would now include almost-enterprise drives
with RAID ratings like the Western Digital Red family.  And the
recommended timeout for 'C' has drifted upward to 180.

[trim /]

> Still, I don't think it has anything to do with what has happened to my
> "small file server"...

That's why I asked for the dmesg.  It could have been a bug.  No crisis
if it's lost, so long as you've accepted one of A through D above.

Phil

ps.  convention on kernel.org is reply-to-all and no top-posting.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 22:18     ` Phil Turmel
@ 2016-01-22 22:50       ` Dark Penguin
  2016-01-22 23:23         ` Edward Kuns
  2016-01-22 23:34       ` Wols Lists
  1 sibling, 1 reply; 21+ messages in thread
From: Dark Penguin @ 2016-01-22 22:50 UTC (permalink / raw)
  To: Phil Turmel, linux-raid

>> C) Suffer with desktop drives without SCTERC support. They cannot be
>> set to appropriate error timeouts. Udev or boot script assistance is
>> needed to set a 120 second driver timeout in sysfs. They do *not* work
>> properly with MD out of the box.

 > the recommended timeout for 'C' has drifted upward to 180.

Yes, I saw this; but, is it really not possible to examine the default 
timeout in a certain desktop drive, rather than follow rough estimates 
like "about two of three minutes should be enough"?.. I wanted to make 
sure it is indeed not possible, because that is hard to believe. Or do 
they not have a specified timeout at all?..

> Since that was written, 'A' would now include almost-enterprise drives
> with RAID ratings like the Western Digital Red family.

Yes, I understand that; always make sure they support it.

>> Still, I don't think it has anything to do with what has happened to my
>> "small file server"...
>
> That's why I asked for the dmesg.  It could have been a bug.  No crisis
> if it's lost, so long as you've accepted one of A through D above.

I've moved all the data to another server, disassembled this one, and 
reused the surviving hard drive, so I'm safe, but sadly, no logs. The 
important thing is, I've confirmed that this is not the expected 
behaviour - I was kind of ready to hear that "that's how it is with 
softraids, faulty drives hang your entire system like they hang Windows".

I've checked all my hard drives in all my RAIDs; all of them support 
"TL;DR" technology. They were all made before the "crippling" tendencies 
took over, and they are mostly Hitachi, so I'm lucky.

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 22:50       ` Dark Penguin
@ 2016-01-22 23:23         ` Edward Kuns
  0 siblings, 0 replies; 21+ messages in thread
From: Edward Kuns @ 2016-01-22 23:23 UTC (permalink / raw)
  To: Dark Penguin; +Cc: Phil Turmel, Linux-RAID

On Fri, Jan 22, 2016 at 4:50 PM, Dark Penguin <darkpenguin@yandex.ru> wrote:
>> the recommended timeout for 'C' has drifted upward to 180.
>
> Yes, I saw this; but, is it really not possible to examine the default
> timeout in a certain desktop drive, rather than follow rough estimates like
> "about two of three minutes should be enough"?

When I asked something similar in
https://www.marc.info/?l=linux-raid&m=144666581206186&w=2 the answers
I got

https://www.marc.info/?l=linux-raid&m=144666614306348&w=2
https://www.marc.info/?l=linux-raid&m=144666776906740&w=2

indicated that you just need the kernel timeout to be longer than the
disk timeout.  *Anything* longer is good.  It just has to be longer.
You want the *disk* to return a complaint.  You don't want the driver
to time out before the disk has had the opportunity to complain.  The
point of setting SCTERC to 7 seconds is to force the *drive* (not
kernel) to give up quickly and complain.  That'll force a rebuild,
rewriting the sector in question and remapping the bad sector on the
drive with the failure.  (If I've misunderstood this hopefully someone
much more expert will correct me!)  If you can't configure the drive
to give up quickly on a read error, then your remaining alternative is
to tell the kernel to wait (effectively) forever (180 sec) to allow
the drive enough time to give up on its own.

Regarding disabling scrubbing, I'd say that even with disabled SCTERC
(TL;DR haha) and bad default timeouts, it's still better for the array
to fail right away when you only have a single bad sector on a single
drive.  If you let it wait, you run the risk of developing a bad
sector on another drive, so that when you hit the first bad sector
through usage and the drive fails and you trigger a rebuild, you hit
the bad sector on the other drive and *bam* you're toast.

               Eddie

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 22:18     ` Phil Turmel
  2016-01-22 22:50       ` Dark Penguin
@ 2016-01-22 23:34       ` Wols Lists
  2016-01-23  0:09         ` Dark Penguin
  1 sibling, 1 reply; 21+ messages in thread
From: Wols Lists @ 2016-01-22 23:34 UTC (permalink / raw)
  To: Phil Turmel, Dark Penguin, linux-raid

On 22/01/16 22:18, Phil Turmel wrote:
> Since that was written, 'A' would now include almost-enterprise drives
> with RAID ratings like the Western Digital Red family.  And the
> recommended timeout for 'C' has drifted upward to 180.

Almost all posts mention WD Reds. It's NOT a recommendation, but seeing
as I've tended to buy Seagate (note my Barracudas) I looked for the
Seagate equivalent. They're called NAS drives, and they cost roughly the
same ... just pointing this out in case people like Seagate or dislike
WD ... :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:34       ` Wols Lists
@ 2016-01-23  0:09         ` Dark Penguin
  0 siblings, 0 replies; 21+ messages in thread
From: Dark Penguin @ 2016-01-23  0:09 UTC (permalink / raw)
  To: Wols Lists, Phil Turmel, linux-raid

>> Since that was written, 'A' would now include almost-enterprise drives
>> with RAID ratings like the Western Digital Red family.  And the
>> recommended timeout for 'C' has drifted upward to 180.
>
> Almost all posts mention WD Reds. It's NOT a recommendation, but seeing
> as I've tended to buy Seagate (note my Barracudas) I looked for the
> Seagate equivalent. They're called NAS drives, and they cost roughly the
> same ... just pointing this out in case people like Seagate or dislike
> WD ... :-)

I would prefer a Seagate over a WD, but with my experience lately, I 
don't want any more Seagates, either... I switched to Hitachi, but they 
don't make low-speed drives anymore, and they are expensive... I wonder 
of those Seagate NAS drives are actually anywhere near "reliable"?..


-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 21:44   ` Dark Penguin
  2016-01-22 22:18     ` Phil Turmel
@ 2016-01-22 22:37     ` Edward Kuns
  2016-01-22 23:07       ` Dark Penguin
  2016-01-22 23:40     ` James J
  2 siblings, 1 reply; 21+ messages in thread
From: Edward Kuns @ 2016-01-22 22:37 UTC (permalink / raw)
  To: Dark Penguin; +Cc: Linux-RAID

On Fri, Jan 22, 2016 at 3:44 PM, Dark Penguin <darkpenguin@yandex.ru> wrote:
> And also, now I understand why I probably "should have been scrubbing". =/

Depending on your distribution, you may have been scrubbing all along.
When I looked into this, I discovered that mdadm as bundled in Fedora
(at least as of 21) already scrubs weekly:

$ rpm -ql mdadm | grep /etc
/etc/cron.d/raid-check
/etc/libreport/events.d/mdadm_event.conf
/etc/sysconfig/raid-check

$ cat /etc/cron.d/raid-check
# Run system wide raid-check once a week on Sunday at 1am by default
0 1 * * Sun root /usr/sbin/raid-check

Sweet!  I was pleased to discovered this when I realized I had been
lax.  The /usr/sbin/raid-check script does scrubbing, as configured by
/etc/sysconfig/raid-check.  I looked in /var/log/messages and indeed
saw evidence of successful weekly scrubbing.

Notice that if you have problems with timeouts, then this scrubbing
can break your array by causing you to hit a bad sector and fail as
Phil and others have described in several of his referenced EMails.
But better to fail early while only one drive is bad than to discover
this after more than one drive has problems and your data is
irrecoverable.  I had a mirror where one drive kept falling out.  I
now understand why.  (Weekly scrubbing + dodgy drives + no attempt to
address the timeouts == occasional unnecessary failure.)

> As I understand, one way around this problem is to change the kernel timeout
> to exceed the drive timeout by changing /sys/block/sd?/device/timeout to
> something larger than the default 30, but I'd have to do that after every
> reboot, is all that correct?

I took the script from this Email:

https://www.marc.info/?l=linux-raid&m=144661276420400&w=2

and dropped that code in my /etc/rc.d/rc.local after verifying that my
Linux distribution still ran that script on every startup.  YMMV.
That solved my problem.  Good luck.

             Eddie

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 22:37     ` Edward Kuns
@ 2016-01-22 23:07       ` Dark Penguin
  2016-01-22 23:39         ` Wols Lists
  2016-01-23  0:34         ` Phil Turmel
  0 siblings, 2 replies; 21+ messages in thread
From: Dark Penguin @ 2016-01-22 23:07 UTC (permalink / raw)
  To: Edward Kuns, linux-raid

> Depending on your distribution, you may have been scrubbing all along.

 > Notice that if you have problems with timeouts, then this scrubbing
 > can break your array by causing you to hit a bad sector and fail as
 > Phil and others have described in several of his referenced EMails.

I remember disabling scrubbing myself. My reasons were not very... 
bright, but now it turned out to be a it's a good thing, because with 
TL;DR disabled by default, it could lead to that kind of bad things 
happening, yes. I remember having one drive kicked out of an array in my 
home storage, and since then, I've learned to use write-intent bitmaps 
to re-add them more easily. But I'm a BAARF person, so I only have 
mirrors; I wonder what happens if the only drive in a degraded mirror 
fails?..

> I took the script from this Email:
>
> https://www.marc.info/?l=linux-raid&m=144661276420400&w=2
>
> and dropped that code in my /etc/rc.d/rc.local after verifying that my
> Linux distribution still ran that script on every startup.  YMMV.
> That solved my problem.  Good luck.

I was afraid to learn that my latest pack of drives are modern enough to 
be castrated already, but it turned out they are not. :) My anime is 
safe with the last 5900RPM Hitachi drives available, and they do support 
TL;DR. I wish Hitachi made more 5900RPM drives for such purposes...

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:07       ` Dark Penguin
@ 2016-01-22 23:39         ` Wols Lists
  2016-01-23  0:09           ` Dark Penguin
  2016-01-23  0:34         ` Phil Turmel
  1 sibling, 1 reply; 21+ messages in thread
From: Wols Lists @ 2016-01-22 23:39 UTC (permalink / raw)
  To: Dark Penguin, Edward Kuns, linux-raid

On 22/01/16 23:07, Dark Penguin wrote:
> I was afraid to learn that my latest pack of drives are modern enough to
> be castrated already, but it turned out they are not. :) My anime is
> safe with the last 5900RPM Hitachi drives available, and they do support
> TL;DR. I wish Hitachi made more 5900RPM drives for such purposes...

I have a gut feel that drives of 1TB or less are "old technology" and
even new today are okay. It's the multi-TB drives that have been
castrated...

Note I said "gut feel" - don't trust what I think ... :-)

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:39         ` Wols Lists
@ 2016-01-23  0:09           ` Dark Penguin
  0 siblings, 0 replies; 21+ messages in thread
From: Dark Penguin @ 2016-01-23  0:09 UTC (permalink / raw)
  To: Wols Lists, Edward Kuns, linux-raid

>> I was afraid to learn that my latest pack of drives are modern enough to
>> be castrated already, but it turned out they are not. :) My anime is
>> safe with the last 5900RPM Hitachi drives available, and they do support
>> TL;DR. I wish Hitachi made more 5900RPM drives for such purposes...
>
> I have a gut feel that drives of 1TB or less are "old technology" and
> even new today are okay. It's the multi-TB drives that have been
> castrated...
>
> Note I said "gut feel" - don't trust what I think ... :-)

I'm not sure exactly when did the "castration" started, but I have few 
Seagate LP's 1,5 and 2 Tb, over 5 years old: they support it. But I 
found out that they suck in terms of reliability, so I switched to 
Hitachi. There was this 5900RPM model about four years ago, I think - I 
believe it was almost the only low-RPM one Hitachi ever did; they 
support it, too. And I have a 500Gb 2,5" Hitachi Travelstar bought a 
year or two later, and even they support it! So, I guess I was right; 
apparently, Hitachi is not into the castration business... Has anybody 
seen any Hitachi without TL;DR ?..

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:07       ` Dark Penguin
  2016-01-22 23:39         ` Wols Lists
@ 2016-01-23  0:34         ` Phil Turmel
  2016-01-23 10:33           ` Dark Penguin
  1 sibling, 1 reply; 21+ messages in thread
From: Phil Turmel @ 2016-01-23  0:34 UTC (permalink / raw)
  To: Dark Penguin, Edward Kuns, linux-raid

On 01/22/2016 06:07 PM, Dark Penguin wrote:
>> Depending on your distribution, you may have been scrubbing all along.
> 
>> Notice that if you have problems with timeouts, then this scrubbing
>> can break your array by causing you to hit a bad sector and fail as
>> Phil and others have described in several of his referenced EMails.
> 
> I remember disabling scrubbing myself. My reasons were not very...
> bright, but now it turned out to be a it's a good thing, because with
> TL;DR disabled by default, it could lead to that kind of bad things
> happening, yes. 

Nooooo!

Disabling scrubbing with "dodgy" drives turns an annoyance (drives
kicked out by scrubs) into a catastrophe later (unrecoverable array).
The archives of this list are full of such incidents.

The only people who should disable scrubbing are enterprise data centers
where the disks are so busy they hit every sector every so often anyways.

> I remember having one drive kicked out of an array in my
> home storage, and since then, I've learned to use write-intent bitmaps
> to re-add them more easily. But I'm a BAARF person, so I only have
> mirrors; I wonder what happens if the only drive in a degraded mirror
> fails?..

Or any drive in a raid5 hits an error while replacing a failed disk.

BOOM.

Phil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-23  0:34         ` Phil Turmel
@ 2016-01-23 10:33           ` Dark Penguin
  2016-01-23 15:12             ` Phil Turmel
  0 siblings, 1 reply; 21+ messages in thread
From: Dark Penguin @ 2016-01-23 10:33 UTC (permalink / raw)
  To: Phil Turmel, Edward Kuns, linux-raid

>> I wonder what happens if the only drive in a degraded mirror fails?..
>
> Or any drive in a raid5 hits an error while replacing a failed disk.
>
> BOOM.

Oh... That sucks. I understand that BOOM would happen on a RAID-F, but 
kicking the last drive doesn't seem like a very rational thing to do - 
you would get a RAID with zero drives, which is not something anybody 
would want in any situation!.. I thought it would get into some kind of 
extreme panic mode so that you could at least save what you can...

Actually, it would make sense to avoid "BOOM" as much as possible; 
instead of going into a "BOOM" mode, it should go into some kind of 
"recovery" mode, when the array immediately turns read-only and allows 
you to salvage as much data as you can from the faulty drive. It would 
be especially useful in RAID5; only one faulty sector or even some other 
stupid error without even faulty sectors should not cause total data 
loss!.. I don't think it would be very hard to implement, or contradict 
some important philosophy, but it would save data, jobs and lives. :)

Well, suppose "BOOM" happened on my RAID1, and both drives failed. 
Naturally, not at the same time, so data differs between the drives. So 
I have two faulty drives; is it possible (with some other recovery 
software?) to just mount each one read-only and copy data from it?.. I 
can live with having to copy as much data as I can from the first one 
and try to copy the rest from the older drive. I mean, I basically have 
the entire filesystem there! I should be able to just mount the 
filesystem on that drive, skipping the RAID header - just fdisk the 
drive, delete the RAID partition and create another one starting a few 
blocks later!.. Would that not work?.. That's actually why I chose 
RAID1; even if everything goes BOOM, I should be able to mount the 
drives separately!..

-- 
darkpenguin

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-23 10:33           ` Dark Penguin
@ 2016-01-23 15:12             ` Phil Turmel
  0 siblings, 0 replies; 21+ messages in thread
From: Phil Turmel @ 2016-01-23 15:12 UTC (permalink / raw)
  To: Dark Penguin, Edward Kuns, linux-raid

On 01/23/2016 05:33 AM, Dark Penguin wrote:
>>> I wonder what happens if the only drive in a degraded mirror fails?..
>>
>> Or any drive in a raid5 hits an error while replacing a failed disk.
>>
>> BOOM.
> 
> Oh... That sucks. I understand that BOOM would happen on a RAID-F, but
> kicking the last drive doesn't seem like a very rational thing to do -
> you would get a RAID with zero drives, which is not something anybody
> would want in any situation!.. I thought it would get into some kind of
> extreme panic mode so that you could at least save what you can...

It does.  It's called manual intervention.

> Actually, it would make sense to avoid "BOOM" as much as possible;
> instead of going into a "BOOM" mode, it should go into some kind of
> "recovery" mode, when the array immediately turns read-only and allows
> you to salvage as much data as you can from the faulty drive. It would
> be especially useful in RAID5; only one faulty sector or even some other
> stupid error without even faulty sectors should not cause total data
> loss!.. I don't think it would be very hard to implement, or contradict
> some important philosophy, but it would save data, jobs and lives. :)

Well, the root causes vary, and the actions needed vary accordingly.
There isn't enough information to proceed without operator assistance.
This list has helped numerous people recover their data.  ( But not
always. )

> Well, suppose "BOOM" happened on my RAID1, and both drives failed.
> Naturally, not at the same time, so data differs between the drives. So
> I have two faulty drives; is it possible (with some other recovery
> software?) to just mount each one read-only and copy data from it?.. I
> can live with having to copy as much data as I can from the first one
> and try to copy the rest from the older drive. I mean, I basically have
> the entire filesystem there! I should be able to just mount the
> filesystem on that drive, skipping the RAID header - just fdisk the
> drive, delete the RAID partition and create another one starting a few
> blocks later!.. Would that not work?.. That's actually why I chose
> RAID1; even if everything goes BOOM, I should be able to mount the
> drives separately!..

Yes, you can use dmsetup to gain access to any raid1's data area.
Mirrors using v0.90 and v1.0 metadata can be accessed directly.  (v.90
is strongly discouraged, though, as there are unavoidable ambiguities in
that old format that can blow up on you.)

raid10 and parity raids cannot be accessed like this.  You need
scrubbing and good drives, quick response to failures, and backups for
the final tiny probability of failure.

Phil

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 21:44   ` Dark Penguin
  2016-01-22 22:18     ` Phil Turmel
  2016-01-22 22:37     ` Edward Kuns
@ 2016-01-22 23:40     ` James J
  2016-01-23  0:44       ` Phil Turmel
  2016-01-23 14:09       ` Wols Lists
  2 siblings, 2 replies; 21+ messages in thread
From: James J @ 2016-01-22 23:40 UTC (permalink / raw)
  To: Dark Penguin, linux-raid

On 22/01/2016 22:44, Dark Penguin wrote:
>
> As I understand, one way around this problem is to change the kernel 
> timeout to exceed the drive timeout by changing 
> /sys/block/sd?/device/timeout to something larger than the default 30, 
> but I'd have to do that after every reboot, is all that correct?
>

No, this part needs further investigation and comments from the gurus.

With a SCSI timeout 30 secs, which is the setting you had at the time of 
the incident AFAIU, what should have happened was that the drive should 
have been kicked out at the 30th second, this is BEFORE it had a chance 
to return a read failure because your desktop drive takes more than 
30secs to return a read failure. This was what you indeed expected but 
it is not what has happened.

The recommentation of raising the timeout to 120+ is for the opposite 
purpose of what you want. It is for the case the sysadmin accepts to 
wait a long time because he wants to prevent the kicking of the drive at 
the first read-error (normally drives are kicked for a write error). 
This might be wanted in order to a) defer the replacement of the drive, 
either to perform the replacement at a more opportune time and/or in a 
better manner such as a no-degrade replace operation, or b) because he 
does not want to replace the drive at all: maybe he believes that the 
error might be spurious and will not happen again and the drive is still 
of acceptable fitness for the purpose, e.g. in a low-cost file server.

So what happened is still wrong AFAIK, in the sense of a kernel bug.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:40     ` James J
@ 2016-01-23  0:44       ` Phil Turmel
  2016-01-23 14:09       ` Wols Lists
  1 sibling, 0 replies; 21+ messages in thread
From: Phil Turmel @ 2016-01-23  0:44 UTC (permalink / raw)
  To: James J, Dark Penguin, linux-raid

On 01/22/2016 06:40 PM, James J wrote:
> On 22/01/2016 22:44, Dark Penguin wrote:
>>
>> As I understand, one way around this problem is to change the kernel
>> timeout to exceed the drive timeout by changing
>> /sys/block/sd?/device/timeout to something larger than the default 30,
>> but I'd have to do that after every reboot, is all that correct?
>>
> 
> No, this part needs further investigation and comments from the gurus.

Yes, DP had that correct.

> With a SCSI timeout 30 secs, which is the setting you had at the time of
> the incident AFAIU, what should have happened was that the drive should
> have been kicked out at the 30th second, this is BEFORE it had a chance
> to return a read failure because your desktop drive takes more than
> 30secs to return a read failure. This was what you indeed expected but
> it is not what has happened.

His problem description doesn't perfectly match timeout mismatch.  He
probably had a real problem that was exacerbated by his now-discovered
timeout problem.  He no longer has the dmesg so further speculation is
moot.  If it happens again, we can look closer.

> The recommentation of raising the timeout to 120+ is for the opposite
> purpose of what you want. It is for the case the sysadmin accepts to
> wait a long time because he wants to prevent the kicking of the drive at
> the first read-error (normally drives are kicked for a write error).
> This might be wanted in order to a) defer the replacement of the drive,
> either to perform the replacement at a more opportune time and/or in a
> better manner such as a no-degrade replace operation, or b) because he
> does not want to replace the drive at all: maybe he believes that the
> error might be spurious and will not happen again and the drive is still
> of acceptable fitness for the purpose, e.g. in a low-cost file server.

No.  If you have a drive that doesn't support scterc or has it turned
off, you *must* set a timeout longer than the drive's native timeout or
you will have great problems.  I suggest you read the references to the
archives I posted.

Keep in mind that in a properly working array UREs are *fixed* when
discovered by overwriting them.  This is vital to array robustness, as
many UREs are transient (don't need relocation at all).

Phil


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-22 23:40     ` James J
  2016-01-23  0:44       ` Phil Turmel
@ 2016-01-23 14:09       ` Wols Lists
  2016-01-23 19:02         ` James J
  1 sibling, 1 reply; 21+ messages in thread
From: Wols Lists @ 2016-01-23 14:09 UTC (permalink / raw)
  To: James J, linux-raid

On 22/01/16 23:40, James J wrote:
> The recommentation of raising the timeout to 120+ is for the opposite
> purpose of what you want. It is for the case the sysadmin accepts to
> wait a long time because he wants to prevent the kicking of the drive at
> the first read-error (normally drives are kicked for a write error).
> This might be wanted in order to a) defer the replacement of the drive,
> either to perform the replacement at a more opportune time and/or in a
> better manner such as a no-degrade replace operation, or b) because he
> does not want to replace the drive at all: maybe he believes that the
> error might be spurious and will not happen again and the drive is still
> of acceptable fitness for the purpose, e.g. in a low-cost file server.

Except, aiui, even in your scenario! drives are kicked for a *write* error.

What happens (should be) is the kernel times out, the raid handles the
read error by trying a rewrite, the drive is still hung on the read
error so it doesn't respond to the write request, and the drive gets
kicked for a write failure.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-23 14:09       ` Wols Lists
@ 2016-01-23 19:02         ` James J
  2016-01-24 22:13           ` Adam Goryachev
  0 siblings, 1 reply; 21+ messages in thread
From: James J @ 2016-01-23 19:02 UTC (permalink / raw)
  To: Wols Lists, linux-raid

On 23/01/2016 15:09, Wols Lists wrote:
> On 22/01/16 23:40, James J wrote:
>> The recommentation of raising the timeout to 120+ is for the opposite
>> purpose of what you want. It is for the case the sysadmin accepts to
>> wait a long time because he wants to prevent the kicking of the drive at
>> the first read-error (normally drives are kicked for a write error).
>> This might be wanted in order to a) defer the replacement of the drive,
>> either to perform the replacement at a more opportune time and/or in a
>> better manner such as a no-degrade replace operation, or b) because he
>> does not want to replace the drive at all: maybe he believes that the
>> error might be spurious and will not happen again and the drive is still
>> of acceptable fitness for the purpose, e.g. in a low-cost file server.
> Except, aiui, even in your scenario! drives are kicked for a *write* error.
>
> What happens (should be) is the kernel times out, the raid handles the
> read error by trying a rewrite, the drive is still hung on the read
> error so it doesn't respond to the write request, and the drive gets
> kicked for a write failure.

Oh yes you are correct, so the drive would be kicked after 60secs and 
not after 30secs contrary to what I said.
So the sequence would be: drive stuck on read --> scsi read failure due 
to timeout at the 30th second --> MD receives failure and attempts 
rewrite --> scsi write failure due to timeout at the 60th second --> 
drive kicked by MD at the 60th second
I think this is what should have happened, but it didn't happen like 
this anyway so I think there is probably a kernel bug somewhere.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: md failing mechanism
  2016-01-23 19:02         ` James J
@ 2016-01-24 22:13           ` Adam Goryachev
  0 siblings, 0 replies; 21+ messages in thread
From: Adam Goryachev @ 2016-01-24 22:13 UTC (permalink / raw)
  To: James J, Wols Lists, linux-raid



On 24/01/2016 06:02, James J wrote:
> On 23/01/2016 15:09, Wols Lists wrote:
>> On 22/01/16 23:40, James J wrote:
>>> The recommentation of raising the timeout to 120+ is for the opposite
>>> purpose of what you want. It is for the case the sysadmin accepts to
>>> wait a long time because he wants to prevent the kicking of the 
>>> drive at
>>> the first read-error (normally drives are kicked for a write error).
>>> This might be wanted in order to a) defer the replacement of the drive,
>>> either to perform the replacement at a more opportune time and/or in a
>>> better manner such as a no-degrade replace operation, or b) because he
>>> does not want to replace the drive at all: maybe he believes that the
>>> error might be spurious and will not happen again and the drive is 
>>> still
>>> of acceptable fitness for the purpose, e.g. in a low-cost file server.
>> Except, aiui, even in your scenario! drives are kicked for a *write* 
>> error.
>>
>> What happens (should be) is the kernel times out, the raid handles the
>> read error by trying a rewrite, the drive is still hung on the read
>> error so it doesn't respond to the write request, and the drive gets
>> kicked for a write failure.
>
> Oh yes you are correct, so the drive would be kicked after 60secs and 
> not after 30secs contrary to what I said.
> So the sequence would be: drive stuck on read --> scsi read failure 
> due to timeout at the 30th second --> MD receives failure and attempts 
> rewrite --> scsi write failure due to timeout at the 60th second --> 
> drive kicked by MD at the 60th second
> I think this is what should have happened, but it didn't happen like 
> this anyway so I think there is probably a kernel bug somewhere.
I don't have a lot to add, except that I recall the OP suggested it was 
an IDE drive. I wonder if the IDE sub-system and/or hardware operates 
differently compared to the sata variants. Possibly the MD layer never 
got any timeout or error on the read, and (or maybe it was the write) 
and hence it was never kicked from the array.

Regards,
Adam

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-01-24 22:13 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-01-22 17:59 md failing mechanism Dark Penguin
2016-01-22 19:29 ` Phil Turmel
2016-01-22 20:00   ` Wols Lists
2016-01-22 21:44   ` Dark Penguin
2016-01-22 22:18     ` Phil Turmel
2016-01-22 22:50       ` Dark Penguin
2016-01-22 23:23         ` Edward Kuns
2016-01-22 23:34       ` Wols Lists
2016-01-23  0:09         ` Dark Penguin
2016-01-22 22:37     ` Edward Kuns
2016-01-22 23:07       ` Dark Penguin
2016-01-22 23:39         ` Wols Lists
2016-01-23  0:09           ` Dark Penguin
2016-01-23  0:34         ` Phil Turmel
2016-01-23 10:33           ` Dark Penguin
2016-01-23 15:12             ` Phil Turmel
2016-01-22 23:40     ` James J
2016-01-23  0:44       ` Phil Turmel
2016-01-23 14:09       ` Wols Lists
2016-01-23 19:02         ` James J
2016-01-24 22:13           ` Adam Goryachev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).