URE, link resets, user hostile defaults

linux-raid.vger.kernel.org archive mirror

* URE, link resets, user hostile defaults
@ 2016-06-27 16:42 Chris Murphy
  2016-06-28  6:33 ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2016-06-27 16:42 UTC (permalink / raw)
  To: linux-raid

Hi,

Drives that don't support SCT ERC, or that leave it unset, can spend a
long time on error recovery for marginal or bad sectors: recovery times
of upwards of 180 seconds have been suggested.

The kernel's SCSI command timer default of 30 seconds, i.e.

cat /sys/block/<dev>/device/timeout

conspires to undermine the deep recovery of most drives now on the
market. This default misconfiguration results in problems that list
regulars are very well aware of. It affects all RAID configurations,
and even the non-RAID single drive use case. And it does so in
a way that doesn't happen on either Windows or macOS. Basically it is
Linux kernel induced data loss: the drive very possibly could present
the requested data if deep recovery were permitted, but the
kernel's command timer expires before recovery completes, and
obliterates any possibility of recovering that data. By default.
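
A minimal sketch of inspecting the timer discussed above, assuming the
sysfs layout shown in the `cat` command. The function names and the
`sysfs_root` parameter are illustrative only (the parameter exists so
the code can be pointed at a test directory).

```python
from pathlib import Path


def read_command_timeout(dev: str, sysfs_root: str = "/sys") -> int:
    """Return the SCSI command timer for one block device, in seconds."""
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    return int(path.read_text().strip())


def all_command_timeouts(sysfs_root: str = "/sys") -> dict:
    """Map each block device that exposes a command timer to its value."""
    timeouts = {}
    # Devices without a device/timeout attribute are simply not matched.
    for entry in sorted((Path(sysfs_root) / "block").glob("*/device/timeout")):
        dev = entry.parent.parent.name  # .../block/<dev>/device/timeout
        timeouts[dev] = int(entry.read_text().strip())
    return timeouts
```

On a stock system each listed device would be expected to report 30,
the default described above.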

This now seems to affect the majority of use cases. At one time 30
seconds might have been sane for a world with drives that had less
than 30 second recoveries for bad sectors. But that's no longer the
case.

I'm wondering if anyone has floated the idea of changing the kernel's
default SCSI command timer? And if so, is there a thread discussing
where that was rejected upstream? Or does this expose other liabilities
that merit an alternative workaround for what now amounts to a
defect? Maybe it needs to be a udev rule?

Perhaps ideally what we'd like to have is two timers. A timer that
reports back "slowness" for a drive to complete a queued command,
which could be used by e.g. scrubs to preemptively overwrite those
sectors rather than wait for read errors to happen. And then a timer
with a longer value would be the present timer that results in a link
reset once it's reached.
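
The two-timer idea can be sketched as a simple classification of command
completion times. This is only an illustration: the threshold values (7
seconds for "slow", 180 seconds for the reset) and all names are
assumptions, not an existing kernel interface.

```python
from enum import Enum


class Outcome(Enum):
    OK = "ok"        # completed promptly
    SLOW = "slow"    # completed, but the sector looks marginal
    RESET = "reset"  # exceeded the long timer; link reset would follow


def classify(elapsed_s: float, slow_after_s: float = 7.0,
             reset_after_s: float = 180.0) -> Outcome:
    """Classify one command by how long the drive took to answer it."""
    if slow_after_s >= reset_after_s:
        raise ValueError("the slow timer must be shorter than the reset timer")
    if elapsed_s >= reset_after_s:
        return Outcome.RESET
    if elapsed_s >= slow_after_s:
        return Outcome.SLOW
    return Outcome.OK


def lbas_to_rewrite(completions, slow_after_s: float = 7.0,
                    reset_after_s: float = 180.0) -> list:
    """Pick the LBAs a scrub could preemptively overwrite.

    `completions` is an iterable of (lba, elapsed_seconds) pairs.
    """
    return [lba for lba, elapsed in completions
            if classify(elapsed, slow_after_s, reset_after_s) is Outcome.SLOW]
```

A scrub fed `[(100, 0.01), (200, 12.0)]` would select only LBA 200 for
rewriting.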

Thanks,

-- 
Chris Murphy

* Re: URE, link resets, user hostile defaults
  2016-06-27 16:42 URE, link resets, user hostile defaults Chris Murphy
@ 2016-06-28  6:33 ` Hannes Reinecke
  2016-06-28 17:33   ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2016-06-28  6:33 UTC (permalink / raw)
  To: Chris Murphy, linux-raid

On 06/27/2016 06:42 PM, Chris Murphy wrote:
> Hi,
> 
> Drives with SCT ERC not supported or unset, result in potentially long
> error recoveries for marginal or bad sectors: upwards of 180 second
> recovers are suggested.
> 
> The kernel's SCSI command timer default of 30 seconds, i.e.
> 
> cat /sys/block/<dev>/device/timeout
> 
> conspires to  undermine the deep recovery of most drives now on the
> market. This by default misconfiguration results in problems list
> regulars are very well aware of. It affects all raid configurations,
> and even affects the non-RAID single drive use case. And it does so in
> a way that doesn't happen on either Windows or macOS. Basically it is
> linux kernel induced data loss, the drive very possibly could present
> the requested data upon deep recovery being permitted, but the
> kernel's command timer is reached before recovery completes, and
> obliterates any possibility of recovering that data. By default.
> 
> This now seems to affect the majority of use cases. At one time 30
> seconds might have been sane for a world with drives that had less
> than 30 second recoveries for bad sectors. But that's no longer the
> case.
> 
'Majority of use cases'.
Hardly. I'm not aware of any issues here.

The problem with SCT ERC (or TLER or whatever the current acronym of
the day is called) is that it's a non-standard setting, where every
vendor basically does its own thing.
Plus you can only influence this on higher-end disks; on others you are
at the mercy of the drive firmware, hoping you got the timeout right.

Can you post a message log detailing this problem?
We surely have ways of influencing the timeout, but first we need to
understand what actually is happening.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: URE, link resets, user hostile defaults
  2016-06-28  6:33 ` Hannes Reinecke
@ 2016-06-28 17:33   ` Chris Murphy
  2016-06-28 18:28     ` Phil Turmel
                       ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Chris Murphy @ 2016-06-28 17:33 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Chris Murphy, linux-raid

On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 06/27/2016 06:42 PM, Chris Murphy wrote:
>> Hi,
>>
>> Drives with SCT ERC not supported or unset, result in potentially long
>> error recoveries for marginal or bad sectors: upwards of 180 second
>> recovers are suggested.
>>
>> The kernel's SCSI command timer default of 30 seconds, i.e.
>>
>> cat /sys/block/<dev>/device/timeout
>>
>> conspires to  undermine the deep recovery of most drives now on the
>> market. This by default misconfiguration results in problems list
>> regulars are very well aware of. It affects all raid configurations,
>> and even affects the non-RAID single drive use case. And it does so in
>> a way that doesn't happen on either Windows or macOS. Basically it is
>> linux kernel induced data loss, the drive very possibly could present
>> the requested data upon deep recovery being permitted, but the
>> kernel's command timer is reached before recovery completes, and
>> obliterates any possibility of recovering that data. By default.
>>
>> This now seems to affect the majority of use cases. At one time 30
>> seconds might have been sane for a world with drives that had less
>> than 30 second recoveries for bad sectors. But that's no longer the
>> case.
>>
> 'Majority of use cases'.
> Hardly. I'm not aware of any issues here.

This list is rife with this now common misconfiguration. It
manifests on average about weekly, as a message from libata that it's
"hard resetting link". In every single case where the user is
instructed to either set SCT ERC lower than 30 seconds if possible, or
increase the kernel SCSI command timer well above 30 seconds (180 is
often recommended on this list), the user's problems start to
go away.
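
As a rough sketch of those two remedies: the helper below only builds
the command a user would run, it does not execute anything. The function
name is hypothetical; 70 deciseconds and 180 seconds are the values
mentioned in this thread, and `smartctl -l scterc,READ,WRITE` is the
smartmontools option for setting SCT ERC.

```python
def remedy_command(dev: str, erc_configurable: bool,
                   erc_ds: int = 70, long_timeout_s: int = 180) -> str:
    """Return the shell command for the applicable timeout-mismatch remedy.

    If the drive lets SCT ERC be set, cap its recovery time well below the
    30 second command timer; otherwise lengthen the kernel's command timer.
    """
    if erc_configurable:
        # Read and write recovery limits are given in deciseconds.
        return f"smartctl -l scterc,{erc_ds},{erc_ds} /dev/{dev}"
    return f"echo {long_timeout_s} > /sys/block/{dev}/device/timeout"
```

For a drive named `sdb` without configurable SCT ERC this yields
`echo 180 > /sys/block/sdb/device/timeout`.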

Now the md driver gets an explicit read failure from the drive, after
30 seconds, instead of a link reset. And this includes the LBA for the
bad sector, which is apparently what md needs in order to write the
fixup back to that drive.

However the manifestation of the problem and the nature of this list
self-selects the user reports. Of course people with failed mdadm
based RAID come here. But this problem is also manifesting on Btrfs
for the same reasons. It also manifests, more rarely, with users who
have just a single drive if the drive does "deep recovery" reads on
marginally bad sectors, but the kernel flips out at 30 seconds
preventing that recovery. Of course not every drive model has such
deep recoveries, but by now it's extremely common. I have yet to see a
single consumer hard drive, ever, configured out of the box with SCT
ERC enabled.

> The problem with SCT ERC (or TLER or whatever the current acronym of
> the day is called) is that it's a non-standard setting, where every
> vendor basically does its own thing.
> Plus you can only influence this on higher-end disks; on others you are
> at the mercy of the drive firmware, hoping you got the timeout right.

WDC Scorpio Blue laptop drive supports SCT ERC. But it's disabled. Not
a high end drive.

TOSHIBA MQ01ABD100, also an inexpensive laptop drive, supports SCT
ERC, is disabled, not a high end drive.

Samsung 840 EVO, inexpensive SSD, supports SCT ERC, is disabled, not a
high end drive.

That the maximum recovery time is unpublished or difficult to
determine is beside the point. Clearly 30 seconds for the command
timer isn't long enough or this list wouldn't be full of problems
resulting directly from link resets obscuring the actual problem and
fix: either recovering the data, or explicitly failing with a read
error and an LBA so that md (or even Btrfs) can do their job and
overwrite the bad sector thereby causing in-drive remapping by its
firmware.

When this doesn't happen, those bad sectors just accumulate. And it's
a time bomb for data loss waiting to happen.

> Can you post a message log detailing this problem?

http://www.spinics.net/lists/raid/msg50289.html

There are hundreds, maybe thousands, of these on this list alone in
the form of "raid 5 failure help me recover my data!" because what's
happening is the bad sectors accumulate, finally one drive dies, and
of the remaining drives that survive one or more have one or more bad
sectors that were permitted to persist despite scrubs. And that's
because the kernel is f'ing resetting the goddamn link instead of
waiting for the drive to do its job and either recover the data or
explicitly report a read error.

The 30 second default is simply impatient.

Just over the weekend Phil Turmel posted an email with a bunch of back
reading on the subject of timeout mismatches. I've lost track of how
many user emails he's replied to, discovering this common
misconfiguration, getting it straightened out, and more often
than not helping the user recover data that otherwise would have been
lost *because* of hard link resets instead of explicit read errors.

http://www.spinics.net/lists/raid/msg52789.html

He isn't the only list regular who helps educate users tirelessly with
this very repetitive work around for a very old misconfiguration that
as far as I can tell only exists on Linux. And it's the default
behavior.

Now we could say that 30 seconds is already too long, and 180 seconds
is just insane. But that's the reality of how a massive pile of
consumer hard drives actually behave. They can do so called "deep
recoveries" that take minutes during which time they appear to hang.

Usually recoveries don't take minutes. But they can take minutes. And
that's where the problem comes in. I don't see why the user should be
the one punished by the kernel, which is in effect what a 30 second
default command timer is doing.

Perhaps there's a better way to do this than change the default
timeout in the kernel? Maybe what we need is an upstream udev rule
that polls SCT ERC for each drive, and if it's
disabled/unsupported/unknown then it sets a much higher command timer
for that block device. And maybe it only does this on USB and SATA.
For anything enterprise or NAS grade, they do report (at least to
smartctl) SCT ERC in deciseconds. The most common value is 70
deciseconds, so a 30 second command timer is OK. Maybe it could even
be lower but that's a separate optimization conversation.
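
The decision such a rule would make could look roughly like the sketch
below. It assumes `smartctl -l scterc` prints a line of the form
`Read:     70 (7.0 seconds)` when ERC is enabled and `Read: Disabled`
otherwise; that output format is recalled from smartmontools and should
be verified, and the function names are hypothetical.

```python
import re
from typing import Optional


def parse_scterc_read(smartctl_output: str) -> Optional[int]:
    """Return the read ERC limit in deciseconds, or None if it is
    disabled, unsupported, or not reported."""
    match = re.search(r"^\s*Read:\s+(\d+)\s+\(", smartctl_output, re.MULTILINE)
    return int(match.group(1)) if match else None


def choose_command_timeout(erc_ds: Optional[int], default_s: int = 30,
                           long_s: int = 180) -> int:
    """Pick a command timer (seconds) for one block device.

    Keep the default when the drive gives up sooner than the timer would
    fire; otherwise fall back to the long value suggested on this list.
    """
    if erc_ds is None:
        return long_s
    if erc_ds / 10 < default_s:  # deciseconds -> seconds
        return default_s
    return long_s
```

With the common 70 decisecond setting (7 seconds) the default timer is
kept; a drive reporting `Disabled` gets 180 seconds.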

In any case, the current situation is pretty much crap for the user.
And the idea we can educate users on what to buy isn't working, and
the esoteric crap they need to change to avoid the carnage from this
misconfiguration is still mostly unknown even to seasoned sysadmins
and uber storage geeks. They have no idea this is the way things are,
until they have a problem, come on this list, and get schooled. It's a
big reason why so many people have thrown raid 6 at the problem, which
really just papers over the real issue by throwing more redundancy at
it. But this list has in fact seen raid 6 implosions as a result of
this problem where two drives fail, and a 3rd drive has bad sectors
allowed to accumulate because of this misconfiguration and the array
collapses.

> We surely have ways of influencing the timeout, but first we need to
> understand what actually is happening.

I think the regulars on this list understand what's actually
happening. Users are buying cheap drives that were never designed for,
or are even explicitly excluded from, use in RAID 5 or RAID 6. But the
problem also impacts non-RAID users, linear/concat layouts, and RAID 0.

It even impacts Btrfs DUP profile, where there are two copies of
metadata on disk. If one of those fs metadata sectors reads slow
enough, the drive gets reset, the command queue is flushed and now the
fs has to rerequest everything *and* it has no idea, due to lack of a
read error, where to get the mirrored copy of metadata on that drive,
and no idea where to write it back to in order to fix the slow sector
read.

It screws users who merely use ext4, because instead of getting a slow
computer, they get one that starts to face plant with obscure messages
like link resets. The problem isn't the link. The problem is bad
sectors. But they don't see that message because the link reset
happens before the drive reports the read failure.

Where are Phil and Stan to back me up on this?

-- 
Chris Murphy

* Re: URE, link resets, user hostile defaults
  2016-06-28 17:33   ` Chris Murphy
@ 2016-06-28 18:28     ` Phil Turmel
  2016-06-28 20:46       ` Wols Lists
  2016-06-29  6:01     ` Hannes Reinecke
  2016-06-29 12:17     ` Zygo Blaxell
  2 siblings, 1 reply; 16+ messages in thread
From: Phil Turmel @ 2016-06-28 18:28 UTC (permalink / raw)
  To: Chris Murphy, Hannes Reinecke; +Cc: linux-raid

On 06/28/2016 01:33 PM, Chris Murphy wrote:

> Perhaps there's a better way to do this than change the default
> timeout in the kernel? Maybe what we need is an upstream udev rule
> that polls SCT ERC for each drive, and if it's
> disabled/unsupported/unknown then it sets a much higher command timer
> for that block device. And maybe it only does this on USB and SATA.
> For anything enterprise or NAS grade, they do report (at least to
> smartctl) SCT ERC in deciseconds. The most common value is 70
> deciseconds, so a 30 second command timer is OK. Maybe it could even
> be lower but that's a separate optimization conversation.

When Neil retired from maintainership, I mentioned that I would take a
stab at this.  You're right, just setting the kernel default timeout to
180 would be a regression.  If I recall correctly, there are network
services that would disconnect if storage stacks could delay that long
before replying, whether good or bad.

So a device discovery process that examines the drive's parameter pages
and makes an intelligent decision would be the way to go.  But as you
can see, I haven't dug into the ata & scsi layers to figure it out yet.
 It won't hurt my feelings if someone beats me to it.

Phil

* Re: URE, link resets, user hostile defaults
  2016-06-28 18:28     ` Phil Turmel
@ 2016-06-28 20:46       ` Wols Lists
  2016-06-28 22:17         ` Chris Murphy
  0 siblings, 1 reply; 16+ messages in thread
From: Wols Lists @ 2016-06-28 20:46 UTC (permalink / raw)
  To: Phil Turmel, Chris Murphy, Hannes Reinecke; +Cc: linux-raid

On 28/06/16 19:28, Phil Turmel wrote:
> On 06/28/2016 01:33 PM, Chris Murphy wrote:
> 
>> > Perhaps there's a better way to do this than change the default
>> > timeout in the kernel? Maybe what we need is an upstream udev rule
>> > that polls SCT ERC for each drive, and if it's
>> > disabled/unsupported/unknown then it sets a much higher command timer
>> > for that block device. And maybe it only does this on USB and SATA.
>> > For anything enterprise or NAS grade, they do report (at least to
>> > smartctl) SCT ERC in deciseconds. The most common value is 70
>> > deciseconds, so a 30 second command timer is OK. Maybe it could even
>> > be lower but that's a separate optimization conversation.
> When Neil retired from maintainership, I mentioned that I would take a
> stab at this.  You're right, just setting the kernel default timeout to
> 180 would be a regression.  If I recall correctly, there are network
> services that would disconnect if storage stacks could delay that long
> before replying, whether good or bad.
> 
> So a device discovery process that examines the drive's parameter pages
> and makes an intelligent decision would be the way to go.  But as you
> can see, I haven't dug into the ata & scsi layers to figure it out yet.
>  It won't hurt my feelings if someone beats me to it.

Talking off the top of my head :-) would it be possible to spawn a
kernel thread - if it takes longer than an aggressive time-out - that
just waits for far longer then rewrites it if the read finally completes?

In other words, wait for say the 70 deciseconds, then spawn the rewrite
thread, then continue waiting until whatever timeout. The thread could
actually not even time out but just wait for the drive to time out. If
the drive (eventually) responds rather than timing out then the rewrite
would hopefully fix the potential impending URE.

So we'd need two timeouts really. Timeout 1 says "if it takes longer
than this, do a background rewrite when it finally succeeds", and
timeout 2 says "if it takes longer than this, return an error, but let
the rewrite thread continue to wait".

Cheers,
Wol

* Re: URE, link resets, user hostile defaults
  2016-06-28 20:46       ` Wols Lists
@ 2016-06-28 22:17         ` Chris Murphy
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2016-06-28 22:17 UTC (permalink / raw)
  To: Wols Lists; +Cc: Phil Turmel, Chris Murphy, Hannes Reinecke, linux-raid

On Tue, Jun 28, 2016 at 2:46 PM, Wols Lists <antlists@youngman.org.uk> wrote:
> On 28/06/16 19:28, Phil Turmel wrote:
>> On 06/28/2016 01:33 PM, Chris Murphy wrote:
>>
>>> > Perhaps there's a better way to do this than change the default
>>> > timeout in the kernel? Maybe what we need is an upstream udev rule
>>> > that polls SCT ERC for each drive, and if it's
>>> > disabled/unsupported/unknown then it sets a much higher command timer
>>> > for that block device. And maybe it only does this on USB and SATA.
>>> > For anything enterprise or NAS grade, they do report (at least to
>>> > smartctl) SCT ERC in deciseconds. The most common value is 70
>>> > deciseconds, so a 30 second command timer is OK. Maybe it could even
>>> > be lower but that's a separate optimization conversation.
>> When Neil retired from maintainership, I mentioned that I would take a
>> stab at this.  You're right, just setting the kernel default timeout to
>> 180 would be a regression.  If I recall correctly, there are network
>> services that would disconnect if storage stacks could delay that long
>> before replying, whether good or bad.
>>
>> So a device discovery process that examines the drive's parameter pages
>> and makes an intelligent decision would be the way to go.  But as you
>> can see, I haven't dug into the ata & scsi layers to figure it out yet.
>>  It won't hurt my feelings if someone beats me to it.
>
> Talking off the top of my head :-) would it be possible to spawn a
> kernel thread - if it takes longer than an aggressive time-out - that
> just waits for far longer then rewrites it if the read finally completes?
>
> In other words, wait for say the 70 deciseconds, then spawn the rewrite
> thread, then continue waiting until whatever timeout. The thread could
> actually not even time out but just wait for the drive to time out. If
> the drive (eventually) responds rather than timing out then the rewrite
> would hopefully fix the potential impending URE.

I do not think the hang comes from the kernel, but from the drive
itself, during these deep recovery reads. I think the whole drive does
a big fat "look at the hand" while it deeply considers, many, many,
many thousands of times, how the F to recover this one goddamn sector.
And until it recovers it (sometimes wrongly), or gives up and reports
a read error, the drive responds to nothing at all, as I understand
it. Hence the hard link reset ends up happening.

If I'm right, threading this in the kernel won't help. It needs to be
threaded in the drive. And I'm also pretty sure that SAS drives have
command queue independence, don't have this problem, and can have
individual commands cancelled, where SATA is S.O.L.

Over on the Btrfs list someone wondered if this hang can just be
reinterpreted as always being the result of bad sectors, the kernel
knows what's pending in the drive command queue, resets the drive, and
pre-emptively reconstructs and overwrites every single LBA for every
command that was stuck in the queue. And I'm like, well that's not
very accurate is it?  That's like taking a baseball bat to a tick.
Assuming an unresponsive drive needs a pile of sectors overwritten
might actually piss off that drive, or its controller, and cause other
problems with the storage stack for all we know.

Anyway...

>
> So we'd need two timeouts really. Timeout 1 says "if it takes longer
> than this, do a background rewrite when it finally succeeds", and
> timeout 2 says "if it takes longer than this, return an error, but let
> the rewrite thread continue to wait".

The idea I had was similar, only applying to storage arrays where
there's redundancy. In that case, the first timeout produces an
informational message about which LBA range is experiencing a read
delay. And that would permit an upper layer to just preemptively
overwrite those slow LBAs.

This is bad though for the single drive use case, or even
linear/concat, and RAID 0 where the data on the slow sector really
must be read or you get EIO or whatever.

But this sort of workaround requires lower layers knowing how the
upper layers are organized and I don't know that there's a good way to
work that out.

I think we just poll the drive for SCT ERC and based on what comes
back, make a one size fits all decision for that block device. It can
hardly be much worse than now where "hard resetting link" doesn't
really stand out as an oh fuck moment. It just gets lost in other
kernel messages. At least by 180 seconds, there will be truth in
kernel messages that the drive is having read or write errors, even as
depending services are getting mad at all the delays. They're going to
get delayed anyway, just in a different way.

-- 
Chris Murphy

* Re: URE, link resets, user hostile defaults
  2016-06-28 17:33   ` Chris Murphy
  2016-06-28 18:28     ` Phil Turmel
@ 2016-06-29  6:01     ` Hannes Reinecke
  2016-06-29 10:48       ` Pasi Kärkkäinen
  2016-06-29 12:17     ` Zygo Blaxell
  2 siblings, 1 reply; 16+ messages in thread
From: Hannes Reinecke @ 2016-06-29  6:01 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-raid

On 06/28/2016 07:33 PM, Chris Murphy wrote:
> On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
>> On 06/27/2016 06:42 PM, Chris Murphy wrote:
>>> Hi,
>>>
>>> Drives with SCT ERC not supported or unset, result in potentially long
>>> error recoveries for marginal or bad sectors: upwards of 180 second
>>> recovers are suggested.
>>>
>>> The kernel's SCSI command timer default of 30 seconds, i.e.
>>>
>>> cat /sys/block/<dev>/device/timeout
>>>
>>> conspires to  undermine the deep recovery of most drives now on the
>>> market. This by default misconfiguration results in problems list
>>> regulars are very well aware of. It affects all raid configurations,
>>> and even affects the non-RAID single drive use case. And it does so in
>>> a way that doesn't happen on either Windows or macOS. Basically it is
>>> linux kernel induced data loss, the drive very possibly could present
>>> the requested data upon deep recovery being permitted, but the
>>> kernel's command timer is reached before recovery completes, and
>>> obliterates any possibility of recovering that data. By default.
>>>
>>> This now seems to affect the majority of use cases. At one time 30
>>> seconds might have been sane for a world with drives that had less
>>> than 30 second recoveries for bad sectors. But that's no longer the
>>> case.
>>>
>> 'Majority of use cases'.
>> Hardly. I'm not aware of any issues here.
> 
> This list is prolific with this now common misconfiguration. It
> manifests on average about weekly, as a message from libata that it's
> "hard resetting link". In every single case where the user is
> instructed to either set SCT ERC lower than 30 seconds if possible, or
> increase the kernel SCSI command timer well above 30 seconds (180 is
> often recommended on this list), suddenly the user's problems start to
> go away.
> 
> Now the md driver gets an explicit read failure from the drive, after
> 30 seconds, instead of a link reset. And this includes the LBA for the
> bad sector, which is apparently what md wants to write the fixup back
> to that drive.
> 
> However the manifestation of the problem and the nature of this list
> self-selects the user reports. Of course people with failed mdadm
> based RAID come here. But this problem is also manifesting on Btrfs
> for the same reasons. It also manifests, more rarely, with users who
> have just a single drive if the drive does "deep recovery" reads on
> marginally bad sectors, but the kernel flips out at 30 seconds
> preventing that recovery. Of course not every drive model has such
> deep recoveries, but by now it's extremely common. I have yet to see a
> single consumer hard drive, ever, configured out of the box with SCT
> ERC enabled.
> 
So we should rather implement SCT ERC support in libata, and set ERC to
the scsi command timeout, no?
Then the user could tweak the scsi command timeout however he likes it
to, and that timeout would be reflected into the ERC setting.

And then we could add an initialisation bit which reads the current ERC
values, increasing the SCSI command timeout as required.
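
A small sketch of the coupling proposed here, in both directions. The
one second margin between the drive's ERC limit and the command timer is
an assumption added for illustration, as are the function names; nothing
like this is claimed to exist in libata.

```python
def erc_from_timeout(timeout_s: int, margin_s: int = 1) -> int:
    """ERC limit (deciseconds) derived from the SCSI command timeout,
    kept `margin_s` below it so the drive reports the error first."""
    if timeout_s <= margin_s:
        raise ValueError("timeout must exceed the margin")
    return (timeout_s - margin_s) * 10


def timeout_from_erc(current_timeout_s: int, erc_ds: int,
                     margin_s: int = 1) -> int:
    """At initialisation, raise the command timeout if the drive's
    current ERC limit would otherwise outlast it."""
    erc_s = -(-erc_ds // 10)  # deciseconds -> seconds, rounded up
    return max(current_timeout_s, erc_s + margin_s)
```

With these assumptions a 30 second timeout maps to an ERC limit of 290
deciseconds, and a drive already set to 70 deciseconds leaves the 30
second timeout unchanged.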

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: URE, link resets, user hostile defaults
  2016-06-29  6:01     ` Hannes Reinecke
@ 2016-06-29 10:48       ` Pasi Kärkkäinen
  0 siblings, 0 replies; 16+ messages in thread
From: Pasi Kärkkäinen @ 2016-06-29 10:48 UTC (permalink / raw)
  To: Hannes Reinecke; +Cc: Chris Murphy, linux-raid

On Wed, Jun 29, 2016 at 08:01:56AM +0200, Hannes Reinecke wrote:
> On 06/28/2016 07:33 PM, Chris Murphy wrote:
> > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
> >> On 06/27/2016 06:42 PM, Chris Murphy wrote:
> >>> Hi,
> >>>
> >>> Drives with SCT ERC not supported or unset, result in potentially long
> >>> error recoveries for marginal or bad sectors: upwards of 180 second
> >>> recovers are suggested.
> >>>
> >>> The kernel's SCSI command timer default of 30 seconds, i.e.
> >>>
> >>> cat /sys/block/<dev>/device/timeout
> >>>
> >>> conspires to  undermine the deep recovery of most drives now on the
> >>> market. This by default misconfiguration results in problems list
> >>> regulars are very well aware of. It affects all raid configurations,
> >>> and even affects the non-RAID single drive use case. And it does so in
> >>> a way that doesn't happen on either Windows or macOS. Basically it is
> >>> linux kernel induced data loss, the drive very possibly could present
> >>> the requested data upon deep recovery being permitted, but the
> >>> kernel's command timer is reached before recovery completes, and
> >>> obliterates any possibility of recovering that data. By default.
> >>>
> >>> This now seems to affect the majority of use cases. At one time 30
> >>> seconds might have been sane for a world with drives that had less
> >>> than 30 second recoveries for bad sectors. But that's no longer the
> >>> case.
> >>>
> >> 'Majority of use cases'.
> >> Hardly. I'm not aware of any issues here.
> > 
> > This list is prolific with this now common misconfiguration. It
> > manifests on average about weekly, as a message from libata that it's
> > "hard resetting link". In every single case where the user is
> > instructed to either set SCT ERC lower than 30 seconds if possible, or
> > increase the kernel SCSI command timer well above 30 seconds (180 is
> > often recommended on this list), suddenly the user's problems start to
> > go away.
> > 
> > Now the md driver gets an explicit read failure from the drive, after
> > 30 seconds, instead of a link reset. And this includes the LBA for the
> > bad sector, which is apparently what md wants to write the fixup back
> > to that drive.
> > 
> > However the manifestation of the problem and the nature of this list
> > self-selects the user reports. Of course people with failed mdadm
> > based RAID come here. But this problem is also manifesting on Btrfs
> > for the same reasons. It also manifests, more rarely, with users who
> > have just a single drive if the drive does "deep recovery" reads on
> > marginally bad sectors, but the kernel flips out at 30 seconds
> > preventing that recovery. Of course not every drive model has such
> > deep recoveries, but by now it's extremely common. I have yet to see a
> > single consumer hard drive, ever, configured out of the box with SCT
> > ERC enabled.
> > 
> So we should rather implement SCT ERC support in libata, and set ERC to
> the scsi command timeout, no?
> Then the user could tweak the scsi command timeout however he likes it
> to, and that timeout would be reflected into the ERC setting.
> 
> And then we could add an initialisation bit which reads the current ERC
> values, increasing the SCSI command timeout as required.
> 

But this still leaves the "consumer" (non-NAS, non-RAID) drives broken by
default, until the user raises the SCSI command timeout for the disk to
a much bigger value (longer than the drive's internal timeout, whatever
it is, 180 seconds or so)?

-- Pasi


> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke		   Teamlead Storage & Networking
> hare@suse.de			               +49 911 74053 688
> SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
> GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
> HRB 21284 (AG Nürnberg)


* Re: URE, link resets, user hostile defaults
  2016-06-28 17:33   ` Chris Murphy
  2016-06-28 18:28     ` Phil Turmel
  2016-06-29  6:01     ` Hannes Reinecke
@ 2016-06-29 12:17     ` Zygo Blaxell
  2016-06-29 18:16       ` Edward Kuns
  2016-07-04 21:43       ` Pasi Kärkkäinen
  2 siblings, 2 replies; 16+ messages in thread
From: Zygo Blaxell @ 2016-06-29 12:17 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Hannes Reinecke, linux-raid

On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
> > Can you post a message log detailing this problem?
>
> Just over the weekend Phil Turmel posted an email with a bunch of back
> reading on the subject of timeout mismatches for someone to read. I've
> lost track of how many user emails he's replied to, discovering this
> common misconfiguration, and get it straightened out and more often
> than not helping the user recover data that otherwise would have been
> lost *because* of hard link resetting instead of explicit read errors.

OK, but the two links you provided are not examples of these.

> http://www.spinics.net/lists/raid/msg50289.html

This one is basically a software or pilot-error problem that led to a
partition table being destroyed (with a dash of terrible advice along
the way, like "pull two disks out of the machine and see if the array
recovers").  The one SATA link reset in the logs took all of 9ms to
report a drive error about 4 seconds after boot.  Nothing about this
would be affected by changing the 30-second SATA timeout.

> http://www.spinics.net/lists/raid/msg52789.html

This one is a RAID5 array that was in degraded mode for a *year* before
it was finally taken out by a second disk failure.  Data loss is the
expected outcome given those conditions--you don't get to keep your
data if you ignore drive failures for a year!  Changing the timeout
to expose latent UREs could not have helped in that case--errors were
already detected, but the admin ignored their monitoring responsibility
and just left the array to die.

> He isn't the only list regular who helps educate users tirelessly with
> this very repetitive work around

He repeats it a lot, to be sure, and he's not wrong--but it doesn't seem
to be relevant in those specific examples.  Timeout mismatch mitigation
is presented before any causal analysis of the reported failure.

There's a use case for the long timeout in situations where the system
is no longer healthy and ddrescue/myrescue-style tools are in play.

In redundant setups that are still healthy, the time to error detection
should be as short as possible so repair can start sooner, while still
long enough to avoid crazy numbers of false positives.  Unfortunately
that's not what seems to happen if the Linux-side timeout is shortened.

> for a very old misconfiguration that
> as far as I can tell only exists on Linux. And it's the default
> behavior.
[...]
> Usually recoveries don't take minutes. But they can take minutes. And
> that's where the problem comes in. I don't see why the user should be
> the one punished by the kernel, which is in effect what a 30 second
> default command timer is doing.

Long timeouts don't really serve anyone, even in single-disk cases.

I was once presented a machine with an obvious disk failure--painfully
slow multi-minute application startup times, and the disk was making loud
clicking/rattling noises--but the drive was never reporting any problems
to the OS (Windows, as it happened).  The machine's owners would not
believe that its disk had failed due to the lack of reported errors, and
would not authorize a test build with a new disk restored from backups
(an expensive proposition since there weren't any, and making a copy
of this broken disk would have taken days if it was successful at all).
The owners were convinced it was some sort of software problem.  Finally I
told the users to run a drive self-test over a weekend and--after about
40 hours and only 4% of the disk tested--it finally found a bad sector
it couldn't read, and generated an error code that would get the drive
replaced.  Apparently the machine's users had been living with this
for three months before I got there, and the machine was unusable the
whole time.  A much shorter error timeout would at least have provided
evidence of a hardware problem, even if it was the wrong one.


* Re: URE, link resets, user hostile defaults
  2016-06-29 12:17     ` Zygo Blaxell
@ 2016-06-29 18:16       ` Edward Kuns
  2016-07-01 20:43         ` Chris Murphy
  2016-07-04 21:43       ` Pasi Kärkkäinen
  1 sibling, 1 reply; 16+ messages in thread
From: Edward Kuns @ 2016-06-29 18:16 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Hannes Reinecke, Linux-RAID

On Wed, Jun 29, 2016 at 7:17 AM, Zygo Blaxell
<u0oo5pgu@umail.furryterror.org> wrote:
> OK, but the two links you provided are not examples of these.

But there *are* plenty of examples of this.  I've run into this
personally, before I knew to specifically check the ERC/TLER/whatever
configuration on all my drives and pro-actively configure them
properly.

When the only two options are 1) long kernel timeout and URE is caught
and fixed, or 2) short kernel timeout and the drive is detected as
failed and kicked from all arrays, then I'll take #1 please.
Obviously, trying to detect misconfiguration and drives that don't
support ERC/TLER and fixing the timeout accordingly would be better.
I agree with others, the current default behavior is unintentionally
user-hostile.
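
To make the "detect and fix" idea concrete, here is a minimal Python
sketch of such a policy. It is not an existing kernel, udev or mdadm
feature. The names plan_for_drive, Plan, DEFAULT_ERC_DS and MARGIN_S are
hypothetical, and the 7.0 second ERC value and 5 second margin are
assumptions chosen for illustration. Only the 30 second kernel default
and the 180 second figure come from this thread. ERC values are in units
of 100 ms, as smartctl -l scterc uses.

```python
from typing import NamedTuple, Optional

KERNEL_DEFAULT_TIMEOUT_S = 30  # SCSI command timer default cited in this thread
DEEP_RECOVERY_TIMEOUT_S = 180  # figure suggested in this thread for drives without ERC
DEFAULT_ERC_DS = 70            # assumption: 7.0 s, expressed in 100 ms units
MARGIN_S = 5                   # assumption: slack between drive and kernel timers


class Plan(NamedTuple):
    erc_ds: Optional[int]  # ERC value to set on the drive, None = leave it alone
    timeout_s: int         # kernel command timer to use, in seconds


def plan_for_drive(erc_supported: bool, erc_ds: Optional[int]) -> Plan:
    """Pick settings so that the drive gives up before the kernel does.

    erc_ds is the drive's current ERC limit in deciseconds,
    or None/0 when ERC is disabled.
    """
    if not erc_supported:
        # No ERC at all: the only knob left is the kernel timer.
        return Plan(None, DEEP_RECOVERY_TIMEOUT_S)
    if not erc_ds:
        # ERC supported but disabled: enable it and keep the kernel default.
        return Plan(DEFAULT_ERC_DS, KERNEL_DEFAULT_TIMEOUT_S)
    # ERC already set: make sure the kernel waits longer than the drive.
    erc_s = -(-erc_ds // 10)  # round up to whole seconds
    return Plan(None, max(KERNEL_DEFAULT_TIMEOUT_S, erc_s + MARGIN_S))
```

With these assumed values, a drive without ERC gets a 180 second command
timer, a drive with ERC disabled gets ERC set to 7.0 seconds, and a drive
already set to 40 seconds gets a 45 second command timer.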

> Long timeouts don't really serve anyone, even in single-disk cases.

This statement is too dogmatic.  It depends on the drive.  For a drive
with the proper features and settings, that is guaranteed to respond
in a few seconds unless it has truly totally failed, I agree with you.
For a drive with those features but misconfigured (e.g., by default),
best is to configure it properly, so in that case I agree with you but
changes are needed somewhere to get the configuration to occur
automatically.  For a consumer drive that lacks those features
entirely, I disagree with you.  Although for that case, it would be
worth triggering an alarm of some sort, perhaps similar to the
emails generated when an array degrades.  That would let the user know
that the drive is responding very slowly (probably indicating
recoverable read errors) and may fail soon.  Again, changes are needed
to do that.
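
A rough sketch of what such a "slow drive" alarm could look like, in
Python. Nothing like this exists in md today as far as this thread
establishes. The function names, the 5 second "slow" threshold and the
three-strikes rule are all hypothetical, and the per-command latencies
are assumed to be collected by some other means.

```python
from typing import Dict, List


def classify_latency(latency_s: float,
                     slow_after_s: float = 5.0,
                     timeout_s: float = 180.0) -> str:
    """Label one completed command by how long the drive took."""
    if latency_s >= timeout_s:
        return "timeout"
    if latency_s >= slow_after_s:
        return "slow"
    return "ok"


def drives_to_warn_about(samples: Dict[str, List[float]],
                         min_slow: int = 3) -> List[str]:
    """Return the drives with at least min_slow slow (or worse) commands."""
    flagged = []
    for dev, latencies in samples.items():
        slow = sum(1 for t in latencies if classify_latency(t) != "ok")
        if slow >= min_slow:
            flagged.append(dev)
    return sorted(flagged)
```

A monitor could feed the result into the same notification path that
already reports degraded arrays.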

              Eddie


* Re: URE, link resets, user hostile defaults
  2016-06-29 18:16       ` Edward Kuns
@ 2016-07-01 20:43         ` Chris Murphy
  2016-07-04  6:00           ` Hannes Reinecke
  0 siblings, 1 reply; 16+ messages in thread
From: Chris Murphy @ 2016-07-01 20:43 UTC (permalink / raw)
  To: Edward Kuns; +Cc: Zygo Blaxell, Chris Murphy, Hannes Reinecke, Linux-RAID

Here's a fun one of these I just got off the Fedora users mailing list
with a laptop drive that's apparently hanging on *write*. This I would
not expect to take a long time for a drive to figure out, but... there
are more resets than there are write errors, and in fact there's no
discrete write error from the drive, all we know is the failed command
is a WRITE command.

What seems to happen is, everything in the queue gets obliterated in
the reset, and when ext4 finds out everything failed, not just one
write, it barfs and goes read only.

http://pastebin.com/3JAL297z

How might this turn out differently if the drive reported a single
discrete write error? I don't know how any file system tolerates this
because it's so rare. Would ext4 just try to write again? Would it try
to write to the same sector or another one? Or maybe the write finally
succeeds by resulting in a remap (?) But this sure is dang slow to
recover from a bad write. I don't understand the engineering rationale
for this. Maybe it's a firmware bug?

Chris Murphy


* Re: URE, link resets, user hostile defaults
  2016-07-01 20:43         ` Chris Murphy
@ 2016-07-04  6:00           ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2016-07-04  6:00 UTC (permalink / raw)
  To: Chris Murphy, Edward Kuns; +Cc: Zygo Blaxell, Linux-RAID

On 07/01/2016 10:43 PM, Chris Murphy wrote:
> Here's a fun one of these I just got off the Fedora users mailing list
> with a laptop drive that's apparently hanging on *write*. This I would
> not expect to take a long time for a drive to figure out, but... there
> are more resets than there are write errors, and in fact there's no
> discrete write error from the drive, all we know is the failed command
> is a WRITE command.
> 
> What seems to happen is, everything in the queue gets obliterated in
> the reset, and when ext4 finds out everything failed, not just one
> write, it barfs and goes read only.
> 
> http://pastebin.com/3JAL297z
> 
> How might this turn out differently if the drive reported a single
> discrete write error? I don't know how any file system tolerates this
> because it's so rare. Would ext4 just try to write again? Would it try
> to write to the same sector or another one? Or maybe the write finally
> succeeds by resulting in a remap (?) But this sure is dang slow to
> recover from a bad write. I don't understand the engineering rationale
> for this. Maybe it's a firmware bug?
> 
> 
Could be. At the very least it's an issue with EH interaction.
ATA COMRESET fails, i.e. libata EH fails to reset the SATA link.
Which is pretty terminal, so the device is set to offline afterwards.

This is most definitely an ATA issue, and doesn't really belong in this
context.
(Have you reported it on linux-ide?)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

* Re: URE, link resets, user hostile defaults
  2016-06-29 12:17     ` Zygo Blaxell
  2016-06-29 18:16       ` Edward Kuns
@ 2016-07-04 21:43       ` Pasi Kärkkäinen
  2016-08-19 10:00         ` Pasi Kärkkäinen
  2016-08-19 15:30         ` Chris Murphy
  1 sibling, 2 replies; 16+ messages in thread
From: Pasi Kärkkäinen @ 2016-07-04 21:43 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Hannes Reinecke, linux-raid

On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
> On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
> > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
> > > Can you post a message log detailing this problem?
> >
> > Just over the weekend Phil Turmel posted an email with a bunch of back
> > reading on the subject of timeout mismatches for someone to read. I've
> > lost track of how many user emails he's replied to, discovering this
> > common misconfiguration, and get it straightened out and more often
> > than not helping the user recover data that otherwise would have been
> > lost *because* of hard link resetting instead of explicit read errors.
> 
> OK, but the two links you provided are not examples of these.
> 

Here's one of the threads where Phil explains the issue:

http://marc.info/?l=linux-raid&m=133665797115876&w=2

quote:

"A very common report I see on this mailing list is people who have lost arrays 
where the drives all appear to be healthy.  
Given the large size of today's hard drives, even healthy drives will occasionally 
have an unrecoverable read error.

When this happens in a raid array with a desktop drive without SCTERC,
the driver times out and reports an error to MD.  MD proceeds to
reconstruct the missing data and tries to write it back to the bad
sector.  However, that drive is still trying to read the bad sector and
ignores the controller.  The write is immediately rejected.  BOOM!  The
*write* error ejects that member from the array.  And you are now
degraded.

If you don't notice the degraded array right away, you probably won't
notice until a URE on another drive pops up.  Once that happens, you
can't complete a resync to revive the array.

Running a "check" or "repair" on an array without TLER will have the
opposite of the intended effect: any URE will kick a drive out instead
of fixing it.

In the same scenario with an enterprise drive, or a drive with SCTERC
turned on, the drive read times out before the controller driver, the
controller never resets the link to the drive, and the followup write
succeeds.  (The sector is either successfully corrected in place, or
it is relocated by the drive.)  No BOOM."
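
The race described in that quote can be written down as a toy model.
This is only an illustration of the explanation above, not of the actual
md or libata code paths. The function name and the returned strings are
made up, and the numbers in the usage note are examples.

```python
def read_outcome(drive_gives_up_after_s: float, kernel_timeout_s: float) -> str:
    """Say which side gives up first on a bad sector, and what follows."""
    if drive_gives_up_after_s < kernel_timeout_s:
        # The drive returns a read error in time. MD reconstructs the data
        # and the follow-up write succeeds (fixed in place or relocated).
        return "read error reported, sector rewritten"
    # The kernel timer fires while the drive is still retrying. The
    # follow-up write is rejected and MD ejects the member from the array.
    return "command timed out, write rejected, member ejected"
```

With SCT ERC at 7 seconds and the 30 second default timer,
read_outcome(7, 30) takes the first branch. A desktop drive that retries
for 120 seconds takes the second branch with a 30 second timer, and the
first again once the timer is raised to 180 seconds.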

-- Pasi


* Re: URE, link resets, user hostile defaults
  2016-07-04 21:43       ` Pasi Kärkkäinen
@ 2016-08-19 10:00         ` Pasi Kärkkäinen
  2016-08-19 12:36           ` Phil Turmel
  2016-08-19 15:30         ` Chris Murphy
  1 sibling, 1 reply; 16+ messages in thread
From: Pasi Kärkkäinen @ 2016-08-19 10:00 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: Chris Murphy, Hannes Reinecke, linux-raid


ping

Let's not forget this thread :)


-- Pasi


* Re: URE, link resets, user hostile defaults
  2016-08-19 10:00         ` Pasi Kärkkäinen
@ 2016-08-19 12:36           ` Phil Turmel
  0 siblings, 0 replies; 16+ messages in thread
From: Phil Turmel @ 2016-08-19 12:36 UTC (permalink / raw)
  To: Pasi Kärkkäinen, Zygo Blaxell
  Cc: Chris Murphy, Hannes Reinecke, linux-raid

On 08/19/2016 06:00 AM, Pasi Kärkkäinen wrote:
> 
> ping
> 
> Let's not forget this thread :)

Not forgotten.  Just busy working to pay the bills... :-(



* Re: URE, link resets, user hostile defaults
  2016-07-04 21:43       ` Pasi Kärkkäinen
  2016-08-19 10:00         ` Pasi Kärkkäinen
@ 2016-08-19 15:30         ` Chris Murphy
  1 sibling, 0 replies; 16+ messages in thread
From: Chris Murphy @ 2016-08-19 15:30 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Zygo Blaxell, Chris Murphy, Hannes Reinecke, Linux-RAID

On Mon, Jul 4, 2016 at 3:43 PM, Pasi Kärkkäinen <pasik@iki.fi> wrote:
> On Wed, Jun 29, 2016 at 08:17:51AM -0400, Zygo Blaxell wrote:
>> On Tue, Jun 28, 2016 at 11:33:36AM -0600, Chris Murphy wrote:
>> > On Tue, Jun 28, 2016 at 12:33 AM, Hannes Reinecke <hare@suse.de> wrote:
>> > > Can you post a message log detailing this problem?
>> >
>> > Just over the weekend Phil Turmel posted an email with a bunch of back
>> > reading on the subject of timeout mismatches for someone to read. I've
>> > lost track of how many user emails he's replied to, discovering this
>> > common misconfiguration, and get it straightened out and more often
>> > than not helping the user recover data that otherwise would have been
>> > lost *because* of hard link resetting instead of explicit read errors.
>>
>> OK, but the two links you provided are not examples of these.
>>
>
> Here's one of the threads where Phil explains the issue:
>
> http://marc.info/?l=linux-raid&m=133665797115876&w=2
>
> quote:
>
>
> "A very common report I see on this mailing list is people who have lost arrays
> where the drives all appear to be healthy.
> Given the large size of today's hard drives, even healthy drives will occasionally
> have an unrecoverable read error.
>
> When this happens in a raid array with a desktop drive without SCTERC,
> the driver times out and reports an error to MD.  MD proceeds to
> reconstruct the missing data and tries to write it back to the bad
> sector.  However, that drive is still trying to read the bad sector and
> ignores the controller.  The write is immediately rejected.  BOOM!  The
> *write* error ejects that member from the array.  And you are now
> degraded.
>
> If you don't notice the degraded array right away, you probably won't
> notice until a URE on another drive pops up.  Once that happens, you
> can't complete a resync to revive the array.
>
> Running a "check" or "repair" on an array without TLER will have the
> opposite of the intended effect: any URE will kick a drive out instead
> of fixing it.
>
> In the same scenario with an enterprise drive, or a drive with SCTERC
> turned on, the drive read times out before the controller driver, the
> controller never resets the link to the drive, and the followup write
> succeeds.  (The sector is either successfully corrected in place, or
> it is relocated by the drive.)  No BOOM."


The more I think about this, the more I think the default command timer
for SATA and USB drives just needs to change. It is really the simplest
solution to the problem. Probing for device SCT ERC support and then
enabling it is risky, because some drive firmware is buggy. And it's an
open question whether the setting persists on all drives after suspend
(to RAM or disk).

A further problem: if SCT ERC is enabled by default and the user wants
to disable it for some reason, they might not be able to do so simply
from user space with smartctl -l scterc. I've encountered drives that
only accept one state change; changing it back to disabled causes the
device to "crash" and vanish off the SATA bus. Clearly a firmware bug.
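
Until the default changes, the timer can be raised per device through
sysfs. Below is a minimal Python sketch of that, assuming the
/sys/block/<dev>/device/timeout file mentioned at the start of this
thread. The function name, the sysfs_root parameter (there only so the
code can be tried against a scratch directory) and the 180 second
default are illustrative choices. Writing the real file needs root.

```python
from pathlib import Path


def raise_command_timeout(dev: str, minimum_s: int = 180,
                          sysfs_root: str = "/sys") -> int:
    """Raise the SCSI command timer for dev to at least minimum_s seconds.

    Returns the value in effect afterwards. An existing value that is
    already at or above minimum_s is left untouched.
    """
    path = Path(sysfs_root) / "block" / dev / "device" / "timeout"
    current = int(path.read_text().strip())
    if current < minimum_s:
        path.write_text(f"{minimum_s}\n")
        return minimum_s
    return current


# Example (as root): raise_command_timeout("sda")
```

The value written this way does not survive a reboot, so it would have
to be reapplied at boot or on hotplug, for instance from a udev rule.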



-- 
Chris Murphy

Thread overview: 16+ messages
2016-06-27 16:42 URE, link resets, user hostile defaults Chris Murphy
2016-06-28  6:33 ` Hannes Reinecke
2016-06-28 17:33   ` Chris Murphy
2016-06-28 18:28     ` Phil Turmel
2016-06-28 20:46       ` Wols Lists
2016-06-28 22:17         ` Chris Murphy
2016-06-29  6:01     ` Hannes Reinecke
2016-06-29 10:48       ` Pasi Kärkkäinen
2016-06-29 12:17     ` Zygo Blaxell
2016-06-29 18:16       ` Edward Kuns
2016-07-01 20:43         ` Chris Murphy
2016-07-04  6:00           ` Hannes Reinecke
2016-07-04 21:43       ` Pasi Kärkkäinen
2016-08-19 10:00         ` Pasi Kärkkäinen
2016-08-19 12:36           ` Phil Turmel
2016-08-19 15:30         ` Chris Murphy
