Re: Adventures in btrfs raid5 disk recovery

From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Andrei Borzenkov <arvidjaar@gmail.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Mon, 27 Jun 2016 17:57:26 -0400
Message-ID: <20160627215726.GG14667@hungrycats.org>
In-Reply-To: <CAJCQCtQugDoR6fnPeion37FLS3LarjfP6dt+-Z3jPgLG0Xkmwg@mail.gmail.com>

On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 5:21 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
> > On 2016-06-25 12:44, Chris Murphy wrote:
> >> On Fri, Jun 24, 2016 at 12:19 PM, Austin S. Hemmelgarn
> >> <ahferroin7@gmail.com> wrote:
> >>
> >> OK but hold on. During scrub, it should read data, compute checksums
> >> *and* parity, and compare those to what's on-disk -> EXTENT_CSUM in
> >> the checksum tree, and the parity strip in the chunk tree. And if
> >> parity is wrong, then it should be replaced.
> >
> > Except that's horribly inefficient.  With limited exceptions involving
> > highly situational co-processors, computing a checksum of a parity block is
> > always going to be faster than computing parity for the stripe.  By using
> > that to check parity, we can safely speed up the common case of near zero
> > errors during a scrub by a pretty significant factor.
> 
> OK I'm in favor of that. Although somehow md gets away with this by
> computing and checking parity for its scrubs, and still manages to
> keep drives saturated in the process - at least HDDs, I'm not sure how
> it fares on SSDs.

A modest desktop CPU can compute raid6 parity at 6GB/sec, a less-modest
one at more than 10GB/sec.  Maybe parity computation becomes the
bottleneck with an array of SSDs on a slow CPU, but not much sooner.
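
To make that comparison concrete, here is a rough sketch in Python.
The function name and the per-drive throughput figures are assumptions
for illustration only; the 6GB/sec parity rate is the one quoted above.

```python
def scrub_bottleneck(parity_gb_per_s, drive_count, drive_gb_per_s):
    """Return which side limits a parity-verifying scrub, and the limit in GB/s.

    parity_gb_per_s: rate at which the CPU can compute parity.
    drive_count, drive_gb_per_s: assumed array size and per-drive read rate.
    """
    array_gb_per_s = drive_count * drive_gb_per_s
    if array_gb_per_s > parity_gb_per_s:
        # The drives can deliver data faster than the CPU can check it.
        return ("cpu", parity_gb_per_s)
    return ("drives", array_gb_per_s)


# Eight HDDs at an assumed 0.2 GB/s each: the drives are the limit.
print(scrub_bottleneck(6.0, 8, 0.2))
# Sixteen SSDs at an assumed 0.5 GB/s each: the CPU becomes the limit.
print(scrub_bottleneck(6.0, 16, 0.5))
```

With HDD-class numbers the CPU has plenty of headroom, which matches
the observation that md keeps spinning drives saturated during a scrub.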

> It just came up again in a thread over the weekend on linux-raid@. I'm
> going to ask while people are paying attention if a patch to change
> the 30 second time out to something a lot higher has ever been
> floated, what the negatives might be, and where to get this fixed if
> it wouldn't be accepted in the kernel code directly.

Defaults are defaults; they're not for everyone.  30 seconds is about
two minutes too short for an SMR drive's worst-case write latency, or
28 seconds too long for an OLTP system, or just right for an end-user's
personal machine with a low-energy desktop drive and a long spin-up time.
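
A small sketch of checking the current value against a latency target
follows.  It assumes the timeout under discussion is the SCSI command
timer that Linux exposes at /sys/block/<dev>/device/timeout; the
function names are made up for this example, and the 150 second and
2 second targets are only what the arithmetic above implies (30 + 120
and 30 - 28), not measured values.

```python
from pathlib import Path


def read_command_timeout(device, sysfs_root="/sys/block"):
    """Return the command timeout in seconds for e.g. 'sda', or None.

    None means the attribute is missing or unreadable (no such device,
    or a device type that does not expose this sysfs file).
    """
    path = Path(sysfs_root) / device / "device" / "timeout"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def timeout_gap_s(timeout_s, target_s):
    """Positive: the timeout is that many seconds too short for the target.
    Negative: the timeout is that many seconds too long."""
    return target_s - timeout_s


# Assumed targets implied by the text: SMR worst case and an OLTP budget.
print(timeout_gap_s(30, 150))
print(timeout_gap_s(30, 2))
# On a Linux box with a SCSI/SATA disk named sda this reports the live value.
print(read_command_timeout("sda"))
```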

Once a drive starts taking 30+ seconds to do I/O, I consider the drive
failed in the sense that it's too slow to meet latency requirements.
When the problem is that it's already taking too long, the solution is
not waiting even longer.  To put things in perspective, consider that
server hardware watchdog timeouts are typically 60 seconds by default
(if not maximum).

If anything, I want the timeout to be shorter so that upper layers with
redundancy can get an EIO and initiate repair promptly, and admins can
get notified to evict chronic offenders from their drive slots, without
having to pay extra for hard disk firmware with that feature.
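
As a sketch of what "get an EIO and initiate repair" means for an
upper layer, consider the following.  This is not how btrfs or md
implement it; the three callables are hypothetical stand-ins for
reading the primary copy, reconstructing from mirror or parity, and
rewriting the bad copy.

```python
import errno


def read_with_repair(read_primary, read_redundant, rewrite_primary):
    """Read a block; on EIO, rebuild it from redundancy and rewrite it.

    read_primary:    callable returning bytes, may raise OSError(EIO).
    read_redundant:  callable returning the same bytes from mirror/parity.
    rewrite_primary: callable taking bytes, overwrites the failed copy.
    Returns (data, repaired).
    """
    try:
        return read_primary(), False
    except OSError as exc:
        if exc.errno != errno.EIO:
            # Only I/O errors are handled here; anything else propagates.
            raise
    data = read_redundant()
    rewrite_primary(data)
    return data, True
```

The sooner the lower layer gives up and returns EIO, the sooner the
second half of this function gets to run.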

> *Ideally* I think we'd want two timeouts. I'd like to see commands
> have a timer that results in merely a warning that could be used by
> e.g. btrfs scrub to know "hey this sector range is 'slow' I'm going to
> write over those sectors". That's how bad sectors start out, they read
> slower and eventually go beyond 30 seconds and now it's all link
> resets. If the problem could be fixed before then... that's the best
> scenario.

What's the downside of a link reset?  Can the driver not just return
EIO for all the outstanding IOs in progress at reset, and let the upper
layers deal with it?  Or is the problem that the upper layers are all
horribly broken by EIOs, or drive firmware horribly broken by link resets?

The upper layers could time the IOs, and make their own decisions based
on the timing (e.g. btrfs or mdadm could proactively repair anything that
took more than 10 seconds to read).  That might be a better approach,
since shortening the time to an EIO is only useful when you have a
redundancy layer in place to do something about it.
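
A minimal userspace sketch of timing an IO and flagging it for
proactive repair is below.  The function name and the injectable clock
are assumptions made so the example can be tested; the 10 second
threshold is the figure from the paragraph above.  A real
implementation would live in the filesystem or md layer, not in
Python.

```python
import os
import time


def timed_pread(fd, length, offset, slow_after_s=10.0, clock=time.monotonic):
    """Read `length` bytes at `offset` and report how long the read took.

    Returns (data, elapsed_seconds, slow), where `slow` is True when the
    read succeeded but took longer than `slow_after_s`.  A caller with
    redundancy could rewrite the range when `slow` is set.
    """
    start = clock()
    data = os.pread(fd, length, offset)
    elapsed = clock() - start
    return data, elapsed, elapsed > slow_after_s
```

A caller would check the third element of the result and, if it is
set, queue the range for a rewrite instead of waiting for the drive to
hit the hard timeout.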

> The 2nd timer would be, OK the controller or drive just face planted, reset.
> 
> -- 
> Chris Murphy
> 
