From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
To: Chris Murphy <lists@colorremedies.com>
Cc: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Andrei Borzenkov <arvidjaar@gmail.com>,
	Hugo Mills <hugo@carfax.org.uk>,
	kreijack@inwind.it, Roman Mamedov <rm@romanrm.net>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: Adventures in btrfs raid5 disk recovery
Date: Mon, 27 Jun 2016 21:52:46 -0400
Message-ID: <20160628015245.GI14667@hungrycats.org>
In-Reply-To: <CAJCQCtRbrLaXSGm9DA6FrHO9Pd9ozaaJpKd6HjMGXCojdxbZoQ@mail.gmail.com>

On Mon, Jun 27, 2016 at 04:30:23PM -0600, Chris Murphy wrote:
> On Mon, Jun 27, 2016 at 3:57 PM, Zygo Blaxell
> <ce3g8jdj@umail.furryterror.org> wrote:
> > On Mon, Jun 27, 2016 at 10:17:04AM -0600, Chris Murphy wrote:
> > If anything, I want the timeout to be shorter so that upper layers with
> > redundancy can get an EIO and initiate repair promptly, and admins can
> > get notified to evict chronic offenders from their drive slots, without
> > having to pay extra for hard disk firmware with that feature.
> 
> The drive totally thwarts this. It doesn't report back to the kernel
> what command is hung, as far as I'm aware. It just hangs and goes into
> a so called "deep recovery" there is no way to know what sector is
> causing the problem

I'm proposing that we just treat the link reset _as_ an EIO, unless
transparent link resets are required for link speed negotiation or something.
The drive wouldn't be thwarting anything, the host would just ignore it
(unless the drive doesn't respond to a link reset until after its internal
timeout, in which case nothing is saved by shortening the timeout).

> until the drive reports a read error, which will
> include the affected sector LBA.

It doesn't matter which sector.  Chances are good that it was more than
one of the outstanding requested sectors anyway.  Rewrite them all.

We know which sectors they are because somebody has an IO operation
waiting for a status on each of them (unless they're using AIO or some
other API where a request can be fired at a hard drive and the reply
discarded).  Notify all of them that their IO failed and move on.
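
To make the idea concrete, here is a minimal sketch of "fail everything
that was in flight when the link reset happened".  It is a toy model in
Python, not kernel code: HostQueue, submit() and link_reset() are
hypothetical names invented for illustration, and the only assumption is
that the host keeps a table of outstanding requests keyed by LBA.

```python
import errno
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Completion callback signature: (lba, status), status is 0 or a negative errno.
Completion = Callable[[int, int], None]


@dataclass
class HostQueue:
    """Hypothetical model of the requests a host has in flight to one drive."""

    outstanding: Dict[int, Completion] = field(default_factory=dict)

    def submit(self, lba: int, on_complete: Completion) -> None:
        # Remember who is waiting for a status on this sector.
        self.outstanding[lba] = on_complete

    def link_reset(self) -> List[int]:
        """Treat a link reset as EIO for every in-flight request.

        The drive never says which sector hung, so every waiter is told
        that its IO failed and the queue is emptied.
        """
        pending, self.outstanding = self.outstanding, {}
        failed = sorted(pending)
        for lba in failed:
            pending[lba](lba, -errno.EIO)
        return failed
```

A redundancy layer sitting above this would then reconstruct and rewrite
every LBA in the returned list, without caring which one actually hung.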

> Btrfs does have something of a work around for when things get slow,
> and that's balance, read and rewrite everything. The write forces
> sector remapping by the drive firmware for bad sectors.

It's a crude form of "resilvering" as ZFS calls it.

> > The upper layers could time the IOs, and make their own decisions based
> > on the timing (e.g. btrfs or mdadm could proactively repair anything that
> > took more than 10 seconds to read).  That might be a better approach,
> > since shortening the time to an EIO is only useful when you have a
> > redundancy layer in place to do something about them.
> 
> For RAID with redundancy, that's doable, although I have no idea what
> work is needed, or even if it's possible, to track commands in this
> manner, and fall back to some kind of repair mode as if it were a read
> error.

If btrfs sees EIO from a lower block layer it will try to reconstruct the
missing data (but not repair it).  If that happens during a scrub,
it will also attempt to rewrite the missing data over the original
offending sectors.  This happens every few months in my server pool,
and seems to be working even on btrfs raid5.

Last time I checked all the RAID implementations on Linux (ok, so that's
pretty much just md-raid) had some sort of repair capability.  lvm uses
(or can use) the md-raid implementation.  ext4 and xfs on naked disk
partitions will have problems, but that's because they were designed in
the 1990s when we were young and naive and still believed hard disks
would one day become reliable devices without buggy firmware.

> For single drives and RAID 0, the only possible solution is to not do
> link resets for up to 3 minutes and hope the drive returns the single
> copy of data.

So perhaps the timeout should be influenced by higher layers, e.g. if a
disk becomes part of a raid1, its timeout should be shortened by default,
while a timeout for a disk that is not used by a redundant layer should
be longer.
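
A rough sketch of that policy in Python, assuming the disk exposes the
SCSI command timer at /sys/block/<dev>/device/timeout (in seconds).
The two values are illustrative assumptions, not kernel defaults: 180
comes from the "up to 3 minutes" figure quoted above, and 10 is an
arbitrary short value.  pick_timeout() and apply_timeout() are
hypothetical helpers, and nothing in the kernel does this automatically
today.

```python
from pathlib import Path

# Assumed policy values (seconds); tune for the drives actually in use.
SHORT_TIMEOUT_S = 10    # disk sits under a redundant layer that can repair
LONG_TIMEOUT_S = 180    # single copy of the data: give the drive time to retry


def pick_timeout(has_redundancy: bool) -> int:
    """Return the command timeout a disk should get under this policy."""
    return SHORT_TIMEOUT_S if has_redundancy else LONG_TIMEOUT_S


def apply_timeout(device: str, has_redundancy: bool,
                  sysfs_root: Path = Path("/sys/block")) -> int:
    """Write the chosen timeout to <sysfs_root>/<device>/device/timeout.

    Needs root when pointed at the real sysfs; sysfs_root is a parameter
    so the function can be tried against a scratch directory first.
    """
    timeout = pick_timeout(has_redundancy)
    (sysfs_root / device / "device" / "timeout").write_text(f"{timeout}\n")
    return timeout
```

For example, apply_timeout("sdb", has_redundancy=True) would be run when
sdb joins a raid1, and apply_timeout("sdb", has_redundancy=False) when
it goes back to holding the only copy of the data.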

> Even in the case of Btrfs DUP, it's thwarted without a read error
> reported from the drive (or it returning bad data).

That case gets messy--different timeouts for different parts of the disk.
Probably not practical.



