Re: [RFC] Btrfs device and pool management (wip)

From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Anand Jain <anand.jain@oracle.com>,
	Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: [RFC] Btrfs device and pool management (wip)
Date: Mon, 30 Nov 2015 15:37:54 -0500
Message-ID: <565CB3A2.30705@gmail.com>
In-Reply-To: <CAJCQCtQPdzQG0She3yjxAmDmXSAphdMriW11Pe=BMrBRXYTh6Q@mail.gmail.com>

On 2015-11-30 15:17, Chris Murphy wrote:
> On Mon, Nov 30, 2015 at 7:51 AM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> General thoughts on this:
>> 1. If there's a write error, we fail unconditionally right now.  It would be
>> nice to have a configurable number of retries before failing.
>
> I'm unconvinced. I pretty much immediately do not trust a block device
> that fails even a single write, and I'd expect the file system to
> quickly get confused if it can't rely on flushing pending writes to
> that device. Unless Btrfs gets into the business of tracking bad
> sectors (failed writes), the block device is a goner upon a single
> write failure, although it could still be reliable for reads.
I've had multiple cases of disks that got one write error and then were
fine for more than a year before any further issues.  My thought is to
add an option to retry that single write after some short delay (1-2s
maybe), and if it still fails, then mark the disk as failed.  This would
give an option to people like me who don't want to have to replace a
disk immediately when it hits a write error.  (Possibly add a counter as
well, so that if we get another write error within a given period of
time, we just kick the disk instead of retrying.)  Transient errors do
happen, and in some cases more often than people would expect.  We
should reasonably account for this.
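
To make the proposal concrete, here is a minimal sketch of that policy,
written in Python purely for illustration.  Nothing here is btrfs code:
the class name, the parameters (retry_delay, quiet_period) and the
do_write callback are all hypothetical stand-ins for whatever the kernel
side would actually use.

```python
import time
from typing import Callable, Optional


class WriteRetryPolicy:
    """Hypothetical per-device policy: retry a failed write once after a
    short delay, and fail the device if the retry also fails or if a new
    error arrives too soon after the previous one."""

    def __init__(self, retry_delay: float = 1.5, quiet_period: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.retry_delay = retry_delay    # seconds to wait before the retry
        self.quiet_period = quiet_period  # errors closer together than this kick the device
        self.clock = clock
        self.sleep = sleep
        self.last_error: Optional[float] = None
        self.failed = False

    def write(self, do_write: Callable[[], bool]) -> bool:
        """Call do_write(); return True if the write eventually succeeded."""
        if self.failed:
            return False
        if do_write():
            return True

        now = self.clock()
        recent = (self.last_error is not None
                  and now - self.last_error < self.quiet_period)
        self.last_error = now
        if recent:
            # Another error inside the quiet period: kick without retrying.
            self.failed = True
            return False

        # First error in a while: wait briefly, then retry exactly once.
        self.sleep(self.retry_delay)
        if do_write():
            return True
        self.failed = True
        return False
```

A caller would keep one such object per device and route each write
through write(); once failed is set, the device is treated as failed.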

This discussion actually brings to mind the rather annoying behavior of
some of the proprietary NAS systems we have where I work.  They check
SMART attributes on a regular basis, and if any attribute the disk
firmware marks as pre-failure changes at all, they kick the disk from
the RAID array.  They only kick on a change though, so you can just
disconnect and reconnect the disk, and it is accepted as a new disk as
long as the attribute didn't cross the threshold the disk firmware
lists.  (I discovered this rather short-sighted behavior by accident,
but I've used the old disks in other systems just fine for months with
no issue whatsoever.)
>
> Possibly reasonable is the user indicating a preference for what
> happens after the max number of write failures is exceeded:
>
> - Volume goes degraded: Faulty block device is ignored entirely,
> degraded writes permitted.
> - Volume goes ro: Faulty block device is still used for reads,
> degraded writes not permitted.
>
> As far as I know, md and lvm only do the former. And md/mdadm did
> recently get the ability to support bad block maps so it can continue
> using drives lacking reserve sectors (typically that's the reason for
> write failures on conventional rotational drives).
>
>
>
>> 2. Similar for read errors, possibly with the ability to ignore them below
>> some threshold.
>
> Agreed. Maybe it would be an error rate (set by ratio)?
>
I was thinking of either:
a. A running count, using the current error counting mechanisms, with 
some max number allowed before the device gets kicked.
b. A count that decays over time.  This would need two tunables (how
long an error is counted, and how many are allowed).
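
Both options can be sketched with one small structure.  This is only an
illustration in Python, not the existing btrfs error counters: the class
name and the two tunables (max_errors, window) are hypothetical.  Option
(a) falls out as the special case where the window never expires.

```python
import time
from collections import deque
from typing import Callable, Deque


class DecayingErrorCounter:
    """Hypothetical error counter: each error is counted for `window`
    seconds, and the device should be kicked once more than
    `max_errors` errors are being counted at the same time."""

    def __init__(self, max_errors: int, window: float = float("inf"),
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_errors = max_errors  # how many errors are allowed
        self.window = window          # how long an error is counted
        self.clock = clock
        self._stamps: Deque[float] = deque()

    def _expire(self, now: float) -> None:
        # Drop errors that are older than the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()

    def record_error(self) -> bool:
        """Record one error; return True if the device should be kicked."""
        now = self.clock()
        self._stamps.append(now)
        self._expire(now)
        return len(self._stamps) > self.max_errors

    def count(self) -> int:
        """Number of errors currently being counted."""
        self._expire(self.clock())
        return len(self._stamps)
```

With the default window of infinity, errors never expire and this is the
plain running count from (a); with a finite window it behaves as in (b).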




