Re: raid10 corruption while removing failing disk

Linux Btrfs filesystem development

From: "Agustín DallʼAlba" <agustin@dallalba.com.ar>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: raid10 corruption while removing failing disk
Date: Tue, 11 Aug 2020 02:06:02 -0300
Message-ID: <dc0bea2ee916ce4d1a53fe59869b7b7d8868f617.camel@dallalba.com.ar>
In-Reply-To: <CAJCQCtReHKtyjHL2SXZXeZ4TwdXf-Ag2KysSS0Oan5ZDMzm8OQ@mail.gmail.com>

On Mon, 2020-08-10 at 20:34 -0600, Chris Murphy wrote:
> On Mon, Aug 10, 2020 at 1:03 AM Agustín DallʼAlba
> <agustin@dallalba.com.ar> wrote:
> > Hello!
> > 
> > The last quarterly scrub on our btrfs filesystem found a few bad
> > sectors in one of its devices (/dev/sdd), and because there's nobody on
> > site to replace the failing disk I decided to remove it from the array
> > with `btrfs device remove` before the problem could get worse.
> 
> It doesn't much matter if it gets worse, because you still have
> redundancy on that dying drive until the moment it's completely toast.
> And btrfs doesn't care if it's spewing read errors. 

By 'get worse', I mean another drive failing, and then we'd definitely
lose data. Because of the pandemic there was (and still is) nobody on
site to replace the drive, and I won't be able to go there for who
knows how many months.

> Do you have a complete dmesg for this time period? Because (a) bad
> sectors should not exist on a recently scrubbed system (b) even if
> they do exist, during device removal it's a read error like any other
> time, and btrfs grabs the copy instead. Slowness suggests to me there
> is a timing mismatch between SCT ERC and the default SCSI command
> timer. It leads to lengthy delays and prevents bad sectors from being
> properly fixed.

I have a _partial_ dmesg of this time period, with a lot of gaps
between reboots. I'll send it to you without ccing the list. The
failing drive is an atrocious WD Green for which I forgot to set the
idle3 timer. It doesn't support SCT ERC, and lately it just hangs
forever and requires a power cycle, so there's no way around the
slowness. It was added in a pinch a year ago because we needed more
space. I probably should have asked someone to disconnect it and then
used 'remove missing'.
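
For anyone following along, the timing mismatch discussed above comes
down to one comparison: the drive's error recovery limit (SCT ERC,
reported in tenths of a second) against the kernel's SCSI command timer
(in seconds, normally readable from /sys/block/<dev>/device/timeout).
Below is a minimal Python sketch of that comparison. The function names
are hypothetical and not part of any tool, and treating a drive without
SCT ERC as always mismatched is a conservative assumption, since its
recovery time has no known upper bound.

```python
from pathlib import Path
from typing import Optional


def read_scsi_timeout(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the kernel's SCSI command timer for `device` (e.g. "sdd"), in seconds.

    `sysfs_root` is a parameter only so the function can be exercised
    against a fake directory tree; on a real system the default applies.
    """
    return int(Path(sysfs_root, device, "device", "timeout").read_text().strip())


def timers_mismatched(erc_deciseconds: Optional[int], scsi_timeout_seconds: int) -> bool:
    """Return True when the drive may still be retrying a read as the kernel gives up.

    `erc_deciseconds` is the SCT ERC limit in tenths of a second, or None
    when the drive does not support SCT ERC or has it disabled.
    """
    if erc_deciseconds is None:
        # Assumption: with no ERC limit the drive's recovery time is
        # unbounded, so flag it regardless of the kernel timer.
        return True
    # The drive should give up (and report the read error) before the
    # kernel's command timer expires and resets the link.
    return erc_deciseconds / 10 >= scsi_timeout_seconds
```

With a 7 second ERC limit and a 30 second command timer,
`timers_mismatched(70, 30)` reports no mismatch. A drive like this one,
with no SCT ERC at all, is flagged whatever the timer is set to; the
usual workaround in that case is to raise the command timer rather than
shorten the drive's recovery.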

> > # btrfs check --force --readonly /dev/sda
> > WARNING: filesystem mounted, continuing because of --force
> > Checking filesystem on /dev/sda
> > UUID: 4d3acf20-d408-49ab-b0a6-182396a9f27c
> > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
> > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
> 
> So they aren't at all the same, that's unexpected.

What do you mean by this?

> My advice is to mount ro, backup (or two copies for important info),
> and start with a new Btrfs file system and restore. It's not worth
> repairing.

Sigh, I was expecting I'd have to do this. At least no data was lost,
and the system still functions even though it's read-only. Do you think
check --repair is not worth trying? Everything of value is already
backed up, but restoring it would take many hours of work.

Thanks for all the information, I hope you have a good day.
