From: "Agustín DallʼAlba" <agustin@dallalba.com.ar>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: raid10 corruption while removing failing disk
Date: Tue, 11 Aug 2020 02:06:02 -0300
Message-ID: <dc0bea2ee916ce4d1a53fe59869b7b7d8868f617.camel@dallalba.com.ar>
In-Reply-To: <CAJCQCtReHKtyjHL2SXZXeZ4TwdXf-Ag2KysSS0Oan5ZDMzm8OQ@mail.gmail.com>
On Mon, 2020-08-10 at 20:34 -0600, Chris Murphy wrote:
> On Mon, Aug 10, 2020 at 1:03 AM Agustín DallʼAlba
> <agustin@dallalba.com.ar> wrote:
> > Hello!
> >
> > The last quarterly scrub on our btrfs filesystem found a few bad
> > sectors in one of its devices (/dev/sdd), and because there's nobody on
> > site to replace the failing disk I decided to remove it from the array
> > with `btrfs device remove` before the problem could get worse.
>
> It doesn't much matter if it gets worse, because you still have
> redundancy on that dying drive until the moment it's completely toast.
> And btrfs doesn't care if it's spewing read errors.
By 'get worse', I mean another drive failing, and then we'd definitely
lose data. Because of the pandemic there was (and still is) nobody on
site to replace the drive, and I won't be able to go there for who
knows how many months.
> Do you have a complete dmesg for this time period? Because (a) bad
> sectors should not exist on a recently scrubbed system (b) even if
> they do exist, during device removal it's a read error like any other
> time, and btrfs grabs the copy instead. Slowness suggests to me there
> is a timing mismatch between SCT ERC and the default SCSI command
> timer. It leads to lengthy delays and prevents bad sectors from being
> properly fixed.
I have a _partial_ dmesg of this time period. It's got a lot of gaps in
between reboots. I'll send it to you without CCing the list. The
failing drive is an atrocious WD Green for which I forgot to set the
idle3 timer; it doesn't support SCT ERC and lately just hangs forever
until it is power cycled. So there's no way around the slowness. It
was added in a pinch a year ago because we needed more space. I
probably should have asked someone to disconnect it and used 'remove
missing'.
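A minimal sketch of the timing mismatch mentioned above, under stated
assumptions: the kernel's per-device SCSI command timer is read from
/sys/block/<dev>/device/timeout (30 seconds by default), and SCT ERC
values are expressed in tenths of a second, as smartctl reports them.
The function names and the 120-second fallback for drives without
SCT ERC are hypothetical and chosen only for illustration; a drive
that hangs until power cycled, like the one in this thread, would not
be helped by any timer value.

```python
from pathlib import Path
from typing import Optional

# Kernel default for the SCSI command timer, in seconds.
DEFAULT_SCSI_TIMEOUT_S = 30


def read_scsi_timeout(device: str, sysfs_root: str = "/sys/block") -> int:
    """Return the SCSI command timer (seconds) for a block device such as 'sdd'."""
    timeout_file = Path(sysfs_root, device, "device", "timeout")
    return int(timeout_file.read_text().strip())


def has_timeout_mismatch(
    scsi_timeout_s: float,
    erc_deciseconds: Optional[int],
    assumed_recovery_without_erc_s: float = 120.0,  # assumption, not a measured value
) -> bool:
    """Report whether the drive may still be retrying when the kernel gives up.

    erc_deciseconds is the SCT ERC limit in units of 100 ms, or None when
    the drive does not support SCT ERC or has it disabled.
    """
    if erc_deciseconds is None:
        # No bound on internal error recovery; fall back to an assumed worst case.
        recovery_s = assumed_recovery_without_erc_s
    else:
        recovery_s = erc_deciseconds / 10
    # A mismatch exists when the drive can take as long as, or longer than,
    # the kernel is willing to wait for the command.
    return recovery_s >= scsi_timeout_s


if __name__ == "__main__":
    # Drive without SCT ERC against the default 30-second timer.
    print(has_timeout_mismatch(DEFAULT_SCSI_TIMEOUT_S, None))
    # Drive with a 7-second ERC limit against the default timer.
    print(has_timeout_mismatch(DEFAULT_SCSI_TIMEOUT_S, 70))
```

With these assumptions, a drive lacking SCT ERC is flagged under the
default timer, while one limited to 7 seconds is not. Raising the
timer above the assumed recovery time clears the flag, but only if the
drive eventually answers.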
> > # btrfs check --force --readonly /dev/sda
> > WARNING: filesystem mounted, continuing because of --force
> > Checking filesystem on /dev/sda
> > UUID: 4d3acf20-d408-49ab-b0a6-182396a9f27c
> > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
> > checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266
>
> So they aren't at all the same, that's unexpected.
What do you mean by this?
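A small sketch, not from the thread, that parses the `checksum verify
failed` lines quoted above and counts how many bits differ between the
found and wanted values. One possible reading of the remark, and this
is only an assumption, is that the two checksums differ in far more
than a single bit. The regular expression and the function names are
hypothetical and based solely on the two lines shown here.

```python
import re

# Pattern based only on the output lines quoted in this message.
CSUM_LINE = re.compile(
    r"checksum verify failed on (\d+) found ([0-9A-Fa-f]+) wanted ([0-9A-Fa-f]+)"
)


def parse_csum_failure(line: str):
    """Return (bytenr, found, wanted) as integers, or None if the line does not match."""
    match = CSUM_LINE.search(line)
    if match is None:
        return None
    bytenr, found, wanted = match.groups()
    return int(bytenr), int(found, 16), int(wanted, 16)


def differing_bits(found: int, wanted: int) -> int:
    """Count the bit positions where the two checksum values disagree."""
    return bin(found ^ wanted).count("1")


if __name__ == "__main__":
    line = "checksum verify failed on 10919566688256 found BAB1746E wanted A8A48266"
    bytenr, found, wanted = parse_csum_failure(line)
    print(bytenr, differing_bits(found, wanted))
```

For the values above the two checksums differ in 12 of 32 bits. A
checksum is expected to change substantially whenever the underlying
block changes, so the count alone does not say what went wrong.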
> My advice is to mount ro, backup (or two copies for important info),
> and start with a new Btrfs file system and restore. It's not worth
> repairing.
Sigh, I was expecting I'd have to do this. At least no data was lost,
and the system still functions even though it's read-only. Do you think
check --repair is not worth trying? Everything of value is already
backed up, but restoring it would take many hours of work.
Thanks for all the information, I hope you have a good day.