Re: Expected behavior of bad sectors on one drive in a RAID1


From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Russell Coker <russell@coker.com.au>,
	james harvey <jamespharvey20@gmail.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Expected behavior of bad sectors on one drive in a RAID1
Date: Tue, 20 Oct 2015 09:00:59 -0400
Message-ID: <56263B0B.4050502@gmail.com>
In-Reply-To: <201510201545.50705.russell@coker.com.au>

On 2015-10-20 00:45, Russell Coker wrote:
> On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote:
>> sda appears to be going bad, with my low threshold of "going bad", and
>> will be replaced ASAP.  It just developed 16 reallocated sectors, and
>> has 40 current pending sectors.
>>
>> I'm currently running a "btrfs scrub start -B -d -r /terra", which
>> status on another term shows me has found 32 errors after running for
>> an hour.
>
> https://www.gnu.org/software/ddrescue/
>
> At this stage I would use ddrescue or something similar to copy data from the
> failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing
> data.
>
> I wouldn't remove the disk entirely because then you lose badly if you get
> another failure.  I wouldn't use a BTRFS replace because you already have the
> system apart and I expect ddrescue could copy the data faster.  Also as the
> drive has been causing system failures (I'm guessing a problem with the power
> connector) you REALLY don't want BTRFS to corrupt data on the other disks.  If
> you have a system with the failing disk and a new disk attached then there's
> no risk of further contamination.
BIG DISCLAIMER: For the filesystem to be safely mountable it is 
ABSOLUTELY NECESSARY to remove the old disk after doing a block-level 
copy of it.  By all means, keep the disk around, but do not keep it 
visible to the kernel after doing a block-level copy of it.  Also, you 
will probably have to run 'btrfs device scan' after copying the disk and 
removing it for the filesystem to work right.  This is an inherent 
result of how BTRFS's multi-device functionality works (member devices 
are identified by the UUIDs stored on them, and a block-level copy 
carries identical ones), and it also applies to things like LVM 
snapshots of BTRFS filesystems.
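
To make the duplicate-identity problem concrete, here is a minimal 
sketch in Python.  It does not read real superblocks; the device paths, 
the "FSID-A" value and the find_duplicate_members() helper are all 
made up for illustration.  It only shows the check you would want to 
pass before mounting: no two visible devices should claim to be the 
same member of the same filesystem.

```python
from collections import defaultdict


def find_duplicate_members(devices):
    """Return {(fs_uuid, devid): [paths]} for identities claimed more than once.

    `devices` is an iterable of (path, fs_uuid, devid) tuples.  In this
    sketch the tuples are supplied by hand; nothing is read from disk.
    """
    seen = defaultdict(list)
    for path, fs_uuid, devid in devices:
        seen[(fs_uuid, devid)].append(path)
    # Only identities claimed by more than one device are a problem.
    return {key: paths for key, paths in seen.items() if len(paths) > 1}


# Hypothetical state right after a block-level copy of sda onto sdc,
# with the old disk still attached.
devices = [
    ("/dev/sda", "FSID-A", 1),  # failing original
    ("/dev/sdb", "FSID-A", 2),  # healthy mirror
    ("/dev/sdc", "FSID-A", 1),  # block-level copy of sda
]

conflicts = find_duplicate_members(devices)
for (fs_uuid, devid), paths in conflicts.items():
    print(f"filesystem {fs_uuid}, devid {devid} claimed by: {', '.join(paths)}")
```

Once the old disk is no longer visible to the kernel, the same check 
on the remaining devices comes back empty.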
>
>> Question 2 - Before having ran the scrub, booting off the raid with
>> bad sectors, would btrfs "on the fly" recognize it was getting bad
>> sector data with the checksum being off, and checking the other
>> drives?  Or, is it expected that I could get a bad sector read in a
>> critical piece of operating system and/or kernel, which could be
>> causing my lockup issues?
>
> Unless you have disabled CoW then BTRFS will not return bad data.
It is also worth clarifying that:
a. While BTRFS will not return bad data in this case, it also won't 
automatically repair the corruption.
b. In the unlikely event that both copies are bad, trying to read the 
data will return an I/O error.
c. It is theoretically possible (although statistically very unlikely) 
that a block could become corrupted while its checksum still matches.  
CRC32c is good at detecting small errors, but it is not hard to 
generate a collision for any arbitrary value, so if a large portion of 
the block goes bad, the damaged data can in theory still have a valid 
checksum.
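
The read behaviour described in the points above can be sketched 
roughly as follows.  This is an illustration in Python, not the kernel 
code: read_mirrored() is a made-up helper, the copies are plain byte 
strings rather than disk reads, and crc32c() is the standard CRC-32C 
(Castagnoli) computed bit by bit, which is not claimed to match the 
on-disk checksum handling byte for byte.

```python
import errno


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def read_mirrored(copies, expected_csum):
    """Return the first copy whose checksum matches the stored one.

    If no copy matches, raise an I/O error instead of returning bad data.
    Note that nothing here rewrites the bad copy.
    """
    for data in copies:
        if crc32c(data) == expected_csum:
            return data
    raise OSError(errno.EIO, "all copies failed checksum verification")


good = b"x" * 4096
bad = b"y" + good[1:]          # same block with one damaged byte
stored_csum = crc32c(good)     # checksum recorded when the block was written

# One bad copy: the good mirror is returned.
result = read_mirrored([bad, good], stored_csum)

# Both copies bad: the read fails with EIO.
try:
    read_mirrored([bad, bad], stored_csum)
except OSError as exc:
    both_bad_errno = exc.errno
```

Point c corresponds to the case where a damaged copy happens to 
produce the same 32-bit value as stored_csum, in which case this check 
would accept it.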
>
>> Question 3 - Probably doesn't matter, but how can I see which files
>> (or metadata to files) the 40 current bad sectors are in?  (On extX,
>> I'd use tune2fs and debugfs to be able to see this information.)
>
> Read all the files in the system and syslog will report it.  But really don't
> do that until after you have copied the disk.
It may also be possible to use some of the debug tools from BTRFS to do 
this without hitting the disks so hard, but it will likely take a lot 
more effort.
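
For the "read all the files" approach, a minimal Python sketch might 
look like the following.  find_unreadable_files() and its parameters 
are made up for illustration; it simply reads every regular file under 
a directory and records the ones where the read fails, which is how a 
block with no good copy would show up (the kernel log should carry the 
details).  As noted above, this hits the disks hard, so it is best 
left until after the copy.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Read every regular file under `root`; return [(path, errno)] for failures."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as f:
                    # Read to the end in chunks; the data itself is discarded.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                failed.append((path, exc.errno))
    return failed


if __name__ == "__main__":
    import sys

    # The directory to scan is taken from the command line.
    for path, err in find_unreadable_files(sys.argv[1]):
        print(f"{path}: {os.strerror(err)}")
```

Note that os.walk() will descend into other filesystems mounted below 
the starting directory, so the result may include files that are not on 
the array in question.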

