linux-btrfs.vger.kernel.org archive mirror
From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Russell Coker <russell@coker.com.au>
Cc: james harvey <jamespharvey20@gmail.com>, linux-btrfs@vger.kernel.org
Subject: Re: Expected behavior of bad sectors on one drive in a RAID1
Date: Tue, 20 Oct 2015 09:59:17 -0400	[thread overview]
Message-ID: <562648B5.2020401@gmail.com> (raw)
In-Reply-To: <201510210015.54337.russell@coker.com.au>


On 2015-10-20 09:15, Russell Coker wrote:
> On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
>>> https://www.gnu.org/software/ddrescue/
>>>
>>> At this stage I would use ddrescue or something similar to copy data from
>>> the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
>>> the missing data.
>>>
>>> I wouldn't remove the disk entirely because then you lose badly if you
>>> get another failure.  I wouldn't use a BTRFS replace because you already
>>> have the system apart and I expect ddrescue could copy the data faster.
>>> Also as the drive has been causing system failures (I'm guessing a
>>> problem with the power connector) you REALLY don't want BTRFS to corrupt
>>> data on the other disks.  If you have a system with the failing disk and
>>> a new disk attached then there's no risk of further contamination.
>>
>> BIG DISCLAIMER: For the filesystem to be safely mountable it is
>> ABSOLUTELY NECESSARY to remove the old disk after doing a block level
>
> You are correct, my message wasn't clear.
>
> What I meant to say is that doing a "btrfs device remove" or "btrfs replace"
> is generally a bad idea in such a situation.  "btrfs replace" is pretty good
> if you are replacing a disk with a larger one or replacing a disk that has
> only minor errors (a disk that just gets a few bad sectors is unlikely to get
> many more in a hurry).
I kind of figured that was what you meant, I just wanted to make it as 
clear as possible, because this is something that has bitten me in the 
past.  It's worth noting though that 'btrfs replace' has an option (the 
'-r' flag) to avoid reading from the device being replaced if at all 
possible.  I've used that option myself a couple of times when 
re-provisioning my systems, and it works well (although I used it just 
to control which disks were getting IO sent to them, not because any of 
them were bad).
>
>> copy of it.  By all means, keep the disk around, but do not keep it
>> visible to the kernel after doing a block level copy of it.  Also, you
>> will probably have to run 'btrfs device scan' after copying the disk and
>> removing it for the filesystem to work right.  This is an inherent
>> result of how BTRFS's multi-device functionality works, and also applies
>> to doing stuff like LVM snapshots of BTRFS filesystems.
>
> Good advice.  I recommend just rebooting the system.  I think that anyone
> who has the background knowledge to do such things without rebooting will
> probably just do it without needing to ask us for advice.
Normally I would agree, but given the boot issues that were mentioned 
WRT the system in question, it may be safer to just use 'btrfs dev scan' 
without rebooting (unless of course the system doesn't properly support 
SATA hot-plug/hot-remove).
>
>>>> Question 2 - Before having ran the scrub, booting off the raid with
>>>> bad sectors, would btrfs "on the fly" recognize it was getting bad
>>>> sector data with the checksum being off, and checking the other
>>>> drives?  Or, is it expected that I could get a bad sector read in a
>>>> critical piece of operating system and/or kernel, which could be
>>>> causing my lockup issues?
>>>
>>> Unless you have disabled CoW then BTRFS will not return bad data.
>>
>> It is worth clarifying also that:
>> a. While BTRFS will not return bad data in this case, it also won't
>> automatically repair the corruption.
>
> Really?  If so I think that's a bug in BTRFS.  When mounted rw I think that
> every time corruption is discovered it should be automatically fixed.
That's debatable.  While it is safer to try to do this with BTRFS than 
with, say, MD-RAID, it's still not something many seasoned system 
administrators would want happening behind their back.  It's worth 
noting that ZFS does not automatically fix errors either; it just 
reports them and works around them, and many distributed storage options 
(Ceph, for example) behave like this as well.  All that a checksum 
mismatch really tells you is that the data got corrupted at some point: 
it could be that the copy on the disk is bad, but it could also be 
caused by bad RAM, a bad storage controller, a loose cable, or even a 
bad power supply.
>
>> b. In the unlikely event that both copies are bad, trying to read the
>> data will return an IO error.
>> c. It is theoretically possible (although statistically impossible) that
>> the block could become corrupted, but the checksum could still be
>> correct (CRC32c is good at detecting small errors, but it's not hard to
>> generate a hash collision for any arbitrary value, so if a large portion
>> of the block goes bad, then it can theoretically still have a valid
>> checksum).
>
> It would be interesting to see some research into how CRC32 fits with the more
> common disk errors.  For a disk to return bad data and claim it to be good the
> data must either be a misplaced write or read (which is almost certain to be
> caught by BTRFS as the metadata won't match), or a random sector that matches
> the disk's CRC.  Is generating a hash collision for a CRC32 inside a CRC
> protected block much more difficult?
In general, most disk errors will be just a few flipped bits.  For a 
single bit flip in a data stream, a CRC is guaranteed to change; the 
same goes for any odd number of bit flips in the data stream.  For an 
even number of bit flips, however, a collision becomes possible, and the 
chance of one shrinks with the width of the CRC (roughly 2^-32 for a 
32-bit CRC on random corruption).  For just two flipped bits it's a 
statistical impossibility that there will be a collision without some 
malicious intent involved.  Once you get to larger numbers of bit flips 
and bigger blocks of data, it becomes more likely.  The chance of a 
collision in a 4k block with any random set of bit flips is 
astronomically small, and it's only marginally larger with 16k blocks 
(which are the current default node size for BTRFS).
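For anyone who wants to see this concretely, here's a quick sketch using 
Python's zlib.crc32 (plain CRC-32 rather than the CRC32c that BTRFS 
uses, but the properties being discussed are the same): every single-bit 
flip in a 4k block changes the checksum, yet a brute-force search over 
random messages finds a colliding pair after only a few tens of 
thousands of tries, just as the birthday bound predicts.

```python
import os
import zlib

# A CRC-32 is guaranteed to change when exactly one bit of the input
# flips, no matter which bit it is.
block = bytearray(os.urandom(4096))
original = zlib.crc32(bytes(block))
for bit in range(0, len(block) * 8, 131):  # sample of bit positions
    byte, mask = bit // 8, 1 << (bit % 8)
    block[byte] ^= mask
    assert zlib.crc32(bytes(block)) != original
    block[byte] ^= mask  # restore the flipped bit

# With only 2^32 possible checksums, random inputs collide after roughly
# 2^16 attempts (the birthday bound), so "hard to hit by accident" is
# very different from "hard to construct".
seen = {}
for attempt in range(1_000_000):
    msg = os.urandom(8)
    crc = zlib.crc32(msg)
    if crc in seen and seen[crc] != msg:
        print(f"collision after {attempt + 1} random messages: "
              f"{seen[crc].hex()} / {msg.hex()} -> {crc:#010x}")
        break
    seen[crc] = msg
else:
    raise AssertionError("no collision found (astronomically unlikely)")
```

(Of course, random corruption on a disk is not the same as random whole 
messages; the point is just how small a 32-bit checksum space really is.)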
>
>>>> Question 3 - Probably doesn't matter, but how can I see which files
>>>> (or metadata to files) the 40 current bad sectors are in?  (On extX,
>>>> I'd use tune2fs and debugfs to be able to see this information.)
>>>
>>> Read all the files in the system and syslog will report it.  But really
>>> don't do that until after you have copied the disk.
>>
>> It may also be possible to use some of the debug tools from BTRFS to do
>> this without hitting the disks so hard, but it will likely take a lot
>> more effort.
>
> I don't think that you can do that without hitting the disks hard.
Ah, you're right, I forgot that there's no way on most hard disks to get 
the LBAs of the reallocated sectors, which would be required to use the 
debug tools to map them back to files.
>
> That said last time I checked (last time an executive of a hard drive
> manufacturer was willing to talk to me) drives were apparently designed to
> perform any sequence of operations for their warranty period.  So for a disk
> that is believed to be good this shouldn't be a problem.  For a disk that is
> known to be dying it would be a really bad idea to do anything other than copy
> the data off at maximum speed.
Well yes, but the less stress you put on something, the longer it's 
likely to last.  And if you actually care about the data, you should 
have backups (or some other way of trivially reproducing it).




Thread overview: 15+ messages
2015-10-20  4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey
2015-10-20  4:45 ` Russell Coker
2015-10-20 13:00   ` Austin S Hemmelgarn
2015-10-20 13:15     ` Russell Coker
2015-10-20 13:59       ` Austin S Hemmelgarn [this message]
2015-10-20 19:20         ` Duncan
2015-10-20 19:59           ` Austin S Hemmelgarn
2015-10-20 20:54             ` Tim Walberg
2015-10-21 11:51             ` Austin S Hemmelgarn
2015-10-21 12:07               ` Austin S Hemmelgarn
2015-10-21 16:01                 ` Chris Murphy
2015-10-21 17:28                   ` Austin S Hemmelgarn
2015-10-20 18:54 ` Duncan
2015-10-20 19:48   ` Austin S Hemmelgarn
2015-10-20 21:24     ` Duncan
