From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-io0-f179.google.com ([209.85.223.179]:34317 "EHLO
 mail-io0-f179.google.com" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1752497AbbJTN76 (ORCPT );
 Tue, 20 Oct 2015 09:59:58 -0400
Received: by iow1 with SMTP id 1so22729230iow.1
 for ; Tue, 20 Oct 2015 06:59:57 -0700 (PDT)
Subject: Re: Expected behavior of bad sectors on one drive in a RAID1
To: Russell Coker
References: <201510201545.50705.russell@coker.com.au>
 <56263B0B.4050502@gmail.com> <201510210015.54337.russell@coker.com.au>
Cc: james harvey , linux-btrfs@vger.kernel.org
From: Austin S Hemmelgarn
Message-ID: <562648B5.2020401@gmail.com>
Date: Tue, 20 Oct 2015 09:59:17 -0400
MIME-Version: 1.0
In-Reply-To: <201510210015.54337.russell@coker.com.au>
Content-Type: multipart/signed; protocol="application/pkcs7-signature";
 micalg=sha-512; boundary="------------ms040200070107050606000103"
Sender: linux-btrfs-owner@vger.kernel.org
List-ID:

This is a cryptographically signed message in MIME format.

--------------ms040200070107050606000103
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: quoted-printable

On 2015-10-20 09:15, Russell Coker wrote:
> On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
>>> https://www.gnu.org/software/ddrescue/
>>>
>>> At this stage I would use ddrescue or something similar to copy data from
>>> the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
>>> the missing data.
>>>
>>> I wouldn't remove the disk entirely because then you lose badly if you
>>> get another failure. I wouldn't use a BTRFS replace because you already
>>> have the system apart and I expect ddrescue could copy the data faster.
>>> Also as the drive has been causing system failures (I'm guessing a
>>> problem with the power connector) you REALLY don't want BTRFS to corrupt
>>> data on the other disks.
>>> If you have a system with the failing disk and
>>> a new disk attached then there's no risk of further contamination.
>>
>> BIG DISCLAIMER: For the filesystem to be safely mountable it is
>> ABSOLUTELY NECESSARY to remove the old disk after doing a block level
>
> You are correct, my message wasn't clear.
>
> What I meant to say is that doing a "btrfs device remove" or "btrfs replace"
> is generally a bad idea in such a situation. "btrfs replace" is pretty good
> if you are replacing a disk with a larger one or replacing a disk that has
> only minor errors (a disk that just gets a few bad sectors is unlikely to get
> many more in a hurry).

I kind of figured that was what you meant, I just wanted to make it as
clear as possible, because this is something that has bitten me in the
past. It's worth noting though that there is an option for 'btrfs
replace' (the -r flag to 'btrfs replace start') to avoid reading from
the device being replaced if at all possible. I've used that option
myself a couple of times when re-provisioning my systems, and it works
well (although I used it to just control what disks were getting IO
sent to them, not because any of them were bad).
>
>> copy of it. By all means, keep the disk around, but do not keep it
>> visible to the kernel after doing a block level copy of it. Also, you
>> will probably have to run 'btrfs device scan' after copying the disk and
>> removing it for the filesystem to work right. This is an inherent
>> result of how BTRFS's multi-device functionality works, and also applies
>> to doing stuff like LVM snapshots of BTRFS filesystems.
>
> Good advice. I recommend just rebooting the system. I think that if anyone
> who has the background knowledge to do such things without rebooting will
> probably just do it without needing to ask us for advice.

Normally I would agree, but given the boot issues that were mentioned
WRT the system in question, it may be safer to just use 'btrfs dev scan'
without rebooting (unless of course the system doesn't properly support
SATA hot-plug/hot-remove).
>
>>>> Question 2 - Before having ran the scrub, booting off the raid with
>>>> bad sectors, would btrfs "on the fly" recognize it was getting bad
>>>> sector data with the checksum being off, and checking the other
>>>> drives? Or, is it expected that I could get a bad sector read in a
>>>> critical piece of operating system and/or kernel, which could be
>>>> causing my lockup issues?
>>>
>>> Unless you have disabled CoW then BTRFS will not return bad data.
>>
>> It is worth clarifying also that:
>> a. While BTRFS will not return bad data in this case, it also won't
>> automatically repair the corruption.
>
> Really? If so I think that's a bug in BTRFS. When mounted rw I think that
> every time corruption is discovered it should be automatically fixed.

That's debatable. While it is safer to try and do this with BTRFS than
say with MD-RAID, it's still not something many seasoned system
administrators would want happening behind their back. It's worth
noting that ZFS does not automatically fix errors, it just reports them
and works around them, and many distributed storage options (like Ceph
for example) behave like this also. All that the checksum mismatch
really tells you is that at some point, the data got corrupted, it could
be that the copy on the disk is bad, but it could also be caused by bad
RAM, a bad storage controller, a loose cable, or even a bad power supply.
>
>> b. In the unlikely event that both copies are bad, trying to read the
>> data will return an IO error.
>> c.
>> It is theoretically possible (although statistically impossible) that
>> the block could become corrupted, but the checksum could still be
>> correct (CRC32c is good at detecting small errors, but it's not hard to
>> generate a hash collision for any arbitrary value, so if a large portion
>> of the block goes bad, then it can theoretically still have a valid
>> checksum).
>
> It would be interesting to see some research into how CRC32 fits with the more
> common disk errors. For a disk to return bad data and claim it to be good the
> data must either be a misplaced write or read (which is almost certain to be
> caught by BTRFS as the metadata won't match), or a random sector that matches
> the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC
> protected block much more difficult?

In general, most disk errors will be just a few flipped bits. For a
single bit flip in a data stream, a CRC is 100% guaranteed to change,
and with CRC32c the same goes for any odd number of bit flips in the
data stream (its polynomial has an even number of terms). For an even
number of bit flips however, the chance that there will be a collision
shrinks as the CRC gets wider, and for 32 bits it's a statistical
impossibility that there will be a collision due to two bits flipping
without there being some malicious intent involved. Once you get to
larger numbers of bit flips and bigger blocks of data, it becomes more
likely. The chances of a collision with a 4k block with any random set
of bit flips are astronomically small, and they're only marginally
larger with 16k blocks (the default metadata block size right now for
BTRFS).
>
>>>> Question 3 - Probably doesn't matter, but how can I see which files
>>>> (or metadata to files) the 40 current bad sectors are in? (On extX,
>>>> I'd use tune2fs and debugfs to be able to see this information.)
>>>
>>> Read all the files in the system and syslog will report it. But really
>>> don't do that until after you have copied the disk.
>>
>> It may also be possible to use some of the debug tools from BTRFS to do
>> this without hitting the disks so hard, but it will likely take a lot
>> more effort.
>
> I don't think that you can do that without hitting the disks hard.

Ah, you're right, I forgot that there's no way on most hard disks to get
the LBAs of the reallocated sectors, which would be required to use the
debug tools to get the files.
>
> That said last time I checked (last time an executive of a hard drive
> manufacturer was willing to talk to me) drives were apparently designed to
> perform any sequence of operations for their warranty period. So for a disk
> that is believed to be good this shouldn't be a problem. For a disk that is
> known to be dying it would be a really bad idea to do anything other than copy
> the data off at maximum speed.

Well yes, but the less stress you put on something, the longer it's
likely to last. And if you actually care about the data, you should
have backups (or some other way of trivially reproducing it).

--------------ms040200070107050606000103--