* Expected behavior of bad sectors on one drive in a RAID1
@ 2015-10-20 4:16 james harvey
2015-10-20 4:45 ` Russell Coker
2015-10-20 18:54 ` Duncan
0 siblings, 2 replies; 15+ messages in thread
From: james harvey @ 2015-10-20 4:16 UTC (permalink / raw)
To: linux-btrfs
Background -----
My fileserver had a "bad event" last week. Shut it down normally to
add a new hard drive, and it would no longer post. Tried about 50
times, doing the typical everything non-essential unplugged, trying 1
of 4 memory modules at a time, and 1 of 2 processors at a time. Got
nowhere.
Inexpensive HP workstation, so purchased a used identical model
(complete other than hard drives) on eBay. Replacement arrived today.
Posts fine. Moved hard drives over (again, identical model, and Arch
Linux not Windows) and it started giving "Watchdog detected hard
LOCKUP" type errors I've never seen before.
Decided I'd diagnose which part in the original server was bad. After
sitting turned off for a week, it suddenly started posting just fine.
But, with the hard drives back in it, I'm getting the same hard lockup
errors.
An Arch ISO DVD runs stress testing perfectly.
Btrfs-specific -----
The current problem I'm having must be a bad hard drive or corrupted data.
3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of
data, and 1GB of the 1GB of metadata.
sda appears to be going bad, with my low threshold of "going bad", and
will be replaced ASAP. It just developed 16 reallocated sectors, and
has 40 current pending sectors.
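Those two counts are the standard SMART attributes; with smartmontools
they can be watched with something like this (device name as in my
setup, attribute names as smartctl reports them):

    # reallocated and pending sector counts for the suspect drive
    smartctl -A /dev/sda | grep -E 'Reallocated_Sector|Current_Pending'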
I'm currently running a "btrfs scrub start -B -d -r /terra", which
status on another term shows me has found 32 errors after running for
an hour.
Question 1 - I'm expecting that if I re-run the scrub without the
read-only option, it will determine from the checksums which copy is
correct, and re-write the data on the drive with bad sectors (letting
the drive remap it to a new sector). Correct?
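If so, the follow-up would presumably look something like this (same
mount point as above; the exact status output varies by btrfs-progs
version):

    # read-write scrub this time, foreground, per-device stats
    btrfs scrub start -B -d /terra
    # error and correction counters afterwards
    btrfs scrub status -d /terra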
Question 2 - Before having run the scrub, booting off the raid with
bad sectors, would btrfs "on the fly" recognize it was getting bad
sector data with the checksum being off, and checking the other
drives? Or, is it expected that I could get a bad sector read in a
critical piece of operating system and/or kernel, which could be
causing my lockup issues?
Question 3 - Probably doesn't matter, but how can I see which files
(or metadata to files) the 40 current bad sectors are in? (On extX,
I'd use tune2fs and debugfs to be able to see this information.)
I do have hourly snapshots, from when it was properly running, so once
I'm that far in the process, I can also compare the most recent
snapshots, and see if there's any changes that happened to files that
shouldn't have.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey
@ 2015-10-20 4:45 ` Russell Coker
2015-10-20 13:00 ` Austin S Hemmelgarn
2015-10-20 18:54 ` Duncan
1 sibling, 1 reply; 15+ messages in thread
From: Russell Coker @ 2015-10-20 4:45 UTC (permalink / raw)
To: james harvey; +Cc: linux-btrfs
On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote:
> sda appears to be going bad, with my low threshold of "going bad", and
> will be replaced ASAP. It just developed 16 reallocated sectors, and
> has 40 current pending sectors.
>
> I'm currently running a "btrfs scrub start -B -d -r /terra", which
> status on another term shows me has found 32 errors after running for
> an hour.
https://www.gnu.org/software/ddrescue/
At this stage I would use ddrescue or something similar to copy data from the
failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing
data.
I wouldn't remove the disk entirely because then you lose badly if you get
another failure. I wouldn't use a BTRFS replace because you already have the
system apart and I expect ddrescue could copy the data faster. Also as the
drive has been causing system failures (I'm guessing a problem with the power
connector) you REALLY don't want BTRFS to corrupt data on the other disks. If
you have a system with the failing disk and a new disk attached then there's
no risk of further contamination.
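Roughly something like this (device names are placeholders; the mapfile
lets you stop, resume, and retry the bad areas later):

    # first pass: copy everything that reads cleanly, skip the bad areas
    ddrescue -f -n /dev/OLD /dev/NEW /root/rescue.map
    # second pass: go back and retry the bad areas a few times
    ddrescue -f -r3 /dev/OLD /dev/NEW /root/rescue.map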
> Question 2 - Before having run the scrub, booting off the raid with
> bad sectors, would btrfs "on the fly" recognize it was getting bad
> sector data with the checksum being off, and checking the other
> drives? Or, is it expected that I could get a bad sector read in a
> critical piece of operating system and/or kernel, which could be
> causing my lockup issues?
Unless you have disabled CoW then BTRFS will not return bad data.
> Question 3 - Probably doesn't matter, but how can I see which files
> (or metadata to files) the 40 current bad sectors are in? (On extX,
> I'd use tune2fs and debugfs to be able to see this information.)
Read all the files in the system and syslog will report it. But really don't
do that until after you have copied the disk.
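Something along these lines will do it (mount point from your message;
the exact wording of the kernel messages varies between versions):

    # force every file to be read once
    find /terra -xdev -type f -exec cat {} + > /dev/null
    # checksum failures show up in the kernel log as they are hit
    dmesg | grep -i 'btrfs.*csum'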
> I do have hourly snapshots, from when it was properly running, so once
> I'm that far in the process, I can also compare the most recent
> snapshots, and see if there's any changes that happened to files that
> shouldn't have.
Snapshots refer to the same data blocks, so if a data block is corrupted in a
way that BTRFS doesn't notice (which should be almost impossible) then all
snapshots will have it.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 4:45 ` Russell Coker
@ 2015-10-20 13:00 ` Austin S Hemmelgarn
2015-10-20 13:15 ` Russell Coker
0 siblings, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-20 13:00 UTC (permalink / raw)
To: Russell Coker, james harvey; +Cc: linux-btrfs
On 2015-10-20 00:45, Russell Coker wrote:
> On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote:
>> sda appears to be going bad, with my low threshold of "going bad", and
>> will be replaced ASAP. It just developed 16 reallocated sectors, and
>> has 40 current pending sectors.
>>
>> I'm currently running a "btrfs scrub start -B -d -r /terra", which
>> status on another term shows me has found 32 errors after running for
>> an hour.
>
> https://www.gnu.org/software/ddrescue/
>
> At this stage I would use ddrescue or something similar to copy data from the
> failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing
> data.
>
> I wouldn't remove the disk entirely because then you lose badly if you get
> another failure. I wouldn't use a BTRFS replace because you already have the
> system apart and I expect ddrescue could copy the data faster. Also as the
> drive has been causing system failures (I'm guessing a problem with the power
> connector) you REALLY don't want BTRFS to corrupt data on the other disks. If
> you have a system with the failing disk and a new disk attached then there's
> no risk of further contamination.
BIG DISCLAIMER: For the filesystem to be safely mountable it is
ABSOLUTELY NECESSARY to remove the old disk after doing a block level
copy of it. By all means, keep the disk around, but do not keep it
visible to the kernel after doing a block level copy of it. Also, you
will probably have to run 'btrfs device scan' after copying the disk and
removing it for the filesystem to work right. This is an inherent
result of how BTRFS's multi-device functionality works, and also applies
to doing stuff like LVM snapshots of BTRFS filesystems.
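In other words, the safe sequence looks roughly like this (device names
are placeholders; the sysfs delete assumes a SCSI/SATA disk, otherwise
just power down and physically disconnect it):

    # with the filesystem unmounted, block-level copy the failing disk
    ddrescue -f /dev/OLD /dev/NEW /root/rescue.map
    # make the old disk invisible to the kernel
    echo 1 > /sys/block/OLD/device/delete
    # let btrfs re-discover its member devices, then mount
    btrfs device scan
    mount /dev/NEW /terra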
>
>> Question 2 - Before having run the scrub, booting off the raid with
>> bad sectors, would btrfs "on the fly" recognize it was getting bad
>> sector data with the checksum being off, and checking the other
>> drives? Or, is it expected that I could get a bad sector read in a
>> critical piece of operating system and/or kernel, which could be
>> causing my lockup issues?
>
> Unless you have disabled CoW then BTRFS will not return bad data.
It is worth clarifying also that:
a. While BTRFS will not return bad data in this case, it also won't
automatically repair the corruption.
b. In the unlikely event that both copies are bad, trying to read the
data will return an IO error.
c. It is theoretically possible (although statistically impossible) that
the block could become corrupted, but the checksum could still be
correct (CRC32c is good at detecting small errors, but it's not hard to
generate a hash collision for any arbitrary value, so if a large portion
of the block goes bad, then it can theoretically still have a valid
checksum).
>
>> Question 3 - Probably doesn't matter, but how can I see which files
>> (or metadata to files) the 40 current bad sectors are in? (On extX,
>> I'd use tune2fs and debugfs to be able to see this information.)
>
> Read all the files in the system and syslog will report it. But really don't
> do that until after you have copied the disk.
It may also be possible to use some of the debug tools from BTRFS to do
this without hitting the disks so hard, but it will likely take a lot
more effort.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 13:00 ` Austin S Hemmelgarn
@ 2015-10-20 13:15 ` Russell Coker
2015-10-20 13:59 ` Austin S Hemmelgarn
0 siblings, 1 reply; 15+ messages in thread
From: Russell Coker @ 2015-10-20 13:15 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: james harvey, linux-btrfs
On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
> > https://www.gnu.org/software/ddrescue/
> >
> > At this stage I would use ddrescue or something similar to copy data from
> > the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
> > the missing data.
> >
> > I wouldn't remove the disk entirely because then you lose badly if you
> > get another failure. I wouldn't use a BTRFS replace because you already
> > have the system apart and I expect ddrescue could copy the data faster.
> > Also as the drive has been causing system failures (I'm guessing a
> > problem with the power connector) you REALLY don't want BTRFS to corrupt
> > data on the other disks. If you have a system with the failing disk and
> > a new disk attached then there's no risk of further contamination.
>
> BIG DISCLAIMER: For the filesystem to be safely mountable it is
> ABSOLUTELY NECESSARY to remove the old disk after doing a block level
You are correct, my message wasn't clear.
What I meant to say is that doing a "btrfs device remove" or "btrfs replace"
is generally a bad idea in such a situation. "btrfs replace" is pretty good
if you are replacing a disk with a larger one or replacing a disk that has
only minor errors (a disk that just gets a few bad sectors is unlikely to get
many more in a hurry).
> copy of it. By all means, keep the disk around, but do not keep it
> visible to the kernel after doing a block level copy of it. Also, you
> will probably have to run 'btrfs device scan' after copying the disk and
> removing it for the filesystem to work right. This is an inherent
> result of how BTRFS's multi-device functionality works, and also applies
> to doing stuff like LVM snapshots of BTRFS filesystems.
Good advice. I recommend just rebooting the system. I think that anyone
who has the background knowledge to do such things without rebooting will
probably just do it without needing to ask us for advice.
> >> Question 2 - Before having run the scrub, booting off the raid with
> >> bad sectors, would btrfs "on the fly" recognize it was getting bad
> >> sector data with the checksum being off, and checking the other
> >> drives? Or, is it expected that I could get a bad sector read in a
> >> critical piece of operating system and/or kernel, which could be
> >> causing my lockup issues?
> >
> > Unless you have disabled CoW then BTRFS will not return bad data.
>
> It is worth clarifying also that:
> a. While BTRFS will not return bad data in this case, it also won't
> automatically repair the corruption.
Really? If so I think that's a bug in BTRFS. When mounted rw I think that
every time corruption is discovered it should be automatically fixed.
> b. In the unlikely event that both copies are bad, trying to read the
> data will return an IO error.
> c. It is theoretically possible (although statistically impossible) that
> the block could become corrupted, but the checksum could still be
> correct (CRC32c is good at detecting small errors, but it's not hard to
> generate a hash collision for any arbitrary value, so if a large portion
> of the block goes bad, then it can theoretically still have a valid
> checksum).
It would be interesting to see some research into how CRC32 fits with the more
common disk errors. For a disk to return bad data and claim it to be good the
data must either be a misplaced write or read (which is almost certain to be
caught by BTRFS as the metadata won't match), or a random sector that matches
the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC
protected block much more difficult?
> >> Question 3 - Probably doesn't matter, but how can I see which files
> >> (or metadata to files) the 40 current bad sectors are in? (On extX,
> >> I'd use tune2fs and debugfs to be able to see this information.)
> >
> > Read all the files in the system and syslog will report it. But really
> > don't do that until after you have copied the disk.
>
> It may also be possible to use some of the debug tools from BTRFS to do
> this without hitting the disks so hard, but it will likely take a lot
> more effort.
I don't think that you can do that without hitting the disks hard.
That said last time I checked (last time an executive of a hard drive
manufacturer was willing to talk to me) drives were apparently designed to
perform any sequence of operations for their warranty period. So for a disk
that is believed to be good this shouldn't be a problem. For a disk that is
known to be dying it would be a really bad idea to do anything other than copy
the data off at maximum speed.
--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 13:15 ` Russell Coker
@ 2015-10-20 13:59 ` Austin S Hemmelgarn
2015-10-20 19:20 ` Duncan
0 siblings, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-20 13:59 UTC (permalink / raw)
To: Russell Coker; +Cc: james harvey, linux-btrfs
On 2015-10-20 09:15, Russell Coker wrote:
> On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote:
>>> https://www.gnu.org/software/ddrescue/
>>>
>>> At this stage I would use ddrescue or something similar to copy data from
>>> the failing disk to a fresh disk, then do a BTRFS scrub to regenerate
>>> the missing data.
>>>
>>> I wouldn't remove the disk entirely because then you lose badly if you
>>> get another failure. I wouldn't use a BTRFS replace because you already
>>> have the system apart and I expect ddrescue could copy the data faster.
>>> Also as the drive has been causing system failures (I'm guessing a
>>> problem with the power connector) you REALLY don't want BTRFS to corrupt
>>> data on the other disks. If you have a system with the failing disk and
>>> a new disk attached then there's no risk of further contamination.
>>
>> BIG DISCLAIMER: For the filesystem to be safely mountable it is
>> ABSOLUTELY NECESSARY to remove the old disk after doing a block level
>
> You are correct, my message wasn't clear.
>
> What I meant to say is that doing a "btrfs device remove" or "btrfs replace"
> is generally a bad idea in such a situation. "btrfs replace" is pretty good
> if you are replacing a disk with a larger one or replacing a disk that has
> only minor errors (a disk that just gets a few bad sectors is unlikely to get
> many more in a hurry).
I kind of figured that was what you meant, I just wanted to make it as
clear as possible, because this is something that has bitten me in the
past. It's worth noting though that there is an option for 'btrfs
replace' to avoid reading from the device being replaced if at all
possible. I've used that option myself a couple of times when
re-provisioning my systems, and it works well (although I used it to
just control what disks were getting IO sent to them, not because any of
them were bad).
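For reference, that's the -r flag; such a replace looks roughly like
this (devices and mount point are placeholders):

    # read from the other mirror where possible, not from the outgoing disk
    btrfs replace start -r /dev/OLD /dev/NEW /terra
    btrfs replace status /terra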
>
>> copy of it. By all means, keep the disk around, but do not keep it
>> visible to the kernel after doing a block level copy of it. Also, you
>> will probably have to run 'btrfs device scan' after copying the disk and
>> removing it for the filesystem to work right. This is an inherent
>> result of how BTRFS's multi-device functionality works, and also applies
>> to doing stuff like LVM snapshots of BTRFS filesystems.
>
> Good advice. I recommend just rebooting the system. I think that anyone
> who has the background knowledge to do such things without rebooting will
> probably just do it without needing to ask us for advice.
Normally I would agree, but given the boot issues that were mentioned
WRT the system in question, it may be safer to just use 'btrfs dev scan'
without rebooting (unless of course the system doesn't properly support
SATA hot-plug/hot-remove).
>
>>>> Question 2 - Before having run the scrub, booting off the raid with
>>>> bad sectors, would btrfs "on the fly" recognize it was getting bad
>>>> sector data with the checksum being off, and checking the other
>>>> drives? Or, is it expected that I could get a bad sector read in a
>>>> critical piece of operating system and/or kernel, which could be
>>>> causing my lockup issues?
>>>
>>> Unless you have disabled CoW then BTRFS will not return bad data.
>>
>> It is worth clarifying also that:
>> a. While BTRFS will not return bad data in this case, it also won't
>> automatically repair the corruption.
>
> Really? If so I think that's a bug in BTRFS. When mounted rw I think that
> every time corruption is discovered it should be automatically fixed.
That's debatable. While it is safer to try and do this with BTRFS than
say with MD-RAID, it's still not something many seasoned system
administrators would want happening behind their back. It's worth
noting that ZFS does not automatically fix errors, it just reports them
and works around them, and many distributed storage options (like Ceph
for example) behave like this also. All that the checksum mismatch
really tells you is that at some point, the data got corrupted, it could
be that the copy on the disk is bad, but it could also be caused by bad
RAM, a bad storage controller, a loose cable, or even a bad power supply.
>
>> b. In the unlikely event that both copies are bad, trying to read the
>> data will return an IO error.
>> c. It is theoretically possible (although statistically impossible) that
>> the block could become corrupted, but the checksum could still be
>> correct (CRC32c is good at detecting small errors, but it's not hard to
>> generate a hash collision for any arbitrary value, so if a large portion
>> of the block goes bad, then it can theoretically still have a valid
>> checksum).
>
> It would be interesting to see some research into how CRC32 fits with the more
> common disk errors. For a disk to return bad data and claim it to be good the
> data must either be a misplaced write or read (which is almost certain to be
> caught by BTRFS as the metadata won't match), or a random sector that matches
> the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC
> protected block much more difficult?
In general, most disk errors will be just a few flipped bits. For a
single bit flip in a data stream, a CRC is 100% guaranteed to change,
the same goes for any odd number of bit flips in the data stream. For
an even number of bit flips however, the chance that there will be a
collision is proportionate to the size of the CRC, and for 32-bits it's
a statistical impossibility that there will be a collision due to two
bits flipping without there being some malicious intent involved. Once
you get to larger numbers of bit flips and bigger blocks of data, it
becomes more likely. The chances of a collision with a 4k block with
any random set of bit flips is astronomically small, and it's only
marginally larger with 16k blocks (which are the default right now for
BTRFS).
>
>>>> Question 3 - Probably doesn't matter, but how can I see which files
>>>> (or metadata to files) the 40 current bad sectors are in? (On extX,
>>>> I'd use tune2fs and debugfs to be able to see this information.)
>>>
>>> Read all the files in the system and syslog will report it. But really
>>> don't do that until after you have copied the disk.
>>
>> It may also be possible to use some of the debug tools from BTRFS to do
>> this without hitting the disks so hard, but it will likely take a lot
>> more effort.
>
> I don't think that you can do that without hitting the disks hard.
Ah, you're right, I forgot that there's no way on most hard disks to get
the LBA's of the reallocated sectors, which would be required to use the
debug tools to get the files.
>
> That said last time I checked (last time an executive of a hard drive
> manufacturer was willing to talk to me) drives were apparently designed to
> perform any sequence of operations for their warranty period. So for a disk
> that is believed to be good this shouldn't be a problem. For a disk that is
> known to be dying it would be a really bad idea to do anything other than copy
> the data off at maximum speed.
Well yes, but the less stress you put on something, the longer it's
likely to last. And if you actually care about the data, you should
have backups (or some other way of trivially reproducing it).
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey
2015-10-20 4:45 ` Russell Coker
@ 2015-10-20 18:54 ` Duncan
2015-10-20 19:48 ` Austin S Hemmelgarn
1 sibling, 1 reply; 15+ messages in thread
From: Duncan @ 2015-10-20 18:54 UTC (permalink / raw)
To: linux-btrfs
james harvey posted on Tue, 20 Oct 2015 00:16:15 -0400 as excerpted:
> Background -----
>
> My fileserver had a "bad event" last week. Shut it down normally to add
> a new hard drive, and it would no longer post. Tried about 50 times,
> doing the typical everything non-essential unplugged, trying 1 of 4
> memory modules at a time, and 1 of 2 processors at a time. Got
> nowhere.
>
> Inexpensive HP workstation, so purchased a used identical model
> (complete other than hard drives) on eBay. Replacement arrived today.
> Posts fine. Moved hard drives over (again, identical model, and Arch
> Linux not Windows) and it started giving "Watchdog detected hard LOCKUP"
> type errors I've never seen before.
>
> Decided I'd diagnose which part in the original server was bad. After
> sitting turned off for a week, it suddenly started posting just fine.
> But, with the hard drives back in it, I'm getting the same hard lockup
> errors.
>
> An Arch ISO DVD runs stress testing perfectly.
>
> Btrfs-specific -----
>
> The current problem I'm having must be a bad hard drive or corrupted
> data.
>
> 3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of
> data, and 1GB of the 1GB of metadata.
>
> sda appears to be going bad, with my low threshold of "going bad", and
> will be replaced ASAP. It just developed 16 reallocated sectors, and
> has 40 current pending sectors.
>
> I'm currently running a "btrfs scrub start -B -d -r /terra", which
> status on another term shows me has found 32 errors after running for an
> hour.
>
> Question 1 - I'm expecting that if I re-run the scrub without the
> read-only option, it will determine from the checksums which copy is
> correct, and re-write the data on the drive with bad sectors (letting
> the drive remap it to a new sector). Correct?
I actually ran a number of independent btrfs raid1 filesystems[1] on a
pair of ssds, with one of the ssds slowly dying, with more and more
reallocated sectors over time, for something like six months.[2] SMART
started with a 254 "cooked" value for reallocated sectors, immediately
dropped to what was apparently the percentage still good (still rounding
to 100) on first sector replace (according to the raw value), and dropped
to about 85 (again, %) during the continued usage time, with a threshold
value of IIRC 36, so I never came close on that value. The raw-read-
error-rate value did drop into failing-now a couple times near the end,
when I'd do scrubs and get dozens of reallocated sectors in just a few
minutes, but it'd recover on reboot and report failing-in-the-past, and
it wouldn't trip into failing mode unless I had the system off for awhile
and then did a scrub of several of those independent btrfs in quick
succession.
Anyway, yes, as long as the other copy is good, btrfs scrub does fix up
the problems without much pain beyond the wait time (which was generally
under a minute per btrfs, all under 50 gig each, on the ssds).
Tho I should mention: If btrfs returns any unverified errors, rerun the
scrub, and it'll likely fix more. I'm not absolutely sure what
these actually are in btrfs terms, but I took them to be places where
metadata checksum errors occurred, where that metadata in turn had
checksums of data and metadata further down (up?) the tree, closer to the
data. Only after those metadata blocks were scrubbed in an early pass
could a later pass actually verify their checksums and thus rely on the
checksums they in turn contained, for metadata blocks closer to the data
or for the data itself. Sometimes I'd end up rerunning scrub a few times
(never more than five, IIRC, however), almost always correcting fewer
errors each time, tho it'd occasionally jump up a bit for one pass,
before dropping again on the one after that.
But keep rerunning the scrub as long as it returns unverified errors and
you should eventually fix everything, assuming of course that the second
copy is always valid.
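What I effectively did by hand amounts to a crude loop like this
(untested as a script, and the status wording may differ between
btrfs-progs versions):

    while :; do
        btrfs scrub start -Bd /terra
        # stop once no unverified errors are being reported
        btrfs scrub status /terra | grep -q 'unverified errors: [1-9]' || break
    done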
Obviously this was rather easier for me, however, at under a minute per
filesystem scrub run and generally under 15 minutes total for the
multiple runs on multiple filesystems (tho I didn't always verify /all/
btrfs, only the ones I normally mounted), than it's going to be for you,
at over an hour reported and still going. At hours per run, it'll
require some patience...
I had absolutely zero scrub failures here, because as I said my second
ssd was (and remains) absolutely solid.
> Question 2 - Before having run the scrub, booting off the raid with bad
> sectors, would btrfs "on the fly" recognize it was getting bad sector
> data with the checksum being off, and checking the other drives? Or, is
> it expected that I could get a bad sector read in a critical piece of
> operating system and/or kernel, which could be causing my lockup issues?
"With the checksums being off" is unfortunately ambiguous.
Do you mean with the nodatasum mount option and/or nocow set, so btrfs
wasn't checksumming, or do you mean (as I assume you do) with the
checksums on, but simply failing to verify due to the hardware errors?
If you mean the first... if there's no checksum to verify, as would be
the case with nocow files since that turns off checksumming as well...
then btrfs, as most other filesystems, simply returns whatever it gets
from the hardware, because it doesn't have checksums to verify it
against. But no checksum stored normally only applies to data (and a few
misc things like the free-space-cache, accounting for the non-zero no-
checksums numbers you may see even if you haven't turned off cow or
checksumming on anything); metadata is always checksummed.
If you mean the second, "off" actually meaning "on but failing to
verify", as I suspect you do, then yes, btrfs should always reach for the
second copy when it finds the first one invalid.
But tho I'm a user not a dev and thus haven't actually checked the source
code itself, my belief here is with Russ and disagrees with Austin, as
based on what I've read both on the wiki and seen here previously, btrfs
runtime (that is, not during scrub) actually repairs the problem on-
hardware as well, from that second copy, not just fetching it for use
without the repair, the distinction between normal runtime error
detection and scrub thus being that scrub systematically checks
everything, while normal runtime on most systems will only check the
stuff it reads in normal usage, thus getting the stuff that's regularly
used, but not the stuff that's only stored and never read.
*WARNING*: From my experience at least, at least on initial mount, btrfs
isn't particularly robust when the number of read errors on one device
start to go up dramatically. Despite never seeing an error in scrub that
it couldn't fix, twice I had enough reads fail on a mount that the mount
itself failed and I couldn't mount successfully despite repeated
attempts. In both cases, I was able to use btrfs restore to restore the
contents of the filesystem to some other place (as it happens, the
reiserfs on spinning rust I use for my media filesystem, since being for
big media files, that had enough space to recover the, as I said above,
reasonably small btrfs into), and ultimately recreating the filesystem
using mkfs.btrfs.
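For anyone who hasn't used it, the restore invocation is essentially
just the following, run against one of the member devices with the
filesystem unmounted (destination path and device name are placeholders):

    mkdir -p /mnt/recovery
    btrfs restore -v /dev/sdX /mnt/recovery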
But given that despite not being able to mount, neither SMART nor dmesg
ever mentioned anything about the "good" device having errors, I'm left
to conclude that btrfs itself ultimately crashed on attempt to mount the
filesystem, even tho only the one copy was bad. After a couple of those
events I started scrubbing much more frequently, thus fixing the errors
while btrfs could still mount the filesystem and /let/ me run a scrub.
It was actually those more frequent scrubs that quickly became the hassle
and led me to give up on the device. If btrfs had been able to fall
back to the second/valid copy even in that case, as it really should have
done, then I would have very possibly waited quite a bit longer to
replace the dying device.
So on that one I'd say to be sure, get confirmation either directly from
the code (if you can read it) or from a dev who has actually looked at it
and is basing his post on that, tho I still /believe/ btrfs runtime-
corrects checksumming issues actually on-device, if there's a validating
second copy it can use to do so.
> Question 3 - Probably doesn't matter, but how can I see which files (or
> metadata to files) the 40 current bad sectors are in? (On extX,
> I'd use tune2fs and debugfs to be able to see this information.)
Here, a read-only scrub seemed to print the path to the bad file -- when
there was one, sometimes it was a metadata block and thus not
specifically identifiable. Writable scrubs seemed to print the info
sometimes but not always. I'm actually confused as to why, but I did
specifically observe btrfs scrub printing path names in read-only mode
that it didn't always appear to print in writable mode. I didn't look
extremely carefully, however, or compare the outputs side-by-side, so
maybe I just missed it in the writable/fix-it mode output.
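In practice, something like this gets you most of the way there (the
kernel message wording varies a bit between versions, and
logical-resolve needs the filesystem mounted):

    # paths get logged as data checksum errors are hit by scrub or reads
    dmesg | grep -i 'checksum error'
    # if only a logical address was printed, map it back to a file
    btrfs inspect-internal logical-resolve <logical> /terra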
> I do have hourly snapshots, from when it was properly running, so once
> I'm that far in the process, I can also compare the most recent
> snapshots, and see if there's any changes that happened to files that
> shouldn't have.
Hourly snapshots:
Note that btrfs has significant scaling issues with snapshots, etc, when
the number reaches into the tens of thousands. If you're doing such
scheduled snapshots (and not already doing scheduled thinning), the
strong recommendation is to schedule reasonable snapshot thinning as well.
Think about it. If you need to retrieve something from a snapshot a year
ago, are you going to really know or care what specific hour it was?
Unlikely. You'll almost certainly be just fine finding the correct day,
and a year out, you'll very possibly be just fine with weekly, monthly or
even quarterly. If they haven't been thinned, all those many many
hourly snapshots will simply make it harder to efficiently find and use
the one you actually need amongst all the "noise".
So do hourly snapshots for say six hours (6, plus up to 6 more before the
thin drops 5 of them, so 12 max), then thin to six-hourly. Keep your
four-a-day-six-hourly snapshots for a couple days (8-12, plus the 6-12
for the last six hours, up to 24 total), and thin to 2-a-day-12-hourly.
Keep those for a week and thin to daily (12-26, up to 50 total), and those
for another week (6-13, up to 63) before dropping to weekly. That's two
weeks of snapshots so far. Keep the weekly snapshots out to a quarter
(13 weeks so 11 more, plus another 13 before thinning, 11-24, up to 87
total).
At a quarter, you really should be thinking about proper non-snapshot
full data backup, if you haven't before now, after which you can drop the
older snapshots, thereby freeing extents that only the old snapshots were
still referencing. But you'll want to keep a quarter's snapshots at all
times so will continue to accumulate another 13 weeks of snapshots before
you drop the quarter back. That's a total of 100 snapshots, max.
At 100 snapshots per subvolume, you can have 10 subvolumes' worth before
hitting 1000 snapshots on the filesystem. A target of under 1000
snapshots per filesystem should keep scaling issues due to those
snapshots to a minimum.
If the 100 snapshots per subvolume snapshot thinning program I suggested
above is too strict for you, try to keep it to say 250 per subvolume
anyway, which would give you 8 subvolumes' worth at the 2000 snapshot per
filesystem target. I would definitely try to keep it below that, because
between there and 10k the scaling issues take larger and larger bites out
of your btrfs maintenance command (check, balance) efficiency, and the
time to complete those commands will go up drastically. At 100k, the
time for maintenance can be weeks, so it's generally easier to just kill
it and restore from backup, if indeed your pain threshold hasn't already
been reached at 10k.
Hopefully it's not already a problem for you... 365 days @ 24 hours per
day is already ~8700 snaps, so it could be if you've been running it a
year and haven't thinned, even if there's just the single subvolume being
snapshotted.
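If you'd rather not script the thinning yourself, snapper's timeline
cleanup can approximate that sort of schedule; roughly, in its
per-subvolume config (the numbers here just illustrate the thinning idea
above, they're not a recommendation):

    TIMELINE_CREATE="yes"
    TIMELINE_CLEANUP="yes"
    TIMELINE_LIMIT_HOURLY="12"
    TIMELINE_LIMIT_DAILY="14"
    TIMELINE_LIMIT_WEEKLY="13"
    TIMELINE_LIMIT_MONTHLY="3"
    TIMELINE_LIMIT_YEARLY="0"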
Similarly, BTW, with btrfs quotas, except that btrfs quotas are still
broken anyway, so unless you're actively working with the devs to test/
trace/fix them, either you need quota features and thus should be using a
filesystem more stable and mature than btrfs where they work reliably, or
you don't, so you can run btrfs while keeping quotas off. That'll
dramatically reduce the overhead/tracking work btrfs has to do right
there, eliminating both that overhead and any brokenness related to btrfs
quota bugs in one whack.
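Turning them off, if they somehow got enabled, is a one-liner per
mounted filesystem, e.g.:

    btrfs quota disable /terra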
---
[1] A number of independent btrfs... on a pair of ssds, with the ssds
partitioned up identically and multiple independent small btrfs, each on
its own set of parallel partitions on the two ssds. Multiple independent
btrfs instead of subvolumes or similar on a single filesystem, because I
don't want all my data eggs in the same single filesystem basket, such
that if that single filesystem goes down, everything goes with it.
[2] Why continue to run a known-dying ssd for six months? Simple. The
other ssd of the pair never had a single reallocated sector or
indications of any other problems the entire time, and btrfs'
checksumming and data integrity features, along with backups, gave me a
chance to actually play with the dying ssd for a few months without
risking real data loss. And I had never had that opportunity before and
was curious to see how the problem would develop over time, plus it gave
me some real useful experience with btrfs raid1 scrubs and recoveries.
So I took the opportunity that presented itself. =:^)
Eventually, however, I was scrubbing and correcting significant errors
after every shutdown of hours and/or after every major system update, and
by then the novelty had worn off, so I eventually just gave up and did
the btrfs replace to another ssd I had as a spare the entire time.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 13:59 ` Austin S Hemmelgarn
@ 2015-10-20 19:20 ` Duncan
2015-10-20 19:59 ` Austin S Hemmelgarn
0 siblings, 1 reply; 15+ messages in thread
From: Duncan @ 2015-10-20 19:20 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as
excerpted:
>>> It is worth clarifying also that:
>>> a. While BTRFS will not return bad data in this case, it also won't
>>> automatically repair the corruption.
>>
>> Really? If so I think that's a bug in BTRFS. When mounted rw I think
>> that every time corruption is discovered it should be automatically
>> fixed.
> That's debatable. While it is safer to try and do this with BTRFS than
> say with MD-RAID, it's still not something many seasoned system
> administrators would want happening behind their back. It's worth
> noting that ZFS does not automatically fix errors, it just reports them
> and works around them, and many distributed storage options (like Ceph
> for example) behave like this also. All that the checksum mismatch
> really tells you is that at some point, the data got corrupted, it could
> be that the copy on the disk is bad, but it could also be caused by bad
> RAM, a bad storage controller, a loose cable, or even a bad power
> supply.
There's a significant difference between btrfs in dup/raid1/raid10 modes
anyway and some of the others you mentioned, however. Btrfs in these
modes actually has a second copy of the data itself available. That's a
world of difference compared to parity, for instance. With parity you're
reconstructing the data and thus have dangers such as the write hole, and
the possibility of bad-ram corrupting the data before it was ever saved
(this last one being the reason zfs has such strong recommendations/
warnings regarding the use of non-ecc RAM, based on what a number of
posters with zfs experience have said, here). With btrfs, there's an
actual second copy, with both copies covered by checksum. If one of the
copies verifies against its checksum and the other doesn't, the odds of
the one that verifies being any worse than the one that doesn't are...
pretty slim, to say the least. (So slim I'd intuitively compare them to
the odds of getting hit by lightning, tho I've no idea what the
mathematically rigorous comparison might be.)
Yes, there's some small but not infinitesimal chance the checksum may be
wrong, but if there's two copies of the data and the checksum on one is
wrong while the checksum on the other verifies... yes, there's still that
small chance that the one that verifies is wrong too, but that it's any
worse than the one that does not verify? /That's/ getting close to
infinitesimal, or at least close enough for the purposes of a mailing-
list claim without links to supporting evidence by someone who has
already characterized it as not mathematically rigorous... and for me,
personally. I'm not spending any serious time thinking about getting hit
by lightning, either, tho by the same token I don't go out flying kites
or waving long metal rods around in lightning storms, either.
Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable
or mature, and that the chances of just plain old bugs killing the
filesystem are far *FAR* higher than of a verified-checksum copy being
any worse than a failed-checksum copy. If you're worried about that at
this point, why are you even on the btrfs list in the first place?
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 18:54 ` Duncan
@ 2015-10-20 19:48 ` Austin S Hemmelgarn
2015-10-20 21:24 ` Duncan
0 siblings, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-20 19:48 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-10-20 14:54, Duncan wrote:
> But tho I'm a user not a dev and thus haven't actually checked the source
> code itself, my belief here is with Russ and disagrees with Austin, as
> based on what I've read both on the wiki and seen here previously, btrfs
> runtime (that is, not during scrub) actually repairs the problem on-
> hardware as well, from that second copy, not just fetching it for use
> without the repair, the distinction between normal runtime error
> detection and scrub thus being that scrub systematically checks
> everything, while normal runtime on most systems will only check the
> stuff it reads in normal usage, thus getting the stuff that's regularly
> used, but not the stuff that's only stored and never read.
>
> *WARNING*: From my experience at least, at least on initial mount, btrfs
> isn't particularly robust when the number of read errors on one device
> start to go up dramatically. Despite never seeing an error in scrub that
> it couldn't fix, twice I had enough reads fail on a mount that the mount
> itself failed and I couldn't mount successfully despite repeated
> attempts. In both cases, I was able to use btrfs restore to restore the
> contents of the filesystem to some other place (as it happens, the
> reiserfs on spinning rust I use for my media filesystem, since being for
> big media files, that had enough space to recover the, as I said above,
> reasonably small btrfs into), and ultimately recreating the filesystem
> using mkfs.btrfs.
>
> But given that despite not being able to mount, neither SMART nor dmesg
> ever mentioned anything about the "good" device having errors, I'm left
> to conclude that btrfs itself ultimately crashed on attempt to mount the
> filesystem, even tho only the one copy was bad. After a couple of those
> events I started scrubbing much more frequently, thus fixing the errors
> while btrfs could still mount the filesystem and /let/ me run a scrub.
> It was actually those more frequent scrubs that quickly became the hassle
> and led me to give up on the device. If btrfs had been able to fall
> back to the second/valid copy even in that case, as it really should have
> done, then I would have very possibly waited quite a bit longer to
> replace the dying device.
>
> So on that one I'd say to be sure, get confirmation either directly from
> the code (if you can read it) or from a dev who has actually looked at it
> and is basing his post on that, tho I still /believe/ btrfs runtime-
> corrects checksumming issues actually on-device, if there's a validating
> second copy it can use to do so.
>
FWIW, my assessment is based on some testing I did a while back (kernel
3.14 IIRC) using a VM. The (significantly summarized of course)
procedure I used was:
1. Create a basic minimalistic Linux system in a VM (in my case, I just
used a stage3 tarball for Gentoo, with a paravirtualized Xen domain)
using BTRFS as the root filesystem with a raid1 setup. Make sure and
verify that it actually boots.
2. Shutdown the VM, use btrfs-progs on the host to find the physical
location of an arbitrary file (ideally one that is not touched at all
during the boot process, IIRC, I think I used one of the e2fsprogs
binaries), and then intentionally clear the CRC in one of the copies of
a block from the file.
3. Boot the VM, read the file.
4. Shutdown the VM again.
5. Verify whether the file block you cleared the checksum on has a valid
checksum now.
I repeated this more than a dozen times using different files and
different methods of reading the file, and each time the CRC I had
cleared was untouched. Based on this, unless BTRFS does some kind of
deferred re-write that doesn't get forced during a clean unmount of the
FS, I felt it was relatively safe to conclude that it did not
automatically fix corrupted blocks. I did not, however, test corrupting
the block itself instead of the checksum, but I doubt that that would
impact anything in this case.
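For anyone wanting to reproduce the find-the-physical-location part of
step 2, the rough idea on the host (VM shut down) is something like the
following. Note that on btrfs, filefrag's "physical" column is actually
the btrfs logical address, and the checksum itself lives in the csum
tree, which you then have to locate and edit separately, so treat this
strictly as a sketch of the approach:

    # btrfs logical offset of the file's first extent
    filefrag -v /path/to/file
    # on-disk location of each RAID1 copy of that logical address
    btrfs-map-logical -l <logical> /dev/sdX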
As I mentioned, many veteran sysadmins would not want the FS driver
automatically fixing this without at least having some kind of
notification. This preference largely dates back to traditional RAID1,
where the system has no way to know for certain which copy is correct in
the case of a mismatch, and therefore to safely fix mismatches, the
admin needs to intervene. While it is possible to fix this safely
because of how BTRFS is designed, there is still the possibility of it
getting things wrong. There was one time I had a BTRFS raid1 filesystem
where one copy of a block got corrupted but miraculously had a correct
CRC (which is statistically impossible), and the other copy of the block
was correct, but the CRC for it was wrong (which, while unlikely, is
very much possible). In such a case (which was a serious pain to
debug), automatically 'fixing' the supposedly bad block would have
resulted in data loss. Of course, the chance that happening more than
once in a lifetime is astronomically small, but it is still possible.
It's also worth noting that ZFS has been considered mature for more than
a decade now, and the ZFS developers _still_ aren't willing to risk
their user's data with something like this, which should be an immediate
red flag for anyone developing a filesystem with features like ZFS.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 19:20 ` Duncan
@ 2015-10-20 19:59 ` Austin S Hemmelgarn
2015-10-20 20:54 ` Tim Walberg
2015-10-21 11:51 ` Austin S Hemmelgarn
0 siblings, 2 replies; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-20 19:59 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-10-20 15:20, Duncan wrote:
> Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as
> excerpted:
>
>
>>>> It is worth clarifying also that:
>>>> a. While BTRFS will not return bad data in this case, it also won't
>>>> automatically repair the corruption.
>>>
>>> Really? If so I think that's a bug in BTRFS. When mounted rw I think
>>> that every time corruption is discovered it should be automatically
>>> fixed.
>> That's debatable. While it is safer to try and do this with BTRFS than
>> say with MD-RAID, it's still not something many seasoned system
>> administrators would want happening behind their back. It's worth
>> noting that ZFS does not automatically fix errors, it just reports them
>> and works around them, and many distributed storage options (like Ceph
>> for example) behave like this also. All that the checksum mismatch
>> really tells you is that at some point, the data got corrupted, it could
>> be that the copy on the disk is bad, but it could also be caused by bad
>> RAM, a bad storage controller, a loose cable, or even a bad power
>> supply.
>
> There's a significant difference between btrfs in dup/raid1/raid10 modes
> anyway and some of the others you mentioned, however. Btrfs in these
> modes actually has a second copy of the data itself available. That's a
> world of difference compared to parity, for instance. With parity you're
> reconstructing the data and thus have dangers such as the write hole, and
> the possibility of bad-ram corrupting the data before it was ever saved
> (this last one being the reason zfs has such strong recommendations/
> warnings regarding the use of non-ecc RAM, based on what a number of
> posters with zfs experience have said, here). With btrfs, there's an
> actual second copy, with both copies covered by checksum. If one of the
> copies verifies against its checksum and the other doesn't, the odds of
> the one that verifies being any worse than the one that doesn't are...
> pretty slim, to say the least. (So slim I'd intuitively compare them to
> the odds of getting hit by lightning, tho I've no idea what the
> mathematically rigorous comparison might be.)
ZFS doesn't just do parity, it also does RAID1 and RAID10 (and RAID0,
although I doubt that most people actually use that with ZFS), and Ceph
uses n-way replication by default, not erasure coding (which is
technically a super-set of the parity algorithms used for RAID[56]). In
both cases, they behave just like BTRFS, they log the error and fetch a
good copy to return to userspace, but do not modify the copy with the
error unless explicitly told to do so.
>
> Yes, there's some small but not infinitesimal chance the checksum may be
> wrong, but if there's two copies of the data and the checksum on one is
> wrong while the checksum on the other verifies... yes, there's still that
> small chance that the one that verifies is wrong too, but that it's any
> worse than the one that does not verify? /That's/ getting close to
> infinitesimal, or at least close enough for the purposes of a mailing-
> list claim without links to supporting evidence by someone who has
> already characterized it as not mathematically rigorous... and for me,
> personally. I'm not spending any serious time thinking about getting hit
> by lightning, either, tho by the same token I don't go out flying kites
> or waving long metal rods around in lightning storms, either.
With a 32-bit checksum and a 4k block (the math is easier with smaller
numbers), that's 4128 bits, which means that a random single bit error
will have a approximately 0.24% chance of occurring in a given bit,
which translates to an approximately 7.75% chance that it will occur in
one of the checksum bits. For a 16k block it's smaller of course
(around 1.8% I think, but that's just a guess), but it's still
sufficiently statistically likely that it should be considered.
>
> Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable
> or mature, and that the chances of just plain old bugs killing the
> filesystem are far *FAR* higher than of a verified-checksum copy being
> any worse than a failed-checksum copy. If you're worried about that at
> this point, why are you even on the btrfs list in the first place?
Actually, the improved data safety relative to ext4 is just a bonus for
me, my biggest reason for using BTRFS is the ease of reprovisioning
(there are few other ways to move entire systems to new storage devices
online with zero downtime).
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 19:59 ` Austin S Hemmelgarn
@ 2015-10-20 20:54 ` Tim Walberg
2015-10-21 11:51 ` Austin S Hemmelgarn
1 sibling, 0 replies; 15+ messages in thread
From: Tim Walberg @ 2015-10-20 20:54 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Duncan, linux-btrfs
On 10/20/2015 15:59 -0400, Austin S Hemmelgarn wrote:
>> .........
>> With a 32-bit checksum and a 4k block (the math is easier with
>> smaller numbers), that's 4128 bits, which means that a random
>> single bit error will have a approximately 0.24% chance of
>> occurring in a given bit, which translates to an approximately
>> 7.75% chance that it will occur in one of the checksum bits. For a
>> 16k block it's smaller of course (around 1.8% I think, but that's
>> just a guess), but it's still sufficiently statistically likely
>> that it should be considered.
>> .........
Last I checked, a 4 kilo-BYTE block consisted of 32768 BITs... So the
percentages should in fact be considerably smaller than that.
--
twalberg@gmail.com
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 19:48 ` Austin S Hemmelgarn
@ 2015-10-20 21:24 ` Duncan
0 siblings, 0 replies; 15+ messages in thread
From: Duncan @ 2015-10-20 21:24 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Tue, 20 Oct 2015 15:48:07 -0400 as
excerpted:
> FWIW, my assessment is based on some testing I did a while back (kernel
> 3.14 IIRC) using a VM. The (significantly summarized of course)
> procedure I used was:
> 1. Create a basic minimalistic Linux system in a VM (in my case, I just
> used a stage3 tarball for Gentoo, with a paravirtualized Xen domain)
> using BTRFS as the root filesystem with a raid1 setup. Make sure and
> verify that it actually boots.
> 2. Shutdown the VM, use btrfs-progs on the host to find the physical
> location of an arbitrary file (ideally one that is not touched at all
> during the boot process, IIRC, I think I used one of the e2fsprogs
> binaries), and then intentionally clear the CRC in one of the copies of
> a block from the file.
> 3. Boot the VM, read the file.
> 4. Shutdown the VM again.
> 5. Verify whether the file block you cleared the checksum on has a valid
> checksum now.
>
> I repeated this more than a dozen times using different files and
> different methods of reading the file, and each time the CRC I had
> cleared was untouched. Based on this, unless BTRFS does some kind of
> deferred re-write that doesn't get forced during a clean unmount of the
> FS, I felt it was relatively safe to conclude that it did not
> automatically fix corrupted blocks. I did not however, test corrupting
> the block itself instead of the checksum, but I doubt that that would
> impact anything in this case.
AFAIK:
1) It would only run into the corruption if the raid1 read-scheduler
picked that copy based on the even/odd of the requesting PID.
However, statistically that should be a 50% hit rate and if you tested
more than a dozen times, you'd have quite the luck to fail to hit it on
at least /one/ of them.
2) (Based on what I understood from the discussion of btrfs check's init-
csum-tree patches a couple cycles ago, before which it was clearing but
not reinitializing...) Btrfs interprets missing checksums differently
than invalid checksums. Would your "cleared" CRC be interpreted as
invalid or missing? If missing, AFAIK it would leave it missing.
In which case corrupting the data block itself would indeed have had a
different result than "clearing" the csum, tho simply corrupting the csum
should have resulted in an update.
However, by actually testing you've gone farther than I have, and pending
further info to the contrary, I'll yield to that, changing my own
thoughts on the matter as well, to "I formerly thought... but someone's
testing some versions ago anyway suggested otherwise, so being too lazy
to actually do my own testing, I'll cautiously agree with the results of
his."
=:^)
Thanks. I'd rather find out I was wrong, than not find out! =:^)
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-20 19:59 ` Austin S Hemmelgarn
2015-10-20 20:54 ` Tim Walberg
@ 2015-10-21 11:51 ` Austin S Hemmelgarn
2015-10-21 12:07 ` Austin S Hemmelgarn
1 sibling, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 11:51 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-10-20 15:59, Austin S Hemmelgarn wrote:
> On 2015-10-20 15:20, Duncan wrote:
>> Yes, there's some small but not infinitesimal chance the checksum may be
>> wrong, but if there's two copies of the data and the checksum on one is
>> wrong while the checksum on the other verifies... yes, there's still that
>> small chance that the one that verifies is wrong too, but that it's any
>> worse than the one that does not verify? /That's/ getting close to
>> infinitesimal, or at least close enough for the purposes of a mailing-
>> list claim without links to supporting evidence by someone who has
>> already characterized it as not mathematically rigorous... and for me,
>> personally. I'm not spending any serious time thinking about getting hit
>> by lightning, either, tho by the same token I don't go out flying kites
>> or waving long metal rods around in lightning storms, either.
> With a 32-bit checksum and a 4k block (the math is easier with smaller
> numbers), that's 4128 bits, which means that a random single bit error
> will have an approximately 0.24% chance of occurring in a given bit,
> which translates to an approximately 7.75% chance that it will occur in
> one of the checksum bits. For a 16k block it's smaller of course
> (around 1.8% I think, but that's just a guess), but it's still
> sufficiently statistically likely that it should be considered.
As mentioned in my other reply to this, I did the math wrong (bit of a
difference between kilobit and kilobyte), so here's a (hopefully)
correct and more thorough analysis:
For 4KiB blocks (32768 bits):
There are 32800 bits in total once the 32-bit checksum stored outside
the block is included, so a single random bit error has a ~0.003%
chance of landing on any given bit, which works out to roughly a 0.1%
chance (32/32800) of it landing somewhere in the checksum rather than
in the data.
For 16KiB blocks (131072 bits):
There are 131104 bits in total once the 32-bit checksum stored outside
the block is included, so a single random bit error has a ~0.0008%
chance of landing on any given bit, which works out to roughly a
0.024% chance (32/131104) of it landing somewhere in the checksum
rather than in the data.
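For anyone who wants to re-check that arithmetic, here's a minimal
Python sketch (assuming one uniformly placed single-bit error and a
32-bit checksum stored outside the data block):

    # Probability that a single random bit flip lands in the
    # out-of-block checksum rather than in the data it covers.
    CSUM_BITS = 32

    def csum_hit_chance(block_bytes: int) -> float:
        total_bits = block_bytes * 8 + CSUM_BITS
        return CSUM_BITS / total_bits

    for size in (4096, 16384, 65536):
        print(f"{size // 1024:3d} KiB block -> {csum_hit_chance(size):.4%}")

Run as-is this prints roughly 0.098%, 0.024%, and 0.006%, which lines
up with the figures above and shows how quickly the checksum's share
shrinks as the block size grows.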
This all of course assumes a naive model of how modern block storage
devices work. All modern hard drives and SSDs can, at a minimum,
correct single-bit errors and detect double-bit errors per byte, which
means a triple-bit error would have to land in the same byte before
bad data gets handed back at all; that in turn makes the numbers small
enough that it's impractical to write them out without scientific
notation (on the order of 10^-5).
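As an illustration of that "triple bit error in the same byte" point,
here's a minimal sketch assuming independent bit flips at some raw
per-bit error rate p; the rate used below is purely hypothetical,
since no figure is given in this thread:

    from math import comb

    def uncorrectable_byte_chance(p: float) -> float:
        # Probability that 3 or more of the 8 bits in one byte flip,
        # the minimum needed for a per-byte correct-1/detect-2 scheme
        # to hand back bad data (an upper bound, since some 3+ bit
        # patterns would still be detected).
        return sum(comb(8, k) * p**k * (1 - p) ** (8 - k)
                   for k in range(3, 9))

    # Hypothetical raw bit-error rate, just to show the shape of the
    # number; real drives specify reliability very differently.
    print(uncorrectable_byte_chance(1e-4))

Only the three-or-more-flips terms slip through, which is why the
residual probability ends up so many orders of magnitude below the raw
per-bit rate.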
That in turn assumes zero correlation beyond what's required to get
bad data back from the storage; if there is enough correlation for
that to happen, however, it's statistically likely that there will be
other errors very close by. That makes it more likely that the
checksum is either fully correct or completely wrong, which in turn
raises the chance that the metadata block containing the checksum will
not appear to have an incorrect checksum itself (because checksums are
good at detecting proportionately small errors, but only mediocre at
detecting very big ones).
The relative chances of an error landing in the data versus the
checksum, however, stay roughly the same regardless of how small the
chance of getting any error at all may be. Based on this, the ratio of
checksum size to data size is a tradeoff that needs to be considered:
the closer the ratio is to 1, the higher the chance of an error
landing in the checksum, but the less data you need to verify or
correct when an error does occur.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-21 11:51 ` Austin S Hemmelgarn
@ 2015-10-21 12:07 ` Austin S Hemmelgarn
2015-10-21 16:01 ` Chris Murphy
0 siblings, 1 reply; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 12:07 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2015-10-21 07:51, Austin S Hemmelgarn wrote:
> On 2015-10-20 15:59, Austin S Hemmelgarn wrote:
>> On 2015-10-20 15:20, Duncan wrote:
>>> Yes, there's some small but not infinitesimal chance the checksum may be
>>> wrong, but if there's two copies of the data and the checksum on one is
>>> wrong while the checksum on the other verifies... yes, there's still
>>> that
>>> small chance that the one that verifies is wrong too, but that it's any
>>> worse than the one that does not verify? /That's/ getting close to
>>> infinitesimal, or at least close enough for the purposes of a mailing-
>>> list claim without links to supporting evidence by someone who has
>>> already characterized it as not mathematically rigorous... and for me,
>>> personally. I'm not spending any serious time thinking about getting
>>> hit
>>> by lightning, either, tho by the same token I don't go out flying kites
>>> or waving long metal rods around in lightning storms, either.
>> With a 32-bit checksum and a 4k block (the math is easier with smaller
>> numbers), that's 4128 bits, which means that a random single bit error
>> will have an approximately 0.24% chance of occurring in a given bit,
>> which translates to an approximately 7.75% chance that it will occur in
>> one of the checksum bits. For a 16k block it's smaller of course
>> (around 1.8% I think, but that's just a guess), but it's still
>> sufficiently statistically likely that it should be considered.
> As mentioned in my other reply to this, I did the math wrong (bit of a
> difference between kilobit and kilobyte)
And I realize of course right after sending this that my other reply
didn't get through because GMail refuses to send mail in plain text, no
matter how hard I beat it over the head...
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-21 12:07 ` Austin S Hemmelgarn
@ 2015-10-21 16:01 ` Chris Murphy
2015-10-21 17:28 ` Austin S Hemmelgarn
0 siblings, 1 reply; 15+ messages in thread
From: Chris Murphy @ 2015-10-21 16:01 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Btrfs BTRFS
On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn
<ahferroin7@gmail.com> wrote:
> And I realize of course right after sending this that my other reply didn't
> get through because GMail refuses to send mail in plain text, no matter how
> hard I beat it over the head...
In the web browser version, to the right of the trash can for an email
being written, there is an arrow with a drop-down menu that includes a
"plain text mode" option, which will work. This is often sticky, but
randomly with the btrfs list the replies won't have this option
checked and then they bounce. It's annoying. And then both the Gmail
and Inbox Android apps have no such option, so it's not possible to
reply to list emails from a mobile device short of changing mail
clients just for this purpose.
The smarter thing to do would be server-side conversion of HTML to
plain text, stripping the superfluous formatting. Bouncing mail is
just as bad a UX as Google not providing a plain text option in its
mobile apps.
--
Chris Murphy
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1
2015-10-21 16:01 ` Chris Murphy
@ 2015-10-21 17:28 ` Austin S Hemmelgarn
0 siblings, 0 replies; 15+ messages in thread
From: Austin S Hemmelgarn @ 2015-10-21 17:28 UTC (permalink / raw)
To: Chris Murphy; +Cc: Btrfs BTRFS
On 2015-10-21 12:01, Chris Murphy wrote:
> On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> And I realize of course right after sending this that my other reply didn't
>> get through because GMail refuses to send mail in plain text, no matter how
>> hard I beat it over the head...
>
> In the web browser version, to the right of the trash can for an email
> being written, there is an arrow with a drop-down menu that includes a
> "plain text mode" option, which will work. This is often sticky, but
> randomly with the btrfs list the replies won't have this option
> checked and then they bounce. It's annoying. And then both the Gmail
> and Inbox Android apps have no such option, so it's not possible to
> reply to list emails from a mobile device short of changing mail
> clients just for this purpose.
I actually didn't know about the option in the drop down menu in the
Web-UI, although that wouldn't have been particularly relevant in this
case as I was replying from my phone. What's really annoying in that
case is that the 'Reply Inline' option makes things _look_ like they're
plain text, but they really aren't.
I've considered getting a different mail app, but for some reason the
only one I can find for Android that supports plain text e-mail is K-9
Mail; I'm not too fond of its UI, and it takes way more effort to set
up than I'm willing to put in for something I almost never use anyway
(that, and it doesn't (AFAICT) support S/MIME or Hashcash, although
GMail doesn't either, so that one's not a show stopper).
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2015-10-21 17:29 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey
2015-10-20 4:45 ` Russell Coker
2015-10-20 13:00 ` Austin S Hemmelgarn
2015-10-20 13:15 ` Russell Coker
2015-10-20 13:59 ` Austin S Hemmelgarn
2015-10-20 19:20 ` Duncan
2015-10-20 19:59 ` Austin S Hemmelgarn
2015-10-20 20:54 ` Tim Walberg
2015-10-21 11:51 ` Austin S Hemmelgarn
2015-10-21 12:07 ` Austin S Hemmelgarn
2015-10-21 16:01 ` Chris Murphy
2015-10-21 17:28 ` Austin S Hemmelgarn
2015-10-20 18:54 ` Duncan
2015-10-20 19:48 ` Austin S Hemmelgarn
2015-10-20 21:24 ` Duncan