* Expected behavior of bad sectors on one drive in a RAID1 @ 2015-10-20 4:16 james harvey 2015-10-20 4:45 ` Russell Coker 2015-10-20 18:54 ` Duncan 0 siblings, 2 replies; 15+ messages in thread From: james harvey @ 2015-10-20 4:16 UTC (permalink / raw) To: linux-btrfs Background ----- My fileserver had a "bad event" last week. Shut it down normally to add a new hard drive, and it would no longer post. Tried about 50 times, doing the typical everything non-essential unplugged, trying 1 of 4 memory modules at a time, and 1 of 2 processors at a time. Got no where. Inexpensive HP workstation, so purchased a used identical model (complete other than hard drives) on eBay. Replacement arrived today. Posts fine. Moved hard drives over (again, identical model, and Arch Linux not Windows) and it started giving "Watchdog detected hard LOCKUP" type errors I've never seen before. Decided I'd diagnose which part in the original server was bad. By sitting turned off for a week, it suddenly started posting just fine. But, with the hard drives back in it, I'm getting the same hard lockup errors. An Arch ISO DVD runs stress testing perfectly. Btrfs-specific ----- The current problem I'm having must be a bad hard drive or corrupted data. 3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of data, and 1GB of the 1GB of metadata. sda appears to be going bad, with my low threshold of "going bad", and will be replaced ASAP. It just developed 16 reallocated sectors, and has 40 current pending sectors. I'm currently running a "btrfs scrub start -B -d -r /terra", which status on another term shows me has found 32 errors after running for an hour. Question 1 - I'm expecting if I re-run the scrub without the read-only option, that it will detect from the checksum data which sector is correct, and re-write to the drive with bad sectors the data to a new sector. Correct? Question 2 - Before having ran the scrub, booting off the raid with bad sectors, would btrfs "on the fly" recognize it was getting bad sector data with the checksum being off, and checking the other drives? Or, is it expected that I could get a bad sector read in a critical piece of operating system and/or kernel, which could be causing my lockup issues? Question 3 - Probably doesn't matter, but how can I see which files (or metadata to files) the 40 current bad sectors are in? (On extX, I'd use tune2fs and debugfs to be able to see this information.) I do have hourly snapshots, from when it was properly running, so once I'm that far in the process, I can also compare the most recent snapshots, and see if there's any changes that happened to files that shouldn't have. ^ permalink raw reply [flat|nested] 15+ messages in thread
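For reference, the commands involved in the workflow James describes, collected in one place (a sketch; the device name and mount point are taken from his message, run as root):

    # the SMART counters mentioned above
    smartctl -A /dev/sda | grep -Ei 'reallocated_sector|current_pending'
    # read-only scrub: report checksum errors, change nothing
    btrfs scrub start -B -d -r /terra
    # repair scrub: rewrites copies that fail verification from the good mirror
    btrfs scrub start -B -d /terra
    # progress and error counts, from another terminal
    btrfs scrub status /terra

As discussed further down the thread, the repair pass only helps where the second copy is intact.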
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey @ 2015-10-20 4:45 ` Russell Coker 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 18:54 ` Duncan 1 sibling, 1 reply; 15+ messages in thread From: Russell Coker @ 2015-10-20 4:45 UTC (permalink / raw) To: james harvey; +Cc: linux-btrfs On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote: > sda appears to be going bad, with my low threshold of "going bad", and > will be replaced ASAP. It just developed 16 reallocated sectors, and > has 40 current pending sectors. > > I'm currently running a "btrfs scrub start -B -d -r /terra", which > status on another term shows me has found 32 errors after running for > an hour. https://www.gnu.org/software/ddrescue/ At this stage I would use ddrescue or something similar to copy data from the failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing data. I wouldn't remove the disk entirely because then you lose badly if you get another failure. I wouldn't use a BTRFS replace because you already have the system apart and I expect ddrescue could copy the data faster. Also as the drive has been causing system failures (I'm guessing a problem with the power connector) you REALLY don't want BTRFS to corrupt data on the other disks. If you have a system with the failing disk and a new disk attached then there's no risk of further contamination. > Question 2 - Before having ran the scrub, booting off the raid with > bad sectors, would btrfs "on the fly" recognize it was getting bad > sector data with the checksum being off, and checking the other > drives? Or, is it expected that I could get a bad sector read in a > critical piece of operating system and/or kernel, which could be > causing my lockup issues? Unless you have disabled CoW then BTRFS will not return bad data. > Question 3 - Probably doesn't matter, but how can I see which files > (or metadata to files) the 40 current bad sectors are in? (On extX, > I'd use tune2fs and debugfs to be able to see this information.) Read all the files in the system and syslog will report it. But really don't do that until after you have copied the disk. > I do have hourly snapshots, from when it was properly running, so once > I'm that far in the process, I can also compare the most recent > snapshots, and see if there's any changes that happened to files that > shouldn't have. Snapshots refer to the same data blocks, so if a data block is corrupted in a way that BTRFS doesn't notice (which should be almost impossible) then all snapshots will have it. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 15+ messages in thread
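A minimal ddrescue invocation for the copy step Russell suggests (a sketch; /dev/sdX stands for the fresh disk, and the map file should live on some other filesystem):

    # first pass: copy everything that reads cleanly, skip the slow scraping of bad areas
    ddrescue -f -n /dev/sda /dev/sdX /root/sda-rescue.map
    # second pass: go back and retry the bad areas a few times
    ddrescue -f -r3 /dev/sda /dev/sdX /root/sda-rescue.map

The caveat in the next message applies: keep the old drive out of the kernel's sight before mounting the copy.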
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:45 ` Russell Coker @ 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 13:15 ` Russell Coker 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 13:00 UTC (permalink / raw) To: Russell Coker, james harvey; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3331 bytes --] On 2015-10-20 00:45, Russell Coker wrote: > On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote: >> sda appears to be going bad, with my low threshold of "going bad", and >> will be replaced ASAP. It just developed 16 reallocated sectors, and >> has 40 current pending sectors. >> >> I'm currently running a "btrfs scrub start -B -d -r /terra", which >> status on another term shows me has found 32 errors after running for >> an hour. > > https://www.gnu.org/software/ddrescue/ > > At this stage I would use ddrescue or something similar to copy data from the > failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing > data. > > I wouldn't remove the disk entirely because then you lose badly if you get > another failure. I wouldn't use a BTRFS replace because you already have the > system apart and I expect ddrescue could copy the data faster. Also as the > drive has been causing system failures (I'm guessing a problem with the power > connector) you REALLY don't want BTRFS to corrupt data on the other disks. If > you have a system with the failing disk and a new disk attached then there's > no risk of further contamination. BIG DISCLAIMER: For the filesystem to be safely mountable it is ABSOLUTELY NECESSARY to remove the old disk after doing a block level copy of it. By all means, keep the disk around, but do not keep it visible to the kernel after doing a block level copy of it. Also, you will probably have to run 'btrfs device scan' after copying the disk and removing it for the filesystem to work right. This is an inherent result of how BTRFS's multi-device functionality works, and also applies to doing stuff like LVM snapshots of BTRFS filesystems. > >> Question 2 - Before having ran the scrub, booting off the raid with >> bad sectors, would btrfs "on the fly" recognize it was getting bad >> sector data with the checksum being off, and checking the other >> drives? Or, is it expected that I could get a bad sector read in a >> critical piece of operating system and/or kernel, which could be >> causing my lockup issues? > > Unless you have disabled CoW then BTRFS will not return bad data. It is worth clarifying also that: a. While BTRFS will not return bad data in this case, it also won't automatically repair the corruption. b. In the unlikely event that both copies are bad, trying to read the data will return an IO error. c. It is theoretically possible (although statistically impossible) that the block could become corrupted, but the checksum could still be correct (CRC32c is good at detecting small errors, but it's not hard to generate a hash collision for any arbitrary value, so if a large portion of the block goes bad, then it can theoretically still have a valid checksum). > >> Question 3 - Probably doesn't matter, but how can I see which files >> (or metadata to files) the 40 current bad sectors are in? (On extX, >> I'd use tune2fs and debugfs to be able to see this information.) > > Read all the files in the system and syslog will report it. But really don't > do that until after you have copied the disk. 
It may also be possible to use some of the debug tools from BTRFS to do this without hitting the disks so hard, but it will likely take a lot more effort. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
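In practice, the sequence Austin describes looks something like this (a sketch; it assumes the block-level copy has already been made and the failing drive has been physically detached or otherwise hidden from the kernel, with /dev/sdX standing in for whatever name the new disk gets):

    # re-register the devices that belong to the filesystem
    btrfs device scan
    # confirm the copied-to disk is listed and nothing is reported missing
    btrfs filesystem show
    mount /dev/sdX /terra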
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:00 ` Austin S Hemmelgarn @ 2015-10-20 13:15 ` Russell Coker 2015-10-20 13:59 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Russell Coker @ 2015-10-20 13:15 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: james harvey, linux-btrfs On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote: > > https://www.gnu.org/software/ddrescue/ > > > > At this stage I would use ddrescue or something similar to copy data from > > the failing disk to a fresh disk, then do a BTRFS scrub to regenerate > > the missing data. > > > > I wouldn't remove the disk entirely because then you lose badly if you > > get another failure. I wouldn't use a BTRFS replace because you already > > have the system apart and I expect ddrescue could copy the data faster. > > Also as the drive has been causing system failures (I'm guessing a > > problem with the power connector) you REALLY don't want BTRFS to corrupt > > data on the other disks. If you have a system with the failing disk and > > a new disk attached then there's no risk of further contamination. > > BIG DISCLAIMER: For the filesystem to be safely mountable it is > ABSOLUTELY NECESSARY to remove the old disk after doing a block level You are correct, my message wasn't clear. What I meant to say is that doing a "btrfs device remove" or "btrfs replace" is generally a bad idea in such a situation. "btrfs replace" is pretty good if you are replacing a disk with a larger one or replacing a disk that has only minor errors (a disk that just gets a few bad sectors is unlikely to get many more in a hurry). > copy of it. By all means, keep the disk around, but do not keep it > visible to the kernel after doing a block level copy of it. Also, you > will probably have to run 'btrfs device scan' after copying the disk and > removing it for the filesystem to work right. This is an inherent > result of how BTRFS's multi-device functionality works, and also applies > to doing stuff like LVM snapshots of BTRFS filesystems. Good advice. I recommend just rebooting the system. I think that if anyone who has the background knowledge to do such things without rebooting will probably just do it without needing to ask us for advice. > >> Question 2 - Before having ran the scrub, booting off the raid with > >> bad sectors, would btrfs "on the fly" recognize it was getting bad > >> sector data with the checksum being off, and checking the other > >> drives? Or, is it expected that I could get a bad sector read in a > >> critical piece of operating system and/or kernel, which could be > >> causing my lockup issues? > > > > Unless you have disabled CoW then BTRFS will not return bad data. > > It is worth clarifying also that: > a. While BTRFS will not return bad data in this case, it also won't > automatically repair the corruption. Really? If so I think that's a bug in BTRFS. When mounted rw I think that every time corruption is discovered it should be automatically fixed. > b. In the unlikely event that both copies are bad, trying to read the > data will return an IO error. > c. It is theoretically possible (although statistically impossible) that > the block could become corrupted, but the checksum could still be > correct (CRC32c is good at detecting small errors, but it's not hard to > generate a hash collision for any arbitrary value, so if a large portion > of the block goes bad, then it can theoretically still have a valid > checksum). 
It would be interesting to see some research into how CRC32 fits with the more common disk errors. For a disk to return bad data and claim it to be good the data must either be a misplaced write or read (which is almost certain to be caught by BTRFS as the metadata won't match), or a random sector that matches the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC protected block much more difficult? > >> Question 3 - Probably doesn't matter, but how can I see which files > >> (or metadata to files) the 40 current bad sectors are in? (On extX, > >> I'd use tune2fs and debugfs to be able to see this information.) > > > > Read all the files in the system and syslog will report it. But really > > don't do that until after you have copied the disk. > > It may also be possible to use some of the debug tools from BTRFS to do > this without hitting the disks so hard, but it will likely take a lot > more effort. I don't think that you can do that without hitting the disks hard. That said last time I checked (last time an executive of a hard drive manufacturer was willing to talk to me) drives were apparently designed to perform any sequence of operations for their warranty period. So for a disk that is believed to be good this shouldn't be a problem. For a disk that is known to be dying it would be a really bad idea to do anything other than copy the data off at maximum speed. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 15+ messages in thread
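If you do go the read-everything route once the disk has been copied, it is only a couple of commands (a sketch; the mount point is from the original post and the exact kernel log wording varies between kernel versions):

    # read every file and throw the data away; checksum failures land in the kernel log
    find /terra -xdev -type f -exec cat {} + > /dev/null
    dmesg | grep -i 'csum failed'
    # per-device error counters kept by btrfs
    btrfs device stats /terra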
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:15 ` Russell Coker @ 2015-10-20 13:59 ` Austin S Hemmelgarn 2015-10-20 19:20 ` Duncan 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 13:59 UTC (permalink / raw) To: Russell Coker; +Cc: james harvey, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 7425 bytes --] On 2015-10-20 09:15, Russell Coker wrote: > On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote: >>> https://www.gnu.org/software/ddrescue/ >>> >>> At this stage I would use ddrescue or something similar to copy data from >>> the failing disk to a fresh disk, then do a BTRFS scrub to regenerate >>> the missing data. >>> >>> I wouldn't remove the disk entirely because then you lose badly if you >>> get another failure. I wouldn't use a BTRFS replace because you already >>> have the system apart and I expect ddrescue could copy the data faster. >>> Also as the drive has been causing system failures (I'm guessing a >>> problem with the power connector) you REALLY don't want BTRFS to corrupt >>> data on the other disks. If you have a system with the failing disk and >>> a new disk attached then there's no risk of further contamination. >> >> BIG DISCLAIMER: For the filesystem to be safely mountable it is >> ABSOLUTELY NECESSARY to remove the old disk after doing a block level > > You are correct, my message wasn't clear. > > What I meant to say is that doing a "btrfs device remove" or "btrfs replace" > is generally a bad idea in such a situation. "btrfs replace" is pretty good > if you are replacing a disk with a larger one or replacing a disk that has > only minor errors (a disk that just gets a few bad sectors is unlikely to get > many more in a hurry). I kind of figured that was what you meant, I just wanted to make it as clear as possible, because this is something that has bitten me in the past. It's worth noting though that there is an option for 'btrfs replace' to avoid reading from the device being replaced if at all possible. I've used that option myself a couple of times when re-provisioning my systems, and it works well (although I used it to just control what disks were getting IO sent to them, not because any of the were bad). > >> copy of it. By all means, keep the disk around, but do not keep it >> visible to the kernel after doing a block level copy of it. Also, you >> will probably have to run 'btrfs device scan' after copying the disk and >> removing it for the filesystem to work right. This is an inherent >> result of how BTRFS's multi-device functionality works, and also applies >> to doing stuff like LVM snapshots of BTRFS filesystems. > > Good advice. I recommend just rebooting the system. I think that if anyone > who has the background knowledge to do such things without rebooting will > probably just do it without needing to ask us for advice. Normally I would agree, but given the boot issues that were mentioned WRT the system in question, it may be safer to just use 'btrfs dev scan' without rebooting (unless of course the system doesn't properly support SATA hot-plug/hot-remove). > >>>> Question 2 - Before having ran the scrub, booting off the raid with >>>> bad sectors, would btrfs "on the fly" recognize it was getting bad >>>> sector data with the checksum being off, and checking the other >>>> drives? Or, is it expected that I could get a bad sector read in a >>>> critical piece of operating system and/or kernel, which could be >>>> causing my lockup issues? 
>>> >>> Unless you have disabled CoW then BTRFS will not return bad data. >> >> It is worth clarifying also that: >> a. While BTRFS will not return bad data in this case, it also won't >> automatically repair the corruption. > > Really? If so I think that's a bug in BTRFS. When mounted rw I think that > every time corruption is discovered it should be automatically fixed. That's debatable. While it is safer to try and do this with BTRFS than say with MD-RAID, it's still not something many seasoned system administrators would want happening behind their back. It's worth noting that ZFS does not automatically fix errors, it just reports them and works around them, and many distributed storage options (like Ceph for example) behave like this also. All that the checksum mismatch really tells you is that at some point, the data got corrupted, it could be that the copy on the disk is bad, but it could also be caused by bad RAM, a bad storage controller, a loose cable, or even a bad power supply. > >> b. In the unlikely event that both copies are bad, trying to read the >> data will return an IO error. >> c. It is theoretically possible (although statistically impossible) that >> the block could become corrupted, but the checksum could still be >> correct (CRC32c is good at detecting small errors, but it's not hard to >> generate a hash collision for any arbitrary value, so if a large portion >> of the block goes bad, then it can theoretically still have a valid >> checksum). > > It would be interesting to see some research into how CRC32 fits with the more > common disk errors. For a disk to return bad data and claim it to be good the > data must either be a misplaced write or read (which is almost certain to be > caught by BTRFS as the metadata won't match), or a random sector that matches > the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC > protected block much more difficult? In general, most disk errors will be just a few flipped bits. For a single bit flip in a data stream, a CRC is 100% guaranteed to change, the same goes for any odd number of bit flips in the data stream. For an even number of bit flips however, the chance that there will be a collision is proportionate to the size of the CRC, and for 32-bits it's a statistical impossibility that there will be a collision due to two bits flipping without there being some malicious intent involved. Once you get to larger numbers of bit flips and bigger blocks of data, it becomes more likely. The chances of a collision with a 4k block with any random set of bit flips is astronomically small, and it's only marginally larger with 16k blocks (which are the default right now for BTRFS). > >>>> Question 3 - Probably doesn't matter, but how can I see which files >>>> (or metadata to files) the 40 current bad sectors are in? (On extX, >>>> I'd use tune2fs and debugfs to be able to see this information.) >>> >>> Read all the files in the system and syslog will report it. But really >>> don't do that until after you have copied the disk. >> >> It may also be possible to use some of the debug tools from BTRFS to do >> this without hitting the disks so hard, but it will likely take a lot >> more effort. > > I don't think that you can do that without hitting the disks hard. Ah, you're right, I forgot that there's no way on most hard disks to get the LBA's of the reallocated sectors, which would be required to use the debug tools to get the files. 
> > That said last time I checked (last time an executive of a hard drive > manufacturer was willing to talk to me) drives were apparently designed to > perform any sequence of operations for their warranty period. So for a disk > that is believed to be good this shouldn't be a problem. For a disk that is > known to be dying it would be a really bad idea to do anything other than copy > the data off at maximum speed. Well yes, but the less stress you put on something, the longer it's likely to last. And if you actually care about the data, you should have backups (or some other way of trivially reproducing it) [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
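For completeness, the replace variant Austin mentions, which reads from the failing device only when no good mirror exists (a sketch; /dev/sdX is the new disk and /terra the mount point from the original post):

    btrfs replace start -r /dev/sda /dev/sdX /terra
    btrfs replace status /terra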
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:59 ` Austin S Hemmelgarn @ 2015-10-20 19:20 ` Duncan 2015-10-20 19:59 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Duncan @ 2015-10-20 19:20 UTC (permalink / raw) To: linux-btrfs Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as excerpted: >>> It is worth clarifying also that: >>> a. While BTRFS will not return bad data in this case, it also won't >>> automatically repair the corruption. >> >> Really? If so I think that's a bug in BTRFS. When mounted rw I think >> that every time corruption is discovered it should be automatically >> fixed. > That's debatable. While it is safer to try and do this with BTRFS than > say with MD-RAID, it's still not something many seasoned system > administrators would want happening behind their back. It's worth > noting that ZFS does not automatically fix errors, it just reports them > and works around them, and many distributed storage options (like Ceph > for example) behave like this also. All that the checksum mismatch > really tells you is that at some point, the data got corrupted, it could > be that the copy on the disk is bad, but it could also be caused by bad > RAM, a bad storage controller, a loose cable, or even a bad power > supply. There's a significant difference between btrfs in dup/raid1/raid10 modes anyway and some of the others you mentioned, however. Btrfs in these modes actually has a second copy of the data itself available. That's a world of difference compared to parity, for instance. With parity you're reconstructing the data and thus have dangers such as the write hole, and the possibility of bad-ram corrupting the data before it was ever saved (this last one being the reason zfs has such strong recommendations/ warnings regarding the use of non-ecc RAM, based on what a number of posters with zfs experience have said, here). With btrfs, there's an actual second copy, with both copies covered by checksum. If one of the copies verifies against its checksum and the other doesn't, the odds of the one that verifies being any worse than the one that doesn't are... pretty slim, to say the least. (So slim I'd intuitively compare them to the odds of getting hit by lightning, tho I've no idea what the mathematically rigorous comparison might be.) Yes, there's some small but not infinitesimal chance the checksum may be wrong, but if there's two copies of the data and the checksum on one is wrong while the checksum on the other verifies... yes, there's still that small chance that the one that verifies is wrong too, but that it's any worse than the one that does not verify? /That's/ getting close to infinitesimal, or at least close enough for the purposes of a mailing- list claim without links to supporting evidence by someone who has already characterized it as not mathematically rigorous... and for me, personally. I'm not spending any serious time thinking about getting hit by lightening, either, tho by the same token I don't go out flying kites or waving long metal rods around in lightning storms, either. Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable or mature, and that the chances of just plain old bugs killing the filesystem are far *FAR* higher than of a verified-checksum copy being any worse than a failed-checksum copy. If you're worried about that at this point, why are you even on the btrfs list in the first place? -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:20 ` Duncan @ 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 0 siblings, 2 replies; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 19:59 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4604 bytes --] On 2015-10-20 15:20, Duncan wrote: > Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as > excerpted: > > >>>> It is worth clarifying also that: >>>> a. While BTRFS will not return bad data in this case, it also won't >>>> automatically repair the corruption. >>> >>> Really? If so I think that's a bug in BTRFS. When mounted rw I think >>> that every time corruption is discovered it should be automatically >>> fixed. >> That's debatable. While it is safer to try and do this with BTRFS than >> say with MD-RAID, it's still not something many seasoned system >> administrators would want happening behind their back. It's worth >> noting that ZFS does not automatically fix errors, it just reports them >> and works around them, and many distributed storage options (like Ceph >> for example) behave like this also. All that the checksum mismatch >> really tells you is that at some point, the data got corrupted, it could >> be that the copy on the disk is bad, but it could also be caused by bad >> RAM, a bad storage controller, a loose cable, or even a bad power >> supply. > > There's a significant difference between btrfs in dup/raid1/raid10 modes > anyway and some of the others you mentioned, however. Btrfs in these > modes actually has a second copy of the data itself available. That's a > world of difference compared to parity, for instance. With parity you're > reconstructing the data and thus have dangers such as the write hole, and > the possibility of bad-ram corrupting the data before it was ever saved > (this last one being the reason zfs has such strong recommendations/ > warnings regarding the use of non-ecc RAM, based on what a number of > posters with zfs experience have said, here). With btrfs, there's an > actual second copy, with both copies covered by checksum. If one of the > copies verifies against its checksum and the other doesn't, the odds of > the one that verifies being any worse than the one that doesn't are... > pretty slim, to say the least. (So slim I'd intuitively compare them to > the odds of getting hit by lightning, tho I've no idea what the > mathematically rigorous comparison might be.) ZFS doesn't just do parity, it also does RAID1 and RAID10 (and RAID0, although I doubt that most people actually use that with ZFS), and Ceph uses n-way replication by default, not erasure coding (which is technically a super-set of the parity algorithms used for RAID[56]). In both cases, they behave just like BTRFS, they log the error and fetch a good copy to return to userspace, but do not modify the copy with the error unless explicitly told to do so. > > Yes, there's some small but not infinitesimal chance the checksum may be > wrong, but if there's two copies of the data and the checksum on one is > wrong while the checksum on the other verifies... yes, there's still that > small chance that the one that verifies is wrong too, but that it's any > worse than the one that does not verify? 
/That's/ getting close to > infinitesimal, or at least close enough for the purposes of a mailing- > list claim without links to supporting evidence by someone who has > already characterized it as not mathematically rigorous... and for me, > personally. I'm not spending any serious time thinking about getting hit > by lightening, either, tho by the same token I don't go out flying kites > or waving long metal rods around in lightning storms, either. With a 32-bit checksum and a 4k block (the math is easier with smaller numbers), that's 4128 bits, which means that a random single bit error will have a approximately 0.24% chance of occurring in a given bit, which translates to an approximately 7.75% chance that it will occur in one of the checksum bits. For a 16k block it's smaller of course (around 1.8% I think, but that's just a guess), but it's still sufficiently statistically likely that it should be considered. > > Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable > or mature, and that the chances of just plain old bugs killing the > filesystem are far *FAR* higher than of a verified-checksum copy being > any worse than a failed-checksum copy. If you're worried about that at > this point, why are you even on the btrfs list in the first place? Actually, the improved data safety relative to ext4 is just a bonus for me, my biggest reason for using BTRFS is the ease of reprovisioning (there are few other ways to move entire systems to new storage devices online with zero downtime). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:59 ` Austin S Hemmelgarn @ 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 1 sibling, 0 replies; 15+ messages in thread From: Tim Walberg @ 2015-10-20 20:54 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Duncan, linux-btrfs On 10/20/2015 15:59 -0400, Austin S Hemmelgarn wrote: >> ......... >> With a 32-bit checksum and a 4k block (the math is easier with >> smaller numbers), that's 4128 bits, which means that a random >> single bit error will have a approximately 0.24% chance of >> occurring in a given bit, which translates to an approximately >> 7.75% chance that it will occur in one of the checksum bits. For a >> 16k block it's smaller of course (around 1.8% I think, but that's >> just a guess), but it's still sufficiently statistically likely >> that it should be considered. >> ......... Last I checked, a 4 kilo-BYTE block consisted of 32768 BITs... So the percentages should in fact be considerably smaller than that. -- twalberg@gmail.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg @ 2015-10-21 11:51 ` Austin S Hemmelgarn 2015-10-21 12:07 ` Austin S Hemmelgarn 1 sibling, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 11:51 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3763 bytes --] On 2015-10-20 15:59, Austin S Hemmelgarn wrote: > On 2015-10-20 15:20, Duncan wrote: >> Yes, there's some small but not infinitesimal chance the checksum may be >> wrong, but if there's two copies of the data and the checksum on one is >> wrong while the checksum on the other verifies... yes, there's still that >> small chance that the one that verifies is wrong too, but that it's any >> worse than the one that does not verify? /That's/ getting close to >> infinitesimal, or at least close enough for the purposes of a mailing- >> list claim without links to supporting evidence by someone who has >> already characterized it as not mathematically rigorous... and for me, >> personally. I'm not spending any serious time thinking about getting hit >> by lightening, either, tho by the same token I don't go out flying kites >> or waving long metal rods around in lightning storms, either. > With a 32-bit checksum and a 4k block (the math is easier with smaller > numbers), that's 4128 bits, which means that a random single bit error > will have a approximately 0.24% chance of occurring in a given bit, > which translates to an approximately 7.75% chance that it will occur in > one of the checksum bits. For a 16k block it's smaller of course > (around 1.8% I think, but that's just a guess), but it's still > sufficiently statistically likely that it should be considered. As mentioned in my other reply to this, I did the math wrong (bit of a difference between kilobit and kilobyte), so here's a (hopefully) correct and more thorough analysis: For 4kb blocks (32768 bits): There are a total of 32800 bits when including a 32 bit checksum outside the block, this makes the chance of a single bit error in either the block or the checksum ~0.30%. This in turn means an approximately 9.7% chance of a single bit error in the checksum. For 16kb blocks (131072 bits): There are a total of 131104 bits when including a 32 bit checksum outside the block, this makes the chance of a single bit error in either the block or the checksum ~0.07%. This in turn means an approximately 2.4% chance of a single bit error in the checksum. This all of course assumes a naive interpretation of how modern block storage devices work. All modern hard drives and SSD's include at a minimum the ability to correct single bit errors per byte, and detect double bit errors per byte, which means that we need a triple bit error in the same byte to get bad data back, which in turn makes the numbers small enough that it's impractical to represent them without scientific notation (on the order of 10^-5). That in turn assumes zero correlation beyond what's required to get bad data back from the storage, however, if there is enough correlation for that to happen, it's statistically likely that there will be other errors very close by. 
This in turn means that it's more likely that the checksum is either correct or absolutely completely wrong, which increases the chances that the resultant metadata block containing the checksum will not appear to have an incorrect checksum itself (because checksums are good at detecting proportionately small errors, but only mediocre at detecting very big errors). The approximate proportionate chances of an error in the data versus the checksum, however, are still roughly the same, irrespective of how small the chances of getting any error are. Based on this, the ratio of the size of the checksum to the size of the data is a tradeoff that needs to be considered: the closer the ratio is to 1, the higher the chance of having an error in the checksum, but the less data you need to correct/verify when there is an error. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
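For anyone who wants to reproduce the ratio being discussed: under the simple model used in this subthread (one uniformly random bit flip, a 32-bit checksum counted alongside the block it protects), the share of flips that land in the checksum is just 32 / (block_bits + 32). A one-liner that prints it for both block sizes, as a percentage:

    awk 'BEGIN { for (bits = 32768; bits <= 131072; bits *= 4) printf "%6d data bits: %.4f%% of the protected bits are checksum bits\n", bits, 100 * 32 / (bits + 32) }'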
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 11:51 ` Austin S Hemmelgarn @ 2015-10-21 12:07 ` Austin S Hemmelgarn 2015-10-21 16:01 ` Chris Murphy 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 12:07 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1795 bytes --] On 2015-10-21 07:51, Austin S Hemmelgarn wrote: > On 2015-10-20 15:59, Austin S Hemmelgarn wrote: >> On 2015-10-20 15:20, Duncan wrote: >>> Yes, there's some small but not infinitesimal chance the checksum may be >>> wrong, but if there's two copies of the data and the checksum on one is >>> wrong while the checksum on the other verifies... yes, there's still >>> that >>> small chance that the one that verifies is wrong too, but that it's any >>> worse than the one that does not verify? /That's/ getting close to >>> infinitesimal, or at least close enough for the purposes of a mailing- >>> list claim without links to supporting evidence by someone who has >>> already characterized it as not mathematically rigorous... and for me, >>> personally. I'm not spending any serious time thinking about getting >>> hit >>> by lightening, either, tho by the same token I don't go out flying kites >>> or waving long metal rods around in lightning storms, either. >> With a 32-bit checksum and a 4k block (the math is easier with smaller >> numbers), that's 4128 bits, which means that a random single bit error >> will have a approximately 0.24% chance of occurring in a given bit, >> which translates to an approximately 7.75% chance that it will occur in >> one of the checksum bits. For a 16k block it's smaller of course >> (around 1.8% I think, but that's just a guess), but it's still >> sufficiently statistically likely that it should be considered. > As mentioned in my other reply to this, I did the math wrong (bit of a > difference between kilobit and kilobyte) And I realize of course right after sending this that my other reply didn't get through because GMail refuses to send mail in plain text, no matter how hard I beat it over the head... [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 12:07 ` Austin S Hemmelgarn @ 2015-10-21 16:01 ` Chris Murphy 2015-10-21 17:28 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Chris Murphy @ 2015-10-21 16:01 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Btrfs BTRFS On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote: > And I realize of course right after sending this that my other reply didn't > get through because GMail refuses to send mail in plain text, no matter how > hard I beat it over the head... In the web browser version, to the right of the trash can for an email being written, there is an arrow with a drop down menu that includes "plain text mode" option which will work. This is often sticky, but randomly with the btrfs list the replies won't have this option checked and then they bounce. It's annoying. And then both the Gmail and Inbox Android apps have no such option so it's not possible reply to list emails from a mobile device short of changing mail clients just for this purpose. The smarter thing to do is server side conversion of HTML to plain text, stripping superfluous formatting. Bouncing mails is just as bad a UX as Google not providing a plain text option in their mobile apps. -- Chris Murphy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 16:01 ` Chris Murphy @ 2015-10-21 17:28 ` Austin S Hemmelgarn 0 siblings, 0 replies; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 17:28 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1617 bytes --] On 2015-10-21 12:01, Chris Murphy wrote: > On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn > <ahferroin7@gmail.com> wrote: >> And I realize of course right after sending this that my other reply didn't >> get through because GMail refuses to send mail in plain text, no matter how >> hard I beat it over the head... > > In the web browser version, to the right of the trash can for an email > being written, there is an arrow with a drop down menu that includes > "plain text mode" option which will work. This is often sticky, but > randomly with the btrfs list the replies won't have this option > checked and then they bounce. It's annoying. And then both the Gmail > and Inbox Android apps have no such option so it's not possible reply > to list emails from a mobile device short of changing mail clients > just for this purpose. I actually didn't know about the option in the drop down menu in the Web-UI, although that wouldn't have been particularly relevant in this case as I was replying from my phone. What's really annoying in that case is that the 'Reply Inline' option makes things _look_ like they're plain text, but they really aren't. I've considered getting a different mail app, but for some reason the only one I can find for Android that supports plain text e-mail is K-9 Mail, and I'm not too fond of the UI for that, and it takes way more effort to set up than I'm willing to put in for something I almost never use anyway (that and it doesn't (AFAICT) support S/MIME or Hashcash, although GMail doesn't either, so that one's not a show stopper). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey 2015-10-20 4:45 ` Russell Coker @ 2015-10-20 18:54 ` Duncan 2015-10-20 19:48 ` Austin S Hemmelgarn 1 sibling, 1 reply; 15+ messages in thread From: Duncan @ 2015-10-20 18:54 UTC (permalink / raw) To: linux-btrfs james harvey posted on Tue, 20 Oct 2015 00:16:15 -0400 as excerpted: > Background ----- > > My fileserver had a "bad event" last week. Shut it down normally to add > a new hard drive, and it would no longer post. Tried about 50 times, > doing the typical everything non-essential unplugged, trying 1 of 4 > memory modules at a time, and 1 of 2 processors at a time. Got no > where. > > Inexpensive HP workstation, so purchased a used identical model > (complete other than hard drives) on eBay. Replacement arrived today. > Posts fine. Moved hard drives over (again, identical model, and Arch > Linux not Windows) and it started giving "Watchdog detected hard LOCKUP" > type errors I've never seen before. > > Decided I'd diagnose which part in the original server was bad. By > sitting turned off for a week, it suddenly started posting just fine. > But, with the hard drives back in it, I'm getting the same hard lockup > errors. > > An Arch ISO DVD runs stress testing perfectly. > > Btrfs-specific ----- > > The current problem I'm having must be a bad hard drive or corrupted > data. > > 3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of > data, and 1GB of the 1GB of metadata. > > sda appears to be going bad, with my low threshold of "going bad", and > will be replaced ASAP. It just developed 16 reallocated sectors, and > has 40 current pending sectors. > > I'm currently running a "btrfs scrub start -B -d -r /terra", which > status on another term shows me has found 32 errors after running for an > hour. > > Question 1 - I'm expecting if I re-run the scrub without the read-only > option, that it will detect from the checksum data which sector is > correct, and re-write to the drive with bad sectors the data to a new > sector. Correct? I actually ran a number of independent btrfs raid1 filesystems[1] on a pair of ssds, with one of the ssds slowly dying, with more and more reallocated sectors over time, for something like six months.[2] SMART started with a 254 "cooked" value for reallocated sectors, immediately dropped to what was apparently the percentage still good (still rounding to 100) on first sector replace (according to raw value), and dropped to about 85 (again, %) during the continued usage time, with a threshold value of IIRC 36, so I never came close on that value, tho the raw-read- error-rate value dropped into failing-now a couple times near the end, when I'd do scrubs and get dozens of reallocated sectors in just a few minutes, but it'd recover on reboot and report failing-in-the-past, and it wouldn't trip into failing mode unless I had the system off for awhile and then did a scrub of several of those independent btrfs in quick succession. Anyway, yes, as long as the other copy is good, btrfs scrub does fix up the problems without much pain beyond the wait time (which was generally under a minute per btrfs, all under 50 gig each, on the ssds). Tho I should mention: If btrfs returns any unverified errors, rerun the scrub again, and it'll likely fix more. 
I'm not absolutely sure what these actually are in btrfs terms, but I took them to be places where metadata checksum errors occurred, where that metadata in turn had checksums of data and metadata further down (up?) the tree, closer to the data. Only after those metadata blocks were scrubbed in an early pass, could a later pass actually verify their checksums and thus rely on the checksums they in turn contained, for metadata blocks closer to the data or for the data itself. Sometimes I'd end up rerunning scrub a few times (never more that five, IIRC, however), almost always correcting less errors each time, tho it'd occasionally jump up a bit for one pass, before dropping again on the one after that. But rerun scrubs returning unverified errors and you should eventually fix everything, assuming of course that the second copy is always valid. Obviously this was rather easier for me, however, at under a minute per filesystem scrub run and generally under 15 minutes total for the multiple runs on multiple filesystems (tho I didn't always verify /all/ btrfs, only the ones I normally mounted), than it's going to be for you, at over an hour reported and still going. At hours per run, it'll require some patience... I had absolutely zero scrub failures here, because as I said my second ssd was (and remains) absolutely solid). > Question 2 - Before having ran the scrub, booting off the raid with bad > sectors, would btrfs "on the fly" recognize it was getting bad sector > data with the checksum being off, and checking the other drives? Or, is > it expected that I could get a bad sector read in a critical piece of > operating system and/or kernel, which could be causing my lockup issues? "With the checksums being off" is unfortunately ambiguous. Do you mean with the nodatasum mount option and/or nocow set, so btrfs wasn't checksumming, or do you mean (as I assume you do) with the checksums on, but simply failing to verify due to the hardware errors? If you mean the first... if there's no checksum to verify, as would be the case with nocow files since that turns of checksumming as well... then btrfs, as most other filesystems, simply returns whatever it gets from the hardware, because it doesn't have checksums to verify it against. But no checksum stored normally only applies to data (and a few misc things like the free-space-cache, accounting for the non-zero no- checksums numbers you may see even if you haven't turned off cow or checksumming on anything); metadata is always checksummed. If you mean the second, "off" actually meaning "on but failing to verify", as I suspect you do, then yes, btrfs should always reach for the second copy when it finds the first one invalid. But tho I'm a user not a dev and thus haven't actually checked the source code itself, my believe here is with Russ and disagrees with Austin, as based on what I've read both on the wiki and seen here previously, btrfs runtime (that is, not during scrub) actually repairs the problem on- hardware as well, from that second copy, not just fetching it for use without the repair, the distinction between normal runtime error detection and scrub thus being that scrub systematically checks everything, while normal runtime on most systems will only check the stuff it reads in normal usage, thus getting the stuff that's regularly used, but not the stuff that's only stored and never read. 
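The rerun-until-clean pattern described above, spelled out (a sketch; stop early once the end-of-run summary no longer reports unverified errors -- the exact summary wording differs between btrfs-progs versions):

    # up to five repair passes; check the printed per-device error counts after each one
    for pass in 1 2 3 4 5; do
        btrfs scrub start -B -d /terra
    done
    btrfs scrub status /terra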
*WARNING*: From my experience at least, at least on initial mount, btrfs isn't particularly robust when the number of read errors on one device start to go up dramatically. Despite never seeing an error in scrub that it couldn't fix, twice I had enough reads fail on a mount that the mount itself failed and I couldn't mount successfully despite repeated attempts. In both cases, I was able to use btrfs restore to restore the contents of the filesystem to some other place (as it happens, the reiserfs on spinning rust I use for my media filesystem, since being for big media files, that had enough space to recover the as I said above reasonably small btrfs into), and ultimate recreating the filesystem using mkfs.btrfs. But given that despite not being able to mount, neither SMART nor dmesg ever mentioned anything about the "good" device having errors, I'm left to conclude that btrfs itself ultimately crashed on attempt to mount the filesystem, even tho only the one copy was bad. After a couple of those events I started scrubbing much more frequently, thus fixing the errors while btrfs could still mount the filesystem and /let/ me run a scrub. It was actually those more frequent scrubs that quickly became the hassle and lead me to give up on the device. If btrfs had been able to fall back to the second/valid copy even in that case, as it really should have done, then I would have very possibly waited quite a bit longer to replace the dying device. So on that one I'd say to be sure, get confirmation either directly from the code (if you can read it) or from a dev who has actually looked at it and is basing his post on that, tho I still /believe/ btrfs still runtime- corrects checksumming issues actually on-device, if there's a validating second copy it can use to do so. > Question 3 - Probably doesn't matter, but how can I see which files (or > metadata to files) the 40 current bad sectors are in? (On extX, > I'd use tune2fs and debugfs to be able to see this information.) Here, a read-only scrub seemed to print the path to the bad file -- when there was one, sometimes it was a metadata block and thus not specifically identifiable. Writable scrubs seemed to print the info sometimes but not always. I'm actually confused as to why, but I did specifically observe btrfs scrub printing path names in read-only mode, that it didn't always appear to print in the scrub output. I didn't look extremely carefully, however, or compare the outputs side-by-side, so maybe I just missed it in the writable/fix-it mode output. > I do have hourly snapshots, from when it was properly running, so once > I'm that far in the process, I can also compare the most recent > snapshots, and see if there's any changes that happened to files that > shouldn't have. Hourly snapshots: Note that btrfs has significant scaling issues with snapshots, etc, when the number reaches into the tens of thousands. If you're doing such scheduled snapshots (and not already doing scheduled thinning), the strong recommendation is to schedule reasonable snapshot thinning as well. Think about it. If you need to retrieve something from a snapshot a year ago, are you going to really know or care what specific hour it was? Unlikely. You'll almost certainly be just fine finding correct day, and a year out, you'll very possibly be just fine with weekly, monthly or even quarterly, and if they haven't been thinned all those many many hourly snapshots will simply make it harder to efficiently find and use one you actually need amongst all the "noise". 
So do hourly snapshots for say six hours (6, plus upto 6 more before the thin drops 5 of them, so 12 max), then thin to six-hourly. Keep your four-a-day-six-hourly snapshots for a couple days (8-12, plus the 6-12 for the last six hours, upto 24 total), and thin to 2-a-day-12-hourly. Keep those for a week and thin to daily (12-26, upto 50 total), and those for another week (6-13, upto 63) before dropping to weekly. That's two weeks of snapshots so far. Keep the weekly snapshots out to a quarter (13 weeks so 11 more, plus another 13 before thinning, 11-24, upto 87 total). At a quarter, you really should be thinking about proper non-snapshot full data backup, if you haven't before now, after which you can drop the older snapshots, thereby freeing extents that only the old snapshots were still referencing. But you'll want to keep a quarter's snapshots at all times so will continue to accumulate another 13 weeks of snapshots before you drop the quarter back. That's a total of 100 snapshots, max. At 100 snapshots per subvolume, you can have 10 subvolume's worth before hitting 1000 snapshots on the filesystem. A target of under 1000 snapshots per filesystem should keep scaling issues due to those snapshots to a minimum. If the 100 snapshots per subvolume snapshot thinning program I suggested above is too strict for you, try to keep it to say 250 per subvolume anyway, which would give you 8 subvolume's worth at the 2000 snapshot per filesystem target. I would definitely try to keep it below that, because between there and 10k the scaling issues take larger and larger bites out of your btrfs maintenance command (check, balance) efficiency, and the time to complete those commands will go up drastically. At 100k, the time for maintenance can be weeks, so it's generally easier to just kill it and restore from backup, if indeed your pain threshold hasn't already been reached at 10k. Hopefully it's not already a problem for you... 365 days @ 24 hours per day is already ~8700 snaps, so it could be if you've been running it a year and haven't thinned, even if there's just the single subvolume being snapshotted. Similarly, BTW, with btrfs quotas, except that btrfs quotas are still broken anyway, so unless you're actively working with the devs to test/ trace/fix them, either you need quota features and thus should be using a filesystem more stable and mature than btrfs where they work reliably, or you don't, so you can run btrfs while keeping quotas off. That'll dramatically reduce the overhead/tracking work btrfs has to do right there, eliminating both that overhead and any brokenness related to btrfs quota bugs in one whack. --- [1] A number of independent btrfs... on a pair of ssds, with the ssds partitioned up identically and multiple independent small btrfs, each on its own set of parallel partitions on the two ssds. Multiple independent btrfs instead of subvolumes or similar on a single filesystem, because I don't want all my data eggs in the same single filesystem basket, such that if that single filesystem goes down, everything goes with it. [2] Why continue to run a known-dying ssd for six months? Simple. The other ssd of the pair never had a single reallocated sector or indications of any other problems the entire time, and btrfs' checksumming and data integrity features, along with backups, gave me a chance to actually play with the dying ssd for a few months without risking real data loss. 
And I had never had that opportunity before and was curious to see how the problem would develop over time, plus it gave me some real useful experience with btrfs raid1 scrubs and recoveries. So I took the opportunity that presented itself. =:^) Eventually, however, I was scrubbing and correcting significant errors after every shutdown of hours and/or after every major system update, and by then the novelty had worn off, so I eventually just gave up and did the btrfs replace to another ssd I had as a spare the entire time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
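If the hourly snapshots are managed by snapper (an assumption -- the original post does not say which tool takes them), a retention policy in the spirit of Duncan's thinning schedule can be set with something like the following, assuming a snapper config named terra already exists:

    snapper -c terra set-config TIMELINE_CREATE=yes TIMELINE_CLEANUP=yes
    snapper -c terra set-config TIMELINE_LIMIT_HOURLY=6 TIMELINE_LIMIT_DAILY=14 TIMELINE_LIMIT_WEEKLY=13 TIMELINE_LIMIT_MONTHLY=3 TIMELINE_LIMIT_YEARLY=0

Any scheme that keeps the per-subvolume snapshot count in the low hundreds achieves the same goal.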
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 18:54 ` Duncan @ 2015-10-20 19:48 ` Austin S Hemmelgarn 2015-10-20 21:24 ` Duncan 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 19:48 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 5409 bytes --] On 2015-10-20 14:54, Duncan wrote: > But tho I'm a user not a dev and thus haven't actually checked the source > code itself, my believe here is with Russ and disagrees with Austin, as > based on what I've read both on the wiki and seen here previously, btrfs > runtime (that is, not during scrub) actually repairs the problem on- > hardware as well, from that second copy, not just fetching it for use > without the repair, the distinction between normal runtime error > detection and scrub thus being that scrub systematically checks > everything, while normal runtime on most systems will only check the > stuff it reads in normal usage, thus getting the stuff that's regularly > used, but not the stuff that's only stored and never read. > > *WARNING*: From my experience at least, at least on initial mount, btrfs > isn't particularly robust when the number of read errors on one device > start to go up dramatically. Despite never seeing an error in scrub that > it couldn't fix, twice I had enough reads fail on a mount that the mount > itself failed and I couldn't mount successfully despite repeated > attempts. In both cases, I was able to use btrfs restore to restore the > contents of the filesystem to some other place (as it happens, the > reiserfs on spinning rust I use for my media filesystem, since being for > big media files, that had enough space to recover the as I said above > reasonably small btrfs into), and ultimate recreating the filesystem > using mkfs.btrfs. > > But given that despite not being able to mount, neither SMART nor dmesg > ever mentioned anything about the "good" device having errors, I'm left > to conclude that btrfs itself ultimately crashed on attempt to mount the > filesystem, even tho only the one copy was bad. After a couple of those > events I started scrubbing much more frequently, thus fixing the errors > while btrfs could still mount the filesystem and /let/ me run a scrub. > It was actually those more frequent scrubs that quickly became the hassle > and lead me to give up on the device. If btrfs had been able to fall > back to the second/valid copy even in that case, as it really should have > done, then I would have very possibly waited quite a bit longer to > replace the dying device. > > So on that one I'd say to be sure, get confirmation either directly from > the code (if you can read it) or from a dev who has actually looked at it > and is basing his post on that, tho I still /believe/ btrfs still runtime- > corrects checksumming issues actually on-device, if there's a validating > second copy it can use to do so. > FWIW, my assessment is based on some testing I did a while back (kernel 3.14 IIRC) using a VM. The (significantly summarized of course) procedure I used was: 1. Create a basic minimalistic Linux system in a VM (in my case, I just used a stage3 tarball for Gentoo, with a paravirtuaized Xen domain) using BTRFS as the root filesystem with a raid1 setup. Make sure and verify that it actually boots. 2. 
Shutdown the VM, use btrfs-progs on the host to find the physical location of an arbitrary file (ideally one that is not touched at all during the boot process; IIRC, I used one of the e2fsprogs binaries), and then intentionally clear the CRC in one of the copies of a block from the file. 3. Boot the VM, read the file. 4. Shutdown the VM again. 5. Verify whether the file block you cleared the checksum on has a valid checksum now. I repeated this more than a dozen times using different files and different methods of reading the file, and each time the CRC I had cleared was untouched. Based on this, unless BTRFS does some kind of deferred re-write that doesn't get forced during a clean unmount of the FS, I felt it was relatively safe to conclude that it did not automatically fix corrupted blocks. I did not, however, test corrupting the block itself instead of the checksum, but I doubt that that would impact anything in this case. As I mentioned, many veteran sysadmins would want the option to disable automatic fixing in the FS driver, or at least to get some kind of notification when it happens. This preference largely dates back to traditional RAID1, where the system has no way to know for certain which copy is correct in the case of a mismatch, and therefore to safely fix mismatches, the admin needs to intervene. While it is possible to fix this safely because of how BTRFS is designed, there is still the possibility of it getting things wrong. There was one time I had a BTRFS raid1 filesystem where one copy of a block got corrupted but miraculously had a correct CRC (which should be statistically next to impossible), and the other copy of the block was correct, but the CRC for it was wrong (which, while unlikely, is very much possible). In such a case (which was a serious pain to debug), automatically 'fixing' the supposedly bad block would have resulted in data loss. Of course, the chance of that happening more than once in a lifetime is astronomically small, but it is still possible. It's also worth noting that ZFS has been considered mature for more than a decade now, and the ZFS developers _still_ aren't willing to risk their users' data with something like this, which should be an immediate red flag for anyone developing a filesystem with features like ZFS. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
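For anyone wanting to reproduce a test along the lines of the procedure above, a rough sketch follows. It is not Austin's exact method: it corrupts one copy of the data block itself (easier to do with plain dd) rather than clearing the csum item, and the device name /dev/vdb, the mount point /mnt, and the file path are all assumptions. Deriving the raw device offset from the btrfs logical address is deliberately left out.

    # While the filesystem is mounted, find the btrfs logical address of the
    # test file's first extent (on btrfs, filefrag's "physical" column is the
    # filesystem logical address, not a raw device offset)
    filefrag -v /mnt/usr/bin/testfile

    # Shut the VM down / unmount, then overwrite ONE copy of that block on ONE
    # device.  OFFSET must first be set to the raw byte offset on /dev/vdb;
    # mapping the logical address to it via the chunk tree (e.g. with
    # btrfs inspect-internal dump-tree -t chunk in current btrfs-progs) is
    # outside the scope of this sketch.
    dd if=/dev/urandom of=/dev/vdb bs=4096 count=1 seek=$((OFFSET / 4096)) conv=notrunc

    # Mount/boot again and read the file: the read should succeed from the
    # good mirror while the bad copy shows up as a csum failure in dmesg
    cat /mnt/usr/bin/testfile > /dev/null
    dmesg | grep -i 'csum failed'

    # Then check whether the bad copy was rewritten on disk, e.g. with a
    # read-only scrub (reports errors without fixing them)
    btrfs scrub start -Bdr /mnt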
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:48 ` Austin S Hemmelgarn @ 2015-10-20 21:24 ` Duncan 0 siblings, 0 replies; 15+ messages in thread From: Duncan @ 2015-10-20 21:24 UTC (permalink / raw) To: linux-btrfs Austin S Hemmelgarn posted on Tue, 20 Oct 2015 15:48:07 -0400 as excerpted: > FWIW, my assessment is based on some testing I did a while back (kernel > 3.14 IIRC) using a VM. The (significantly summarized of course) > procedure I used was: > 1. Create a basic minimalistic Linux system in a VM (in my case, I just > used a stage3 tarball for Gentoo, with a paravirtualized Xen domain) > using BTRFS as the root filesystem with a raid1 setup. Make sure and > verify that it actually boots. > 2. Shutdown the VM, use btrfs-progs on the host to find the physical > location of an arbitrary file (ideally one that is not touched at all > during the boot process; IIRC, I used one of the e2fsprogs > binaries), and then intentionally clear the CRC in one of the copies of > a block from the file. > 3. Boot the VM, read the file. > 4. Shutdown the VM again. > 5. Verify whether the file block you cleared the checksum on has a valid > checksum now. > > I repeated this more than a dozen times using different files and > different methods of reading the file, and each time the CRC I had > cleared was untouched. Based on this, unless BTRFS does some kind of > deferred re-write that doesn't get forced during a clean unmount of the > FS, I felt it was relatively safe to conclude that it did not > automatically fix corrupted blocks. I did not, however, test corrupting > the block itself instead of the checksum, but I doubt that that would > impact anything in this case. AFAIK: 1) It would only run into the corruption if the raid1 read-scheduler picked that copy based on the even/odd of the requesting PID. However, statistically that should be a 50% hit rate, and if you tested more than a dozen times, you'd need quite the luck to fail to hit it on at least /one/ of them. 2) (Based on what I understood from the discussion of btrfs check's init-csum-tree patches a couple cycles ago, before which it was clearing but not reinitializing...) Btrfs interprets missing checksums differently than invalid checksums. Would your "cleared" CRC be interpreted as invalid or missing? If missing, AFAIK it would leave it missing. In which case corrupting the data block itself would indeed have had a different result than "clearing" the csum, tho simply corrupting the csum should have resulted in an update. However, by actually testing you've gone farther than I have, and pending further info to the contrary, I'll yield to that, changing my own thoughts on the matter as well, to "I formerly thought... but someone's testing some versions ago anyway suggested otherwise, so being too lazy to actually do my own testing, I'll cautiously agree with the results of his." =:^) Thanks. I'd rather find out I was wrong than not find out! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
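Regarding point 1) above: if the worry is that the 50% PID-parity read scheduler kept steering reads away from the corrupted mirror, one way to be reasonably sure both copies actually get read is to repeat the read from freshly spawned processes and drop the page cache in between. A minimal sketch (run as root; the file path is assumed):

    # Each subshell gets a new PID, so over a handful of iterations both even
    # and odd PID parities -- and hence both raid1 mirrors, on kernels that
    # pick the copy by PID parity -- should be exercised.
    for i in $(seq 8); do
        sync
        echo 3 > /proc/sys/vm/drop_caches    # force the next read to hit the disk
        ( cat /mnt/usr/bin/testfile > /dev/null )
    done
    dmesg | grep -i csum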
end of thread, other threads:[~2015-10-21 17:29 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey 2015-10-20 4:45 ` Russell Coker 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 13:15 ` Russell Coker 2015-10-20 13:59 ` Austin S Hemmelgarn 2015-10-20 19:20 ` Duncan 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 2015-10-21 12:07 ` Austin S Hemmelgarn 2015-10-21 16:01 ` Chris Murphy 2015-10-21 17:28 ` Austin S Hemmelgarn 2015-10-20 18:54 ` Duncan 2015-10-20 19:48 ` Austin S Hemmelgarn 2015-10-20 21:24 ` Duncan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).