* Expected behavior of bad sectors on one drive in a RAID1 @ 2015-10-20 4:16 james harvey 2015-10-20 4:45 ` Russell Coker 2015-10-20 18:54 ` Duncan 0 siblings, 2 replies; 15+ messages in thread From: james harvey @ 2015-10-20 4:16 UTC (permalink / raw) To: linux-btrfs Background ----- My fileserver had a "bad event" last week. Shut it down normally to add a new hard drive, and it would no longer post. Tried about 50 times, doing the typical everything non-essential unplugged, trying 1 of 4 memory modules at a time, and 1 of 2 processors at a time. Got no where. Inexpensive HP workstation, so purchased a used identical model (complete other than hard drives) on eBay. Replacement arrived today. Posts fine. Moved hard drives over (again, identical model, and Arch Linux not Windows) and it started giving "Watchdog detected hard LOCKUP" type errors I've never seen before. Decided I'd diagnose which part in the original server was bad. By sitting turned off for a week, it suddenly started posting just fine. But, with the hard drives back in it, I'm getting the same hard lockup errors. An Arch ISO DVD runs stress testing perfectly. Btrfs-specific ----- The current problem I'm having must be a bad hard drive or corrupted data. 3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of data, and 1GB of the 1GB of metadata. sda appears to be going bad, with my low threshold of "going bad", and will be replaced ASAP. It just developed 16 reallocated sectors, and has 40 current pending sectors. I'm currently running a "btrfs scrub start -B -d -r /terra", which status on another term shows me has found 32 errors after running for an hour. Question 1 - I'm expecting if I re-run the scrub without the read-only option, that it will detect from the checksum data which sector is correct, and re-write to the drive with bad sectors the data to a new sector. Correct? Question 2 - Before having ran the scrub, booting off the raid with bad sectors, would btrfs "on the fly" recognize it was getting bad sector data with the checksum being off, and checking the other drives? Or, is it expected that I could get a bad sector read in a critical piece of operating system and/or kernel, which could be causing my lockup issues? Question 3 - Probably doesn't matter, but how can I see which files (or metadata to files) the 40 current bad sectors are in? (On extX, I'd use tune2fs and debugfs to be able to see this information.) I do have hourly snapshots, from when it was properly running, so once I'm that far in the process, I can also compare the most recent snapshots, and see if there's any changes that happened to files that shouldn't have. ^ permalink raw reply [flat|nested] 15+ messages in thread
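For reference, the commands involved in the workflow James describes, collected in one place (a sketch; the device name and mount point are taken from his message, run as root):

    # the SMART counters mentioned above
    smartctl -A /dev/sda | grep -Ei 'reallocated_sector|current_pending'
    # read-only scrub: report checksum errors, change nothing
    btrfs scrub start -B -d -r /terra
    # repair scrub: rewrites copies that fail verification from the good mirror
    btrfs scrub start -B -d /terra
    # progress and error counts, from another terminal
    btrfs scrub status /terra

As discussed further down the thread, the repair pass only helps where the second copy is intact.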
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey @ 2015-10-20 4:45 ` Russell Coker 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 18:54 ` Duncan 1 sibling, 1 reply; 15+ messages in thread From: Russell Coker @ 2015-10-20 4:45 UTC (permalink / raw) To: james harvey; +Cc: linux-btrfs On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote: > sda appears to be going bad, with my low threshold of "going bad", and > will be replaced ASAP. It just developed 16 reallocated sectors, and > has 40 current pending sectors. > > I'm currently running a "btrfs scrub start -B -d -r /terra", which > status on another term shows me has found 32 errors after running for > an hour. https://www.gnu.org/software/ddrescue/ At this stage I would use ddrescue or something similar to copy data from the failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing data. I wouldn't remove the disk entirely because then you lose badly if you get another failure. I wouldn't use a BTRFS replace because you already have the system apart and I expect ddrescue could copy the data faster. Also as the drive has been causing system failures (I'm guessing a problem with the power connector) you REALLY don't want BTRFS to corrupt data on the other disks. If you have a system with the failing disk and a new disk attached then there's no risk of further contamination. > Question 2 - Before having ran the scrub, booting off the raid with > bad sectors, would btrfs "on the fly" recognize it was getting bad > sector data with the checksum being off, and checking the other > drives? Or, is it expected that I could get a bad sector read in a > critical piece of operating system and/or kernel, which could be > causing my lockup issues? Unless you have disabled CoW then BTRFS will not return bad data. > Question 3 - Probably doesn't matter, but how can I see which files > (or metadata to files) the 40 current bad sectors are in? (On extX, > I'd use tune2fs and debugfs to be able to see this information.) Read all the files in the system and syslog will report it. But really don't do that until after you have copied the disk. > I do have hourly snapshots, from when it was properly running, so once > I'm that far in the process, I can also compare the most recent > snapshots, and see if there's any changes that happened to files that > shouldn't have. Snapshots refer to the same data blocks, so if a data block is corrupted in a way that BTRFS doesn't notice (which should be almost impossible) then all snapshots will have it. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 15+ messages in thread
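A minimal ddrescue invocation for the copy step Russell suggests (a sketch; /dev/sdX stands for the fresh disk, and the map file should live on some other filesystem):

    # first pass: copy everything that reads cleanly, skip the slow scraping of bad areas
    ddrescue -f -n /dev/sda /dev/sdX /root/sda-rescue.map
    # second pass: go back and retry the bad areas a few times
    ddrescue -f -r3 /dev/sda /dev/sdX /root/sda-rescue.map

The caveat in the next message applies: keep the old drive out of the kernel's sight before mounting the copy.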
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:45 ` Russell Coker @ 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 13:15 ` Russell Coker 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 13:00 UTC (permalink / raw) To: Russell Coker, james harvey; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3331 bytes --] On 2015-10-20 00:45, Russell Coker wrote: > On Tue, 20 Oct 2015 03:16:15 PM james harvey wrote: >> sda appears to be going bad, with my low threshold of "going bad", and >> will be replaced ASAP. It just developed 16 reallocated sectors, and >> has 40 current pending sectors. >> >> I'm currently running a "btrfs scrub start -B -d -r /terra", which >> status on another term shows me has found 32 errors after running for >> an hour. > > https://www.gnu.org/software/ddrescue/ > > At this stage I would use ddrescue or something similar to copy data from the > failing disk to a fresh disk, then do a BTRFS scrub to regenerate the missing > data. > > I wouldn't remove the disk entirely because then you lose badly if you get > another failure. I wouldn't use a BTRFS replace because you already have the > system apart and I expect ddrescue could copy the data faster. Also as the > drive has been causing system failures (I'm guessing a problem with the power > connector) you REALLY don't want BTRFS to corrupt data on the other disks. If > you have a system with the failing disk and a new disk attached then there's > no risk of further contamination. BIG DISCLAIMER: For the filesystem to be safely mountable it is ABSOLUTELY NECESSARY to remove the old disk after doing a block level copy of it. By all means, keep the disk around, but do not keep it visible to the kernel after doing a block level copy of it. Also, you will probably have to run 'btrfs device scan' after copying the disk and removing it for the filesystem to work right. This is an inherent result of how BTRFS's multi-device functionality works, and also applies to doing stuff like LVM snapshots of BTRFS filesystems. > >> Question 2 - Before having ran the scrub, booting off the raid with >> bad sectors, would btrfs "on the fly" recognize it was getting bad >> sector data with the checksum being off, and checking the other >> drives? Or, is it expected that I could get a bad sector read in a >> critical piece of operating system and/or kernel, which could be >> causing my lockup issues? > > Unless you have disabled CoW then BTRFS will not return bad data. It is worth clarifying also that: a. While BTRFS will not return bad data in this case, it also won't automatically repair the corruption. b. In the unlikely event that both copies are bad, trying to read the data will return an IO error. c. It is theoretically possible (although statistically impossible) that the block could become corrupted, but the checksum could still be correct (CRC32c is good at detecting small errors, but it's not hard to generate a hash collision for any arbitrary value, so if a large portion of the block goes bad, then it can theoretically still have a valid checksum). > >> Question 3 - Probably doesn't matter, but how can I see which files >> (or metadata to files) the 40 current bad sectors are in? (On extX, >> I'd use tune2fs and debugfs to be able to see this information.) > > Read all the files in the system and syslog will report it. But really don't > do that until after you have copied the disk. 
It may also be possible to use some of the debug tools from BTRFS to do this without hitting the disks so hard, but it will likely take a lot more effort. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
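In practice, the sequence Austin describes looks something like this (a sketch; it assumes the block-level copy has already been made and the failing drive has been physically detached or otherwise hidden from the kernel, with /dev/sdX standing in for whatever name the new disk gets):

    # re-register the devices that belong to the filesystem
    btrfs device scan
    # confirm the copied-to disk is listed and nothing is reported missing
    btrfs filesystem show
    mount /dev/sdX /terra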
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:00 ` Austin S Hemmelgarn @ 2015-10-20 13:15 ` Russell Coker 2015-10-20 13:59 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Russell Coker @ 2015-10-20 13:15 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: james harvey, linux-btrfs On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote: > > https://www.gnu.org/software/ddrescue/ > > > > At this stage I would use ddrescue or something similar to copy data from > > the failing disk to a fresh disk, then do a BTRFS scrub to regenerate > > the missing data. > > > > I wouldn't remove the disk entirely because then you lose badly if you > > get another failure. I wouldn't use a BTRFS replace because you already > > have the system apart and I expect ddrescue could copy the data faster. > > Also as the drive has been causing system failures (I'm guessing a > > problem with the power connector) you REALLY don't want BTRFS to corrupt > > data on the other disks. If you have a system with the failing disk and > > a new disk attached then there's no risk of further contamination. > > BIG DISCLAIMER: For the filesystem to be safely mountable it is > ABSOLUTELY NECESSARY to remove the old disk after doing a block level You are correct, my message wasn't clear. What I meant to say is that doing a "btrfs device remove" or "btrfs replace" is generally a bad idea in such a situation. "btrfs replace" is pretty good if you are replacing a disk with a larger one or replacing a disk that has only minor errors (a disk that just gets a few bad sectors is unlikely to get many more in a hurry). > copy of it. By all means, keep the disk around, but do not keep it > visible to the kernel after doing a block level copy of it. Also, you > will probably have to run 'btrfs device scan' after copying the disk and > removing it for the filesystem to work right. This is an inherent > result of how BTRFS's multi-device functionality works, and also applies > to doing stuff like LVM snapshots of BTRFS filesystems. Good advice. I recommend just rebooting the system. I think that if anyone who has the background knowledge to do such things without rebooting will probably just do it without needing to ask us for advice. > >> Question 2 - Before having ran the scrub, booting off the raid with > >> bad sectors, would btrfs "on the fly" recognize it was getting bad > >> sector data with the checksum being off, and checking the other > >> drives? Or, is it expected that I could get a bad sector read in a > >> critical piece of operating system and/or kernel, which could be > >> causing my lockup issues? > > > > Unless you have disabled CoW then BTRFS will not return bad data. > > It is worth clarifying also that: > a. While BTRFS will not return bad data in this case, it also won't > automatically repair the corruption. Really? If so I think that's a bug in BTRFS. When mounted rw I think that every time corruption is discovered it should be automatically fixed. > b. In the unlikely event that both copies are bad, trying to read the > data will return an IO error. > c. It is theoretically possible (although statistically impossible) that > the block could become corrupted, but the checksum could still be > correct (CRC32c is good at detecting small errors, but it's not hard to > generate a hash collision for any arbitrary value, so if a large portion > of the block goes bad, then it can theoretically still have a valid > checksum). 
It would be interesting to see some research into how CRC32 fits with the more common disk errors. For a disk to return bad data and claim it to be good the data must either be a misplaced write or read (which is almost certain to be caught by BTRFS as the metadata won't match), or a random sector that matches the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC protected block much more difficult? > >> Question 3 - Probably doesn't matter, but how can I see which files > >> (or metadata to files) the 40 current bad sectors are in? (On extX, > >> I'd use tune2fs and debugfs to be able to see this information.) > > > > Read all the files in the system and syslog will report it. But really > > don't do that until after you have copied the disk. > > It may also be possible to use some of the debug tools from BTRFS to do > this without hitting the disks so hard, but it will likely take a lot > more effort. I don't think that you can do that without hitting the disks hard. That said last time I checked (last time an executive of a hard drive manufacturer was willing to talk to me) drives were apparently designed to perform any sequence of operations for their warranty period. So for a disk that is believed to be good this shouldn't be a problem. For a disk that is known to be dying it would be a really bad idea to do anything other than copy the data off at maximum speed. -- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/ ^ permalink raw reply [flat|nested] 15+ messages in thread
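If you do go the read-everything route once the disk has been copied, it is only a couple of commands (a sketch; the mount point is from the original post and the exact kernel log wording varies between kernel versions):

    # read every file and throw the data away; checksum failures land in the kernel log
    find /terra -xdev -type f -exec cat {} + > /dev/null
    dmesg | grep -i 'csum failed'
    # per-device error counters kept by btrfs
    btrfs device stats /terra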
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:15 ` Russell Coker @ 2015-10-20 13:59 ` Austin S Hemmelgarn 2015-10-20 19:20 ` Duncan 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 13:59 UTC (permalink / raw) To: Russell Coker; +Cc: james harvey, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 7425 bytes --] On 2015-10-20 09:15, Russell Coker wrote: > On Wed, 21 Oct 2015 12:00:59 AM Austin S Hemmelgarn wrote: >>> https://www.gnu.org/software/ddrescue/ >>> >>> At this stage I would use ddrescue or something similar to copy data from >>> the failing disk to a fresh disk, then do a BTRFS scrub to regenerate >>> the missing data. >>> >>> I wouldn't remove the disk entirely because then you lose badly if you >>> get another failure. I wouldn't use a BTRFS replace because you already >>> have the system apart and I expect ddrescue could copy the data faster. >>> Also as the drive has been causing system failures (I'm guessing a >>> problem with the power connector) you REALLY don't want BTRFS to corrupt >>> data on the other disks. If you have a system with the failing disk and >>> a new disk attached then there's no risk of further contamination. >> >> BIG DISCLAIMER: For the filesystem to be safely mountable it is >> ABSOLUTELY NECESSARY to remove the old disk after doing a block level > > You are correct, my message wasn't clear. > > What I meant to say is that doing a "btrfs device remove" or "btrfs replace" > is generally a bad idea in such a situation. "btrfs replace" is pretty good > if you are replacing a disk with a larger one or replacing a disk that has > only minor errors (a disk that just gets a few bad sectors is unlikely to get > many more in a hurry). I kind of figured that was what you meant, I just wanted to make it as clear as possible, because this is something that has bitten me in the past. It's worth noting though that there is an option for 'btrfs replace' to avoid reading from the device being replaced if at all possible. I've used that option myself a couple of times when re-provisioning my systems, and it works well (although I used it to just control what disks were getting IO sent to them, not because any of the were bad). > >> copy of it. By all means, keep the disk around, but do not keep it >> visible to the kernel after doing a block level copy of it. Also, you >> will probably have to run 'btrfs device scan' after copying the disk and >> removing it for the filesystem to work right. This is an inherent >> result of how BTRFS's multi-device functionality works, and also applies >> to doing stuff like LVM snapshots of BTRFS filesystems. > > Good advice. I recommend just rebooting the system. I think that if anyone > who has the background knowledge to do such things without rebooting will > probably just do it without needing to ask us for advice. Normally I would agree, but given the boot issues that were mentioned WRT the system in question, it may be safer to just use 'btrfs dev scan' without rebooting (unless of course the system doesn't properly support SATA hot-plug/hot-remove). > >>>> Question 2 - Before having ran the scrub, booting off the raid with >>>> bad sectors, would btrfs "on the fly" recognize it was getting bad >>>> sector data with the checksum being off, and checking the other >>>> drives? Or, is it expected that I could get a bad sector read in a >>>> critical piece of operating system and/or kernel, which could be >>>> causing my lockup issues? 
>>> >>> Unless you have disabled CoW then BTRFS will not return bad data. >> >> It is worth clarifying also that: >> a. While BTRFS will not return bad data in this case, it also won't >> automatically repair the corruption. > > Really? If so I think that's a bug in BTRFS. When mounted rw I think that > every time corruption is discovered it should be automatically fixed. That's debatable. While it is safer to try and do this with BTRFS than say with MD-RAID, it's still not something many seasoned system administrators would want happening behind their back. It's worth noting that ZFS does not automatically fix errors, it just reports them and works around them, and many distributed storage options (like Ceph for example) behave like this also. All that the checksum mismatch really tells you is that at some point, the data got corrupted, it could be that the copy on the disk is bad, but it could also be caused by bad RAM, a bad storage controller, a loose cable, or even a bad power supply. > >> b. In the unlikely event that both copies are bad, trying to read the >> data will return an IO error. >> c. It is theoretically possible (although statistically impossible) that >> the block could become corrupted, but the checksum could still be >> correct (CRC32c is good at detecting small errors, but it's not hard to >> generate a hash collision for any arbitrary value, so if a large portion >> of the block goes bad, then it can theoretically still have a valid >> checksum). > > It would be interesting to see some research into how CRC32 fits with the more > common disk errors. For a disk to return bad data and claim it to be good the > data must either be a misplaced write or read (which is almost certain to be > caught by BTRFS as the metadata won't match), or a random sector that matches > the disk's CRC. Is generating a hash collision for a CRC32 inside a CRC > protected block much more difficult? In general, most disk errors will be just a few flipped bits. For a single bit flip in a data stream, a CRC is 100% guaranteed to change, the same goes for any odd number of bit flips in the data stream. For an even number of bit flips however, the chance that there will be a collision is proportionate to the size of the CRC, and for 32-bits it's a statistical impossibility that there will be a collision due to two bits flipping without there being some malicious intent involved. Once you get to larger numbers of bit flips and bigger blocks of data, it becomes more likely. The chances of a collision with a 4k block with any random set of bit flips is astronomically small, and it's only marginally larger with 16k blocks (which are the default right now for BTRFS). > >>>> Question 3 - Probably doesn't matter, but how can I see which files >>>> (or metadata to files) the 40 current bad sectors are in? (On extX, >>>> I'd use tune2fs and debugfs to be able to see this information.) >>> >>> Read all the files in the system and syslog will report it. But really >>> don't do that until after you have copied the disk. >> >> It may also be possible to use some of the debug tools from BTRFS to do >> this without hitting the disks so hard, but it will likely take a lot >> more effort. > > I don't think that you can do that without hitting the disks hard. Ah, you're right, I forgot that there's no way on most hard disks to get the LBA's of the reallocated sectors, which would be required to use the debug tools to get the files. 
> > That said last time I checked (last time an executive of a hard drive > manufacturer was willing to talk to me) drives were apparently designed to > perform any sequence of operations for their warranty period. So for a disk > that is believed to be good this shouldn't be a problem. For a disk that is > known to be dying it would be a really bad idea to do anything other than copy > the data off at maximum speed. Well yes, but the less stress you put on something, the longer it's likely to last. And if you actually care about the data, you should have backups (or some other way of trivially reproducing it) [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
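For completeness, the replace variant Austin mentions, which reads from the failing device only when no good mirror exists (a sketch; /dev/sdX is the new disk and /terra the mount point from the original post):

    btrfs replace start -r /dev/sda /dev/sdX /terra
    btrfs replace status /terra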
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 13:59 ` Austin S Hemmelgarn @ 2015-10-20 19:20 ` Duncan 2015-10-20 19:59 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Duncan @ 2015-10-20 19:20 UTC (permalink / raw) To: linux-btrfs Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as excerpted: >>> It is worth clarifying also that: >>> a. While BTRFS will not return bad data in this case, it also won't >>> automatically repair the corruption. >> >> Really? If so I think that's a bug in BTRFS. When mounted rw I think >> that every time corruption is discovered it should be automatically >> fixed. > That's debatable. While it is safer to try and do this with BTRFS than > say with MD-RAID, it's still not something many seasoned system > administrators would want happening behind their back. It's worth > noting that ZFS does not automatically fix errors, it just reports them > and works around them, and many distributed storage options (like Ceph > for example) behave like this also. All that the checksum mismatch > really tells you is that at some point, the data got corrupted, it could > be that the copy on the disk is bad, but it could also be caused by bad > RAM, a bad storage controller, a loose cable, or even a bad power > supply. There's a significant difference between btrfs in dup/raid1/raid10 modes anyway and some of the others you mentioned, however. Btrfs in these modes actually has a second copy of the data itself available. That's a world of difference compared to parity, for instance. With parity you're reconstructing the data and thus have dangers such as the write hole, and the possibility of bad-ram corrupting the data before it was ever saved (this last one being the reason zfs has such strong recommendations/ warnings regarding the use of non-ecc RAM, based on what a number of posters with zfs experience have said, here). With btrfs, there's an actual second copy, with both copies covered by checksum. If one of the copies verifies against its checksum and the other doesn't, the odds of the one that verifies being any worse than the one that doesn't are... pretty slim, to say the least. (So slim I'd intuitively compare them to the odds of getting hit by lightning, tho I've no idea what the mathematically rigorous comparison might be.) Yes, there's some small but not infinitesimal chance the checksum may be wrong, but if there's two copies of the data and the checksum on one is wrong while the checksum on the other verifies... yes, there's still that small chance that the one that verifies is wrong too, but that it's any worse than the one that does not verify? /That's/ getting close to infinitesimal, or at least close enough for the purposes of a mailing- list claim without links to supporting evidence by someone who has already characterized it as not mathematically rigorous... and for me, personally. I'm not spending any serious time thinking about getting hit by lightening, either, tho by the same token I don't go out flying kites or waving long metal rods around in lightning storms, either. Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable or mature, and that the chances of just plain old bugs killing the filesystem are far *FAR* higher than of a verified-checksum copy being any worse than a failed-checksum copy. If you're worried about that at this point, why are you even on the btrfs list in the first place? -- Duncan - List replies preferred. No HTML msgs. 
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:20 ` Duncan @ 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 0 siblings, 2 replies; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 19:59 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 4604 bytes --] On 2015-10-20 15:20, Duncan wrote: > Austin S Hemmelgarn posted on Tue, 20 Oct 2015 09:59:17 -0400 as > excerpted: > > >>>> It is worth clarifying also that: >>>> a. While BTRFS will not return bad data in this case, it also won't >>>> automatically repair the corruption. >>> >>> Really? If so I think that's a bug in BTRFS. When mounted rw I think >>> that every time corruption is discovered it should be automatically >>> fixed. >> That's debatable. While it is safer to try and do this with BTRFS than >> say with MD-RAID, it's still not something many seasoned system >> administrators would want happening behind their back. It's worth >> noting that ZFS does not automatically fix errors, it just reports them >> and works around them, and many distributed storage options (like Ceph >> for example) behave like this also. All that the checksum mismatch >> really tells you is that at some point, the data got corrupted, it could >> be that the copy on the disk is bad, but it could also be caused by bad >> RAM, a bad storage controller, a loose cable, or even a bad power >> supply. > > There's a significant difference between btrfs in dup/raid1/raid10 modes > anyway and some of the others you mentioned, however. Btrfs in these > modes actually has a second copy of the data itself available. That's a > world of difference compared to parity, for instance. With parity you're > reconstructing the data and thus have dangers such as the write hole, and > the possibility of bad-ram corrupting the data before it was ever saved > (this last one being the reason zfs has such strong recommendations/ > warnings regarding the use of non-ecc RAM, based on what a number of > posters with zfs experience have said, here). With btrfs, there's an > actual second copy, with both copies covered by checksum. If one of the > copies verifies against its checksum and the other doesn't, the odds of > the one that verifies being any worse than the one that doesn't are... > pretty slim, to say the least. (So slim I'd intuitively compare them to > the odds of getting hit by lightning, tho I've no idea what the > mathematically rigorous comparison might be.) ZFS doesn't just do parity, it also does RAID1 and RAID10 (and RAID0, although I doubt that most people actually use that with ZFS), and Ceph uses n-way replication by default, not erasure coding (which is technically a super-set of the parity algorithms used for RAID[56]). In both cases, they behave just like BTRFS, they log the error and fetch a good copy to return to userspace, but do not modify the copy with the error unless explicitly told to do so. > > Yes, there's some small but not infinitesimal chance the checksum may be > wrong, but if there's two copies of the data and the checksum on one is > wrong while the checksum on the other verifies... yes, there's still that > small chance that the one that verifies is wrong too, but that it's any > worse than the one that does not verify? 
/That's/ getting close to > infinitesimal, or at least close enough for the purposes of a mailing- > list claim without links to supporting evidence by someone who has > already characterized it as not mathematically rigorous... and for me, > personally. I'm not spending any serious time thinking about getting hit > by lightening, either, tho by the same token I don't go out flying kites > or waving long metal rods around in lightning storms, either. With a 32-bit checksum and a 4k block (the math is easier with smaller numbers), that's 4128 bits, which means that a random single bit error will have a approximately 0.24% chance of occurring in a given bit, which translates to an approximately 7.75% chance that it will occur in one of the checksum bits. For a 16k block it's smaller of course (around 1.8% I think, but that's just a guess), but it's still sufficiently statistically likely that it should be considered. > > Meanwhile, it's worth noting that btrfs itself isn't yet entirely stable > or mature, and that the chances of just plain old bugs killing the > filesystem are far *FAR* higher than of a verified-checksum copy being > any worse than a failed-checksum copy. If you're worried about that at > this point, why are you even on the btrfs list in the first place? Actually, the improved data safety relative to ext4 is just a bonus for me, my biggest reason for using BTRFS is the ease of reprovisioning (there are few other ways to move entire systems to new storage devices online with zero downtime). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:59 ` Austin S Hemmelgarn @ 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 1 sibling, 0 replies; 15+ messages in thread From: Tim Walberg @ 2015-10-20 20:54 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Duncan, linux-btrfs On 10/20/2015 15:59 -0400, Austin S Hemmelgarn wrote: >> ......... >> With a 32-bit checksum and a 4k block (the math is easier with >> smaller numbers), that's 4128 bits, which means that a random >> single bit error will have a approximately 0.24% chance of >> occurring in a given bit, which translates to an approximately >> 7.75% chance that it will occur in one of the checksum bits. For a >> 16k block it's smaller of course (around 1.8% I think, but that's >> just a guess), but it's still sufficiently statistically likely >> that it should be considered. >> ......... Last I checked, a 4 kilo-BYTE block consisted of 32768 BITs... So the percentages should in fact be considerably smaller than that. -- twalberg@gmail.com ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg @ 2015-10-21 11:51 ` Austin S Hemmelgarn 2015-10-21 12:07 ` Austin S Hemmelgarn 1 sibling, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 11:51 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3763 bytes --] On 2015-10-20 15:59, Austin S Hemmelgarn wrote: > On 2015-10-20 15:20, Duncan wrote: >> Yes, there's some small but not infinitesimal chance the checksum may be >> wrong, but if there's two copies of the data and the checksum on one is >> wrong while the checksum on the other verifies... yes, there's still that >> small chance that the one that verifies is wrong too, but that it's any >> worse than the one that does not verify? /That's/ getting close to >> infinitesimal, or at least close enough for the purposes of a mailing- >> list claim without links to supporting evidence by someone who has >> already characterized it as not mathematically rigorous... and for me, >> personally. I'm not spending any serious time thinking about getting hit >> by lightening, either, tho by the same token I don't go out flying kites >> or waving long metal rods around in lightning storms, either. > With a 32-bit checksum and a 4k block (the math is easier with smaller > numbers), that's 4128 bits, which means that a random single bit error > will have a approximately 0.24% chance of occurring in a given bit, > which translates to an approximately 7.75% chance that it will occur in > one of the checksum bits. For a 16k block it's smaller of course > (around 1.8% I think, but that's just a guess), but it's still > sufficiently statistically likely that it should be considered. As mentioned in my other reply to this, I did the math wrong (bit of a difference between kilobit and kilobyte), so here's a (hopefully) correct and more thorough analysis: For 4kb blocks (32768 bits): There are a total of 32800 bits when including a 32 bit checksum outside the block, this makes the chance of a single bit error in either the block or the checksum ~0.30%. This in turn means an approximately 9.7% chance of a single bit error in the checksum. For 16kb blocks (131072 bits): There are a total of 131104 bits when including a 32 bit checksum outside the block, this makes the chance of a single bit error in either the block or the checksum ~0.07%. This in turn means an approximately 2.4% chance of a single bit error in the checksum. This all of course assumes a naive interpretation of how modern block storage devices work. All modern hard drives and SSD's include at a minimum the ability to correct single bit errors per byte, and detect double bit errors per byte, which means that we need a triple bit error in the same byte to get bad data back, which in turn makes the numbers small enough that it's impractical to represent them without scientific notation (on the order of 10^-5). That in turn assumes zero correlation beyond what's required to get bad data back from the storage, however, if there is enough correlation for that to happen, it's statistically likely that there will be other errors very close by. 
This in turn means that it's more likely that the checksum is either correct or absolutely completely wrong, which increases the chances that the resultant metadata block containing the checksum will not appear to have an incorrect checksum itself (because checksums are good at detecting proportionately small errors, but only mediocre at detecting very big errors). The approximate proportionate chances of an error in the data versus the checksum, however, are still roughly the same, irrespective of how small the chances of getting any error are. Based on this, the ratio of the size of the checksum to the size of the data is a tradeoff that needs to be considered: the closer the ratio is to 1, the higher the chance of having an error in the checksum, but the less data you need to correct/verify when there is an error. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
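For anyone who wants to reproduce the ratio being discussed: under the simple model used in this subthread (one uniformly random bit flip, a 32-bit checksum counted alongside the block it protects), the share of flips that land in the checksum is just 32 / (block_bits + 32). A one-liner that prints it for both block sizes, as a percentage:

    awk 'BEGIN { for (bits = 32768; bits <= 131072; bits *= 4) printf "%6d data bits: %.4f%% of the protected bits are checksum bits\n", bits, 100 * 32 / (bits + 32) }'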
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 11:51 ` Austin S Hemmelgarn @ 2015-10-21 12:07 ` Austin S Hemmelgarn 2015-10-21 16:01 ` Chris Murphy 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 12:07 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 1795 bytes --] On 2015-10-21 07:51, Austin S Hemmelgarn wrote: > On 2015-10-20 15:59, Austin S Hemmelgarn wrote: >> On 2015-10-20 15:20, Duncan wrote: >>> Yes, there's some small but not infinitesimal chance the checksum may be >>> wrong, but if there's two copies of the data and the checksum on one is >>> wrong while the checksum on the other verifies... yes, there's still >>> that >>> small chance that the one that verifies is wrong too, but that it's any >>> worse than the one that does not verify? /That's/ getting close to >>> infinitesimal, or at least close enough for the purposes of a mailing- >>> list claim without links to supporting evidence by someone who has >>> already characterized it as not mathematically rigorous... and for me, >>> personally. I'm not spending any serious time thinking about getting >>> hit >>> by lightening, either, tho by the same token I don't go out flying kites >>> or waving long metal rods around in lightning storms, either. >> With a 32-bit checksum and a 4k block (the math is easier with smaller >> numbers), that's 4128 bits, which means that a random single bit error >> will have a approximately 0.24% chance of occurring in a given bit, >> which translates to an approximately 7.75% chance that it will occur in >> one of the checksum bits. For a 16k block it's smaller of course >> (around 1.8% I think, but that's just a guess), but it's still >> sufficiently statistically likely that it should be considered. > As mentioned in my other reply to this, I did the math wrong (bit of a > difference between kilobit and kilobyte) And I realize of course right after sending this that my other reply didn't get through because GMail refuses to send mail in plain text, no matter how hard I beat it over the head... [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 12:07 ` Austin S Hemmelgarn @ 2015-10-21 16:01 ` Chris Murphy 2015-10-21 17:28 ` Austin S Hemmelgarn 0 siblings, 1 reply; 15+ messages in thread From: Chris Murphy @ 2015-10-21 16:01 UTC (permalink / raw) To: Austin S Hemmelgarn; +Cc: Btrfs BTRFS On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn <ahferroin7@gmail.com> wrote: > And I realize of course right after sending this that my other reply didn't > get through because GMail refuses to send mail in plain text, no matter how > hard I beat it over the head... In the web browser version, to the right of the trash can for an email being written, there is an arrow with a drop down menu that includes "plain text mode" option which will work. This is often sticky, but randomly with the btrfs list the replies won't have this option checked and then they bounce. It's annoying. And then both the Gmail and Inbox Android apps have no such option so it's not possible reply to list emails from a mobile device short of changing mail clients just for this purpose. The smarter thing to do is server side conversion of HTML to plain text, stripping superfluous formatting. Bouncing mails is just as bad a UX as Google not providing a plain text option in their mobile apps. -- Chris Murphy ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-21 16:01 ` Chris Murphy @ 2015-10-21 17:28 ` Austin S Hemmelgarn 0 siblings, 0 replies; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-21 17:28 UTC (permalink / raw) To: Chris Murphy; +Cc: Btrfs BTRFS [-- Attachment #1: Type: text/plain, Size: 1617 bytes --] On 2015-10-21 12:01, Chris Murphy wrote: > On Wed, Oct 21, 2015 at 2:07 PM, Austin S Hemmelgarn > <ahferroin7@gmail.com> wrote: >> And I realize of course right after sending this that my other reply didn't >> get through because GMail refuses to send mail in plain text, no matter how >> hard I beat it over the head... > > In the web browser version, to the right of the trash can for an email > being written, there is an arrow with a drop down menu that includes > "plain text mode" option which will work. This is often sticky, but > randomly with the btrfs list the replies won't have this option > checked and then they bounce. It's annoying. And then both the Gmail > and Inbox Android apps have no such option so it's not possible reply > to list emails from a mobile device short of changing mail clients > just for this purpose. I actually didn't know about the option in the drop down menu in the Web-UI, although that wouldn't have been particularly relevant in this case as I was replying from my phone. What's really annoying in that case is that the 'Reply Inline' option makes things _look_ like they're plain text, but they really aren't. I've considered getting a different mail app, but for some reason the only one I can find for Android that supports plain text e-mail is K-9 Mail, and I'm not too fond of the UI for that, and it takes way more effort to set up than I'm willing to put in for something I almost never use anyway (that and it doesn't (AFAICT) support S/MIME or Hashcash, although GMail doesn't either, so that one's not a show stopper). [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey 2015-10-20 4:45 ` Russell Coker @ 2015-10-20 18:54 ` Duncan 2015-10-20 19:48 ` Austin S Hemmelgarn 1 sibling, 1 reply; 15+ messages in thread From: Duncan @ 2015-10-20 18:54 UTC (permalink / raw) To: linux-btrfs james harvey posted on Tue, 20 Oct 2015 00:16:15 -0400 as excerpted: > Background ----- > > My fileserver had a "bad event" last week. Shut it down normally to add > a new hard drive, and it would no longer post. Tried about 50 times, > doing the typical everything non-essential unplugged, trying 1 of 4 > memory modules at a time, and 1 of 2 processors at a time. Got no > where. > > Inexpensive HP workstation, so purchased a used identical model > (complete other than hard drives) on eBay. Replacement arrived today. > Posts fine. Moved hard drives over (again, identical model, and Arch > Linux not Windows) and it started giving "Watchdog detected hard LOCKUP" > type errors I've never seen before. > > Decided I'd diagnose which part in the original server was bad. By > sitting turned off for a week, it suddenly started posting just fine. > But, with the hard drives back in it, I'm getting the same hard lockup > errors. > > An Arch ISO DVD runs stress testing perfectly. > > Btrfs-specific ----- > > The current problem I'm having must be a bad hard drive or corrupted > data. > > 3 drive btrfs RAID1 (data and metadata.) sda has 1GB of the 3GB of > data, and 1GB of the 1GB of metadata. > > sda appears to be going bad, with my low threshold of "going bad", and > will be replaced ASAP. It just developed 16 reallocated sectors, and > has 40 current pending sectors. > > I'm currently running a "btrfs scrub start -B -d -r /terra", which > status on another term shows me has found 32 errors after running for an > hour. > > Question 1 - I'm expecting if I re-run the scrub without the read-only > option, that it will detect from the checksum data which sector is > correct, and re-write to the drive with bad sectors the data to a new > sector. Correct? I actually ran a number of independent btrfs raid1 filesystems[1] on a pair of ssds, with one of the ssds slowly dying, with more and more reallocated sectors over time, for something like six months.[2] SMART started with a 254 "cooked" value for reallocated sectors, immediately dropped to what was apparently the percentage still good (still rounding to 100) on first sector replace (according to raw value), and dropped to about 85 (again, %) during the continued usage time, with a threshold value of IIRC 36, so I never came close on that value, tho the raw-read- error-rate value dropped into failing-now a couple times near the end, when I'd do scrubs and get dozens of reallocated sectors in just a few minutes, but it'd recover on reboot and report failing-in-the-past, and it wouldn't trip into failing mode unless I had the system off for awhile and then did a scrub of several of those independent btrfs in quick succession. Anyway, yes, as long as the other copy is good, btrfs scrub does fix up the problems without much pain beyond the wait time (which was generally under a minute per btrfs, all under 50 gig each, on the ssds). Tho I should mention: If btrfs returns any unverified errors, rerun the scrub again, and it'll likely fix more. 
I'm not absolutely sure what these actually are in btrfs terms, but I took them to be places where metadata checksum errors occurred, where that metadata in turn had checksums of data and metadata further down (up?) the tree, closer to the data. Only after those metadata blocks were scrubbed in an early pass, could a later pass actually verify their checksums and thus rely on the checksums they in turn contained, for metadata blocks closer to the data or for the data itself. Sometimes I'd end up rerunning scrub a few times (never more that five, IIRC, however), almost always correcting less errors each time, tho it'd occasionally jump up a bit for one pass, before dropping again on the one after that. But rerun scrubs returning unverified errors and you should eventually fix everything, assuming of course that the second copy is always valid. Obviously this was rather easier for me, however, at under a minute per filesystem scrub run and generally under 15 minutes total for the multiple runs on multiple filesystems (tho I didn't always verify /all/ btrfs, only the ones I normally mounted), than it's going to be for you, at over an hour reported and still going. At hours per run, it'll require some patience... I had absolutely zero scrub failures here, because as I said my second ssd was (and remains) absolutely solid). > Question 2 - Before having ran the scrub, booting off the raid with bad > sectors, would btrfs "on the fly" recognize it was getting bad sector > data with the checksum being off, and checking the other drives? Or, is > it expected that I could get a bad sector read in a critical piece of > operating system and/or kernel, which could be causing my lockup issues? "With the checksums being off" is unfortunately ambiguous. Do you mean with the nodatasum mount option and/or nocow set, so btrfs wasn't checksumming, or do you mean (as I assume you do) with the checksums on, but simply failing to verify due to the hardware errors? If you mean the first... if there's no checksum to verify, as would be the case with nocow files since that turns of checksumming as well... then btrfs, as most other filesystems, simply returns whatever it gets from the hardware, because it doesn't have checksums to verify it against. But no checksum stored normally only applies to data (and a few misc things like the free-space-cache, accounting for the non-zero no- checksums numbers you may see even if you haven't turned off cow or checksumming on anything); metadata is always checksummed. If you mean the second, "off" actually meaning "on but failing to verify", as I suspect you do, then yes, btrfs should always reach for the second copy when it finds the first one invalid. But tho I'm a user not a dev and thus haven't actually checked the source code itself, my believe here is with Russ and disagrees with Austin, as based on what I've read both on the wiki and seen here previously, btrfs runtime (that is, not during scrub) actually repairs the problem on- hardware as well, from that second copy, not just fetching it for use without the repair, the distinction between normal runtime error detection and scrub thus being that scrub systematically checks everything, while normal runtime on most systems will only check the stuff it reads in normal usage, thus getting the stuff that's regularly used, but not the stuff that's only stored and never read. 
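The rerun-until-clean pattern described above, spelled out (a sketch; stop early once the end-of-run summary no longer reports unverified errors -- the exact summary wording differs between btrfs-progs versions):

    # up to five repair passes; check the printed per-device error counts after each one
    for pass in 1 2 3 4 5; do
        btrfs scrub start -B -d /terra
    done
    btrfs scrub status /terra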
*WARNING*: From my experience at least, at least on initial mount, btrfs isn't particularly robust when the number of read errors on one device start to go up dramatically. Despite never seeing an error in scrub that it couldn't fix, twice I had enough reads fail on a mount that the mount itself failed and I couldn't mount successfully despite repeated attempts. In both cases, I was able to use btrfs restore to restore the contents of the filesystem to some other place (as it happens, the reiserfs on spinning rust I use for my media filesystem, since being for big media files, that had enough space to recover the as I said above reasonably small btrfs into), and ultimate recreating the filesystem using mkfs.btrfs. But given that despite not being able to mount, neither SMART nor dmesg ever mentioned anything about the "good" device having errors, I'm left to conclude that btrfs itself ultimately crashed on attempt to mount the filesystem, even tho only the one copy was bad. After a couple of those events I started scrubbing much more frequently, thus fixing the errors while btrfs could still mount the filesystem and /let/ me run a scrub. It was actually those more frequent scrubs that quickly became the hassle and lead me to give up on the device. If btrfs had been able to fall back to the second/valid copy even in that case, as it really should have done, then I would have very possibly waited quite a bit longer to replace the dying device. So on that one I'd say to be sure, get confirmation either directly from the code (if you can read it) or from a dev who has actually looked at it and is basing his post on that, tho I still /believe/ btrfs still runtime- corrects checksumming issues actually on-device, if there's a validating second copy it can use to do so. > Question 3 - Probably doesn't matter, but how can I see which files (or > metadata to files) the 40 current bad sectors are in? (On extX, > I'd use tune2fs and debugfs to be able to see this information.) Here, a read-only scrub seemed to print the path to the bad file -- when there was one, sometimes it was a metadata block and thus not specifically identifiable. Writable scrubs seemed to print the info sometimes but not always. I'm actually confused as to why, but I did specifically observe btrfs scrub printing path names in read-only mode, that it didn't always appear to print in the scrub output. I didn't look extremely carefully, however, or compare the outputs side-by-side, so maybe I just missed it in the writable/fix-it mode output. > I do have hourly snapshots, from when it was properly running, so once > I'm that far in the process, I can also compare the most recent > snapshots, and see if there's any changes that happened to files that > shouldn't have. Hourly snapshots: Note that btrfs has significant scaling issues with snapshots, etc, when the number reaches into the tens of thousands. If you're doing such scheduled snapshots (and not already doing scheduled thinning), the strong recommendation is to schedule reasonable snapshot thinning as well. Think about it. If you need to retrieve something from a snapshot a year ago, are you going to really know or care what specific hour it was? Unlikely. You'll almost certainly be just fine finding correct day, and a year out, you'll very possibly be just fine with weekly, monthly or even quarterly, and if they haven't been thinned all those many many hourly snapshots will simply make it harder to efficiently find and use one you actually need amongst all the "noise". 
So do hourly snapshots for say six hours (6, plus upto 6 more before the thin drops 5 of them, so 12 max), then thin to six-hourly. Keep your four-a-day-six-hourly snapshots for a couple days (8-12, plus the 6-12 for the last six hours, upto 24 total), and thin to 2-a-day-12-hourly. Keep those for a week and thin to daily (12-26, upto 50 total), and those for another week (6-13, upto 63) before dropping to weekly. That's two weeks of snapshots so far. Keep the weekly snapshots out to a quarter (13 weeks so 11 more, plus another 13 before thinning, 11-24, upto 87 total). At a quarter, you really should be thinking about proper non-snapshot full data backup, if you haven't before now, after which you can drop the older snapshots, thereby freeing extents that only the old snapshots were still referencing. But you'll want to keep a quarter's snapshots at all times so will continue to accumulate another 13 weeks of snapshots before you drop the quarter back. That's a total of 100 snapshots, max. At 100 snapshots per subvolume, you can have 10 subvolume's worth before hitting 1000 snapshots on the filesystem. A target of under 1000 snapshots per filesystem should keep scaling issues due to those snapshots to a minimum. If the 100 snapshots per subvolume snapshot thinning program I suggested above is too strict for you, try to keep it to say 250 per subvolume anyway, which would give you 8 subvolume's worth at the 2000 snapshot per filesystem target. I would definitely try to keep it below that, because between there and 10k the scaling issues take larger and larger bites out of your btrfs maintenance command (check, balance) efficiency, and the time to complete those commands will go up drastically. At 100k, the time for maintenance can be weeks, so it's generally easier to just kill it and restore from backup, if indeed your pain threshold hasn't already been reached at 10k. Hopefully it's not already a problem for you... 365 days @ 24 hours per day is already ~8700 snaps, so it could be if you've been running it a year and haven't thinned, even if there's just the single subvolume being snapshotted. Similarly, BTW, with btrfs quotas, except that btrfs quotas are still broken anyway, so unless you're actively working with the devs to test/ trace/fix them, either you need quota features and thus should be using a filesystem more stable and mature than btrfs where they work reliably, or you don't, so you can run btrfs while keeping quotas off. That'll dramatically reduce the overhead/tracking work btrfs has to do right there, eliminating both that overhead and any brokenness related to btrfs quota bugs in one whack. --- [1] A number of independent btrfs... on a pair of ssds, with the ssds partitioned up identically and multiple independent small btrfs, each on its own set of parallel partitions on the two ssds. Multiple independent btrfs instead of subvolumes or similar on a single filesystem, because I don't want all my data eggs in the same single filesystem basket, such that if that single filesystem goes down, everything goes with it. [2] Why continue to run a known-dying ssd for six months? Simple. The other ssd of the pair never had a single reallocated sector or indications of any other problems the entire time, and btrfs' checksumming and data integrity features, along with backups, gave me a chance to actually play with the dying ssd for a few months without risking real data loss. 
And I had never had that opportunity before and was curious to see how the problem would develop over time, plus it gave me some real useful experience with btrfs raid1 scrubs and recoveries. So I took the opportunity that presented itself. =:^) Eventually, however, I was scrubbing and correcting significant errors after every shutdown of hours and/or after every major system update, and by then the novelty had worn off, so I eventually just gave up and did the btrfs replace to another ssd I had as a spare the entire time. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
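If the hourly snapshots are managed by snapper (an assumption -- the original post does not say which tool takes them), a retention policy in the spirit of Duncan's thinning schedule can be set with something like the following, assuming a snapper config named terra already exists:

    snapper -c terra set-config TIMELINE_CREATE=yes TIMELINE_CLEANUP=yes
    snapper -c terra set-config TIMELINE_LIMIT_HOURLY=6 TIMELINE_LIMIT_DAILY=14 TIMELINE_LIMIT_WEEKLY=13 TIMELINE_LIMIT_MONTHLY=3 TIMELINE_LIMIT_YEARLY=0

Any scheme that keeps the per-subvolume snapshot count in the low hundreds achieves the same goal.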
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 18:54 ` Duncan @ 2015-10-20 19:48 ` Austin S Hemmelgarn 2015-10-20 21:24 ` Duncan 0 siblings, 1 reply; 15+ messages in thread From: Austin S Hemmelgarn @ 2015-10-20 19:48 UTC (permalink / raw) To: Duncan, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 5409 bytes --] On 2015-10-20 14:54, Duncan wrote: > But tho I'm a user not a dev and thus haven't actually checked the source > code itself, my believe here is with Russ and disagrees with Austin, as > based on what I've read both on the wiki and seen here previously, btrfs > runtime (that is, not during scrub) actually repairs the problem on- > hardware as well, from that second copy, not just fetching it for use > without the repair, the distinction between normal runtime error > detection and scrub thus being that scrub systematically checks > everything, while normal runtime on most systems will only check the > stuff it reads in normal usage, thus getting the stuff that's regularly > used, but not the stuff that's only stored and never read. > > *WARNING*: From my experience at least, at least on initial mount, btrfs > isn't particularly robust when the number of read errors on one device > start to go up dramatically. Despite never seeing an error in scrub that > it couldn't fix, twice I had enough reads fail on a mount that the mount > itself failed and I couldn't mount successfully despite repeated > attempts. In both cases, I was able to use btrfs restore to restore the > contents of the filesystem to some other place (as it happens, the > reiserfs on spinning rust I use for my media filesystem, since being for > big media files, that had enough space to recover the as I said above > reasonably small btrfs into), and ultimate recreating the filesystem > using mkfs.btrfs. > > But given that despite not being able to mount, neither SMART nor dmesg > ever mentioned anything about the "good" device having errors, I'm left > to conclude that btrfs itself ultimately crashed on attempt to mount the > filesystem, even tho only the one copy was bad. After a couple of those > events I started scrubbing much more frequently, thus fixing the errors > while btrfs could still mount the filesystem and /let/ me run a scrub. > It was actually those more frequent scrubs that quickly became the hassle > and lead me to give up on the device. If btrfs had been able to fall > back to the second/valid copy even in that case, as it really should have > done, then I would have very possibly waited quite a bit longer to > replace the dying device. > > So on that one I'd say to be sure, get confirmation either directly from > the code (if you can read it) or from a dev who has actually looked at it > and is basing his post on that, tho I still /believe/ btrfs still runtime- > corrects checksumming issues actually on-device, if there's a validating > second copy it can use to do so. > FWIW, my assessment is based on some testing I did a while back (kernel 3.14 IIRC) using a VM. The (significantly summarized of course) procedure I used was: 1. Create a basic minimalistic Linux system in a VM (in my case, I just used a stage3 tarball for Gentoo, with a paravirtuaized Xen domain) using BTRFS as the root filesystem with a raid1 setup. Make sure and verify that it actually boots. 2. 
Shutdown the VM, use btrfs-progs on the host to find the physical location of an arbitrary file (ideally one that is not touched at all during the boot process; IIRC, I used one of the e2fsprogs binaries), and then intentionally clear the CRC in one of the copies of a block from the file. 3. Boot the VM, read the file. 4. Shutdown the VM again. 5. Verify whether the file block you cleared the checksum on has a valid checksum now. I repeated this more than a dozen times using different files and different methods of reading the file, and each time the CRC I had cleared was untouched. Based on this, unless BTRFS does some kind of deferred re-write that doesn't get forced during a clean unmount of the FS, I felt it was relatively safe to conclude that it did not automatically fix corrupted blocks. I did not, however, test corrupting the block itself instead of the checksum, but I doubt that that would impact anything in this case. As I mentioned, many veteran sysadmins would want the option to disable automatic fixing in the FS driver, or at least to get some kind of notification when it happens. This preference largely dates back to traditional RAID1, where the system has no way to know for certain which copy is correct in the case of a mismatch, and therefore to safely fix mismatches, the admin needs to intervene. While it is possible to fix this safely because of how BTRFS is designed, there is still the possibility of it getting things wrong. There was one time I had a BTRFS raid1 filesystem where one copy of a block got corrupted but miraculously had a correct CRC (which should be statistically next to impossible), and the other copy of the block was correct, but the CRC for it was wrong (which, while unlikely, is very much possible). In such a case (which was a serious pain to debug), automatically 'fixing' the supposedly bad block would have resulted in data loss. Of course, the chance of that happening more than once in a lifetime is astronomically small, but it is still possible. It's also worth noting that ZFS has been considered mature for more than a decade now, and the ZFS developers _still_ aren't willing to risk their users' data with something like this, which should be an immediate red flag for anyone developing a filesystem with features like ZFS. [-- Attachment #2: S/MIME Cryptographic Signature --] [-- Type: application/pkcs7-signature, Size: 3019 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
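For anyone wanting to reproduce a test along the lines of the procedure above, a rough sketch follows. It is not Austin's exact method: it corrupts one copy of the data block itself (easier to do with plain dd) rather than clearing the csum item, and the device name /dev/vdb, the mount point /mnt, and the file path are all assumptions. Deriving the raw device offset from the btrfs logical address is deliberately left out.

    # While the filesystem is mounted, find the btrfs logical address of the
    # test file's first extent (on btrfs, filefrag's "physical" column is the
    # filesystem logical address, not a raw device offset)
    filefrag -v /mnt/usr/bin/testfile

    # Shut the VM down / unmount, then overwrite ONE copy of that block on ONE
    # device.  OFFSET must first be set to the raw byte offset on /dev/vdb;
    # mapping the logical address to it via the chunk tree (e.g. with
    # btrfs inspect-internal dump-tree -t chunk in current btrfs-progs) is
    # outside the scope of this sketch.
    dd if=/dev/urandom of=/dev/vdb bs=4096 count=1 seek=$((OFFSET / 4096)) conv=notrunc

    # Mount/boot again and read the file: the read should succeed from the
    # good mirror while the bad copy shows up as a csum failure in dmesg
    cat /mnt/usr/bin/testfile > /dev/null
    dmesg | grep -i 'csum failed'

    # Then check whether the bad copy was rewritten on disk, e.g. with a
    # read-only scrub (reports errors without fixing them)
    btrfs scrub start -Bdr /mnt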
* Re: Expected behavior of bad sectors on one drive in a RAID1 2015-10-20 19:48 ` Austin S Hemmelgarn @ 2015-10-20 21:24 ` Duncan 0 siblings, 0 replies; 15+ messages in thread From: Duncan @ 2015-10-20 21:24 UTC (permalink / raw) To: linux-btrfs Austin S Hemmelgarn posted on Tue, 20 Oct 2015 15:48:07 -0400 as excerpted: > FWIW, my assessment is based on some testing I did a while back (kernel > 3.14 IIRC) using a VM. The (significantly summarized of course) > procedure I used was: > 1. Create a basic minimalistic Linux system in a VM (in my case, I just > used a stage3 tarball for Gentoo, with a paravirtualized Xen domain) > using BTRFS as the root filesystem with a raid1 setup. Make sure and > verify that it actually boots. > 2. Shutdown the VM, use btrfs-progs on the host to find the physical > location of an arbitrary file (ideally one that is not touched at all > during the boot process; IIRC, I used one of the e2fsprogs > binaries), and then intentionally clear the CRC in one of the copies of > a block from the file. > 3. Boot the VM, read the file. > 4. Shutdown the VM again. > 5. Verify whether the file block you cleared the checksum on has a valid > checksum now. > > I repeated this more than a dozen times using different files and > different methods of reading the file, and each time the CRC I had > cleared was untouched. Based on this, unless BTRFS does some kind of > deferred re-write that doesn't get forced during a clean unmount of the > FS, I felt it was relatively safe to conclude that it did not > automatically fix corrupted blocks. I did not, however, test corrupting > the block itself instead of the checksum, but I doubt that that would > impact anything in this case. AFAIK: 1) It would only run into the corruption if the raid1 read-scheduler picked that copy based on the even/odd of the requesting PID. However, statistically that should be a 50% hit rate, and if you tested more than a dozen times, you'd need quite the luck to fail to hit it on at least /one/ of them. 2) (Based on what I understood from the discussion of btrfs check's init-csum-tree patches a couple cycles ago, before which it was clearing but not reinitializing...) Btrfs interprets missing checksums differently than invalid checksums. Would your "cleared" CRC be interpreted as invalid or missing? If missing, AFAIK it would leave it missing. In which case corrupting the data block itself would indeed have had a different result than "clearing" the csum, tho simply corrupting the csum should have resulted in an update. However, by actually testing you've gone farther than I have, and pending further info to the contrary, I'll yield to that, changing my own thoughts on the matter as well, to "I formerly thought... but someone's testing some versions ago anyway suggested otherwise, so being too lazy to actually do my own testing, I'll cautiously agree with the results of his." =:^) Thanks. I'd rather find out I was wrong than not find out! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 15+ messages in thread
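Regarding point 1) above: if the worry is that the 50% PID-parity read scheduler kept steering reads away from the corrupted mirror, one way to be reasonably sure both copies actually get read is to repeat the read from freshly spawned processes and drop the page cache in between. A minimal sketch (run as root; the file path is assumed):

    # Each subshell gets a new PID, so over a handful of iterations both even
    # and odd PID parities -- and hence both raid1 mirrors, on kernels that
    # pick the copy by PID parity -- should be exercised.
    for i in $(seq 8); do
        sync
        echo 3 > /proc/sys/vm/drop_caches    # force the next read to hit the disk
        ( cat /mnt/usr/bin/testfile > /dev/null )
    done
    dmesg | grep -i csum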
end of thread, other threads:[~2015-10-21 17:29 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2015-10-20 4:16 Expected behavior of bad sectors on one drive in a RAID1 james harvey 2015-10-20 4:45 ` Russell Coker 2015-10-20 13:00 ` Austin S Hemmelgarn 2015-10-20 13:15 ` Russell Coker 2015-10-20 13:59 ` Austin S Hemmelgarn 2015-10-20 19:20 ` Duncan 2015-10-20 19:59 ` Austin S Hemmelgarn 2015-10-20 20:54 ` Tim Walberg 2015-10-21 11:51 ` Austin S Hemmelgarn 2015-10-21 12:07 ` Austin S Hemmelgarn 2015-10-21 16:01 ` Chris Murphy 2015-10-21 17:28 ` Austin S Hemmelgarn 2015-10-20 18:54 ` Duncan 2015-10-20 19:48 ` Austin S Hemmelgarn 2015-10-20 21:24 ` Duncan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).