* dear developers, can we have notdatacow + checksumming, plz? @ 2015-12-14 4:59 Christoph Anton Mitterer 2015-12-14 6:42 ` Russell Coker 2015-12-14 14:16 ` Austin S. Hemmelgarn 0 siblings, 2 replies; 12+ messages in thread From: Christoph Anton Mitterer @ 2015-12-14 4:59 UTC (permalink / raw) To: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 10291 bytes --] (consider that question being asked with that face on: http://goo.gl/LQaOuA) Hey. I've had some discussions on the list these days about not having checksumming with nodatacow (mostly with Hugo and Duncan). They both basically told me it wouldn't be straight possible with CoW, and Duncan thinks it may not be so much necessary, but none of them could give me really hard arguments, why it cannot work (or perhaps I was just too stupid to understand them ^^)... while at the same time I think that it would be generally utmost important to have checksumming (real world examples below). Also, I remember that in 2014, Ted Ts'o told me that there are some plans ongoing to get data checksumming into ext4, with possibly even some guy at RH actually doing it sooner or later. Since these threads were rather admin-work-centric, developers may have skipped it, therefore, I decided to write down some thoughts&ideas label them with a more attracting subject and give it some bigger attention. O:-) 1) Motivation why, it makes sense to have checksumming (especially also in the nodatacow case) I think of all major btrfs features I know of (apart from the CoW itself and having things like reflinks), checksumming is perhaps the one that distinguishes it the most from traditional filesystems. Sure we have snapshots, multi-device support and compression - but we could have had that as well with LVM and software/hardware RAID... (and ntfs supported compression IIRC ;) ). Of course, btrfs does all that in a much smarter way, I know, but it's nothing generally new. 
The *data* checksumming at filesystem level, to my knowledge, is however. Especially that it's always verified. Awesome. :-)

When one starts to get a bit deeper into btrfs (from the admin/end-user side) one sooner or later stumbles across the recommendation/need to use nodatacow for certain types of data (DBs, VM images, etc.), the reason, AFAIU, being the inherent fragmentation that comes along with CoW, which is especially noticeable for those types of files with lots of random internal writes.

Now Duncan implied that this could improve in the future, with the auto-defragmentation getting (even) better, defrag becoming usable again for those that do snapshots or reflinked copies, and btrfs itself generally maturing more and more.
But I kinda wonder to what extent one will really be able to solve what seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still mean that such files, like big DBs, VMs, or scientific datasets that are internally rewritten, may get more or less constantly defragmented.
That may be quite undesired...
a) for performance reasons (when I consider our research software, which often has IO as the limiting factor and where we want as much IO as possible being used by the actual programs)...
b) SSDs...
Not really sure about that; btrfs seems to enable the autodefrag even when an SSD is detected... what is it doing? Placing the blocks in a smart way on different chips so that accesses can be better parallelised by the controller?
Anyway, (a) could already be argument enough not to solve the problem with a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great and not just a workaround till everything else gets better at handling these cases.
Thus, checksumming, which is such a vital feature, should also be possible for that.
Duncan also mentioned that in some of those cases, the integrity is already protected by the application layer, making it less important to have it at the fs-layer.
Well, this may be true for file-sharing protocols, but I wouldn't know that relational DBs really do checksumming of the data.
They have journals, of course, but these protect against crashes, not against silent block errors and the like.
And I wouldn't know that VM hypervisors do checksumming (but perhaps I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for LHC at work/university.
We have large amounts of storage (perhaps not as large as what Google and Facebook have, or what the NSA stores about us)... but it's still some ~2 PiB, or a bit more.
That's managed with some special storage management software called dCache. dCache even stores checksums, but per file, which means that for normal reads these cannot be verified (well, technically it's supported, but with our usual file sizes this doesn't work in practice), so what remains are scrubs.
For the two PiB, we have roughly 50-60 nodes, each with something between 12 and 24 disks, usually in either one or two RAID6 volumes, all different kinds of hard disks.
And we run these scrubs quite rarely, since they cost IO that could be used for actual computing jobs (a problem that wouldn't exist with how btrfs verifies the sums on read, since the data is being read anyway)... so likely there are even more errors that are just never noticed, because the datasets are removed again before being scrubbed.

Long story short, it does happen every now and then that a scrub shows file errors where neither the RAID was broken, nor were there any block errors reported by the disks, nor anything suspicious in SMART.
In other words, silent block corruption.
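As an illustration of the per-file versus per-block distinction made above, here is a minimal Python sketch. It is a toy model and not how dCache or btrfs are implemented: `BLOCK_SIZE`, the helper names, and the use of `zlib.crc32` are all assumptions made for the example (btrfs defaults to CRC-32C, which is not in Python's standard library). The point it tries to show is only that a per-block checksum can be checked on any partial read, while a per-file checksum needs the whole file to be read.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size for this toy model


def block_checksums(data: bytes) -> list:
    """Return one CRC-32 per BLOCK_SIZE chunk of data."""
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def read_block(data: bytes, sums: list, index: int) -> bytes:
    """Read one block and verify only that block's checksum."""
    block = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    if zlib.crc32(block) != sums[index]:
        raise IOError(f"checksum mismatch in block {index}")
    return block


def verify_whole_file(data: bytes, file_sum: int) -> bool:
    """Per-file check: needs every byte of the file, i.e. a scrub."""
    return zlib.crc32(data) == file_sum


original = bytes(range(256)) * 64      # 16 KiB, i.e. four blocks
sums = block_checksums(original)
file_sum = zlib.crc32(original)

# Flip one bit in block 2 to imitate silent corruption.
damaged = bytearray(original)
damaged[2 * BLOCK_SIZE + 10] ^= 0x01
corrupted = bytes(damaged)
```

With this model, reading block 0 of `corrupted` succeeds, reading block 2 raises an error at read time, and the per-file check only notices the damage once the entire file has been read back.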
One may rely on the applications to do integrity protection, but I think that's not realistic, and perhaps that shouldn't be their task anyway (at least not when it's about storage device block errors and that like). I don't think it's on the horizon that things like DBs or large scientific data files do their own integrity protection (i.e. one that protects against bad blocks, and not just journalling that preserves consistency in case of crashes). And handling that on the fs level is anyway quite nice, I think. It doesn't mean that countless applications need to handle this on the application layer, making it configurable whether it should be enabled (for integrity protection) or disabled (for more speed), each of them writing a lot of code for that. If we can control that on the fs layer, by setting datasum/nodatasum, all needed is already there - except, that as of now, nodatacowed stuff is excluded in btrfs. 2) Technical Okay the following is obviously based on my naive view of how things could work, which may not necessarily go well with how an actual fs developer sees things ;-) As said in the introduction, I can't quite believe that data checksumming should in principle be possible for ext4, but not for btrfs non-CoWed parts. Duncan&Hugo said, the reason is basically it cannot do checksums with no-CoW, because there's no guarantee that the fs doesn't end up inconsistently... But, AFAIU, not doing CoW, while not having a journal (or does it have one for these cases???) almost certainly means that the data (not necessarily the fs) will be inconsistent in case of a crash during a no-CoWed write anyway, right? Wouldn't it be basically like ext2? Or we have the case of multi-device, e.g. RAID1, multiple copies of the same blocks, a crash has happened during writing such (no-CoWed and no- checksummed)... 
Again it's almost certainly that at least one (maybe even both) of the blocks contains garbage and likely (at least a 50% chance) we get that one when the actual read happens later (I was told btrfs would behave in these cases like e.g MD RAID does,... deliver what the first readable block said). If btrfs would calculate checksums and write them e.g. after or before the actual data was written,... what would be the worst that could happen (in my naive understanding of course ;-) ) at a crash? - I'd say either one is lucky, and checksum and data matches. Yay. - Or it doesn't match, which could boil down to the following two cases: - the data wasn't written out correctly and is actually garbage => then we can be happy, that the checksum wouldn't match and we'd get an error - the data was written out correctly, but before the csum was written the system crashed, so the csum would now tell us that the block is bad, while in reality it isn't. or the other way round: the csum was written out (completely)... and no data was written at all before the system crashed (so the old block would be still completely there) => in both cases: so what? Having that particular case happening is probably far less likely, than csumming actually detecting a bad block, or not completely written data in case of a crash. (Not to talk about all the cases where nothing crashes, and where we simply would want to detect block errors, bus errors, etc.) => Of course it wouldn't be as nice as in CoW, where it could simply take the most recent consistent state of that block, but still way better than: - delivering bogus data to the application in n other cases - not being able to decide which of m block copies is valid, if a RAID is scrubbed And as said before, AFAIU, nodatacow'ed files have no journal in btrfs as in ext3/4, so it's basically anyway that such files, when written during a crash, may end up in any state, right? 
Which makes not having a csum sound even worse, since nothing tells that this file is possibly bad. Not having checksumming seems to be especially bad in the multi-device case... what happens when one runs a scrub? AFAIU, it simply does what e.g. MD does: taking the first readable block, writing it to any others, thereby possibly destroying the actually good one? Not sure about whether the following would make any practical sense: If data checksumming would work for nodatacow, then maybe some people may even choose to run btrfs in CoW1 mode,.. they still could have most fancy features from btrfs (checksumming, snapshots, perhaps even refcopy?) but unless snapshots or refcopies are explicitly made, btrfs doesn't do CoW. Well, thanks for spending (hopefully not wasting ;-) ) your time on reading my X-Mas wish ;) Cheers, Chris. [-- Attachment #2: smime.p7s --] [-- Type: application/x-pkcs7-signature, Size: 5313 bytes --] ^ permalink raw reply [flat|nested] 12+ messages in thread
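The crash cases enumerated in the message above can be sketched as a small Python simulation. This is a toy model under stated assumptions, not btrfs behaviour: the block is overwritten in place in two halves, the checksum is a separate write, the checksum write itself is assumed to be atomic, and the step names and the `crash_outcomes` helper are hypothetical. It lists, for a crash after each step, whether the block is old, torn, or new, and whether the stored checksum matches.

```python
import zlib


def crash_outcomes(old: bytes, new: bytes, order: list) -> list:
    """For a crash after 0..len(order) steps, return (steps, state, csum_ok)."""
    half = len(new) // 2
    results = []
    for steps_done in range(len(order) + 1):
        data, csum = old, zlib.crc32(old)   # state before the overwrite
        for step in order[:steps_done]:
            if step == "data_first_half":
                data = new[:half] + data[half:]
            elif step == "data_second_half":
                data = data[:half] + new[half:]
            elif step == "csum":
                csum = zlib.crc32(new)
        if data == old:
            state = "old"
        elif data == new:
            state = "new"
        else:
            state = "torn"
        results.append((steps_done, state, zlib.crc32(data) == csum))
    return results


old, new = b"AAAA", b"BBBB"
data_first = crash_outcomes(
    old, new, ["data_first_half", "data_second_half", "csum"])
csum_first = crash_outcomes(
    old, new, ["csum", "data_first_half", "data_second_half"])
```

In this model a torn block never passes verification. What can happen is a mismatch on a block that is in fact intact (fully old or fully new), which is the "so what?" case from the message and the false-positive concern raised in the replies.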
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-14 4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer @ 2015-12-14 6:42 ` Russell Coker 2015-12-15 1:02 ` Christoph Anton Mitterer 2015-12-14 14:16 ` Austin S. Hemmelgarn 1 sibling, 1 reply; 12+ messages in thread From: Russell Coker @ 2015-12-14 6:42 UTC (permalink / raw) To: Christoph Anton Mitterer; +Cc: linux-btrfs On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote: > I've had some discussions on the list these days about not having > checksumming with nodatacow (mostly with Hugo and Duncan). > > They both basically told me it wouldn't be straight possible with CoW, > and Duncan thinks it may not be so much necessary, but none of them > could give me really hard arguments, why it cannot work (or perhaps I > was just too stupid to understand them ^^)... while at the same time I > think that it would be generally utmost important to have checksumming > (real world examples below). My understanding of BTRFS is that the metadata referencing data blocks has the checksums for those blocks, then the blocks which link to that metadata (EG directory entries referencing file metadata) has checksums of those. For each metadata block there is a new version that is eventually linked from a new version of the tree root. This means that the regular checksum mechanisms can't work with nocow data. A filesystem can have checksums just pointing to data blocks but you need to cater for the case where a corrupt metadata block points to an old version of a data block and matching checksum. The way that BTRFS works with an entire checksumed tree means that there's no possibility of pointing to an old version of a data block. The NetApp published research into hard drive errors indicates that they are usually in small numbers and located in small areas of the disk. 
So if BTRFS had a nocow file with any storage method other than dup, you would have metadata and file data far enough apart that they are not likely to be hit by the same corruption (and the same thing would apply with most Ext4 inode tables and data blocks).

I think that a file mode where there were checksums on data blocks with no checksums on the metadata tree would be useful. But it would require a moderate amount of coding and there are lots of other things that the developers are working on.

--
My Main Blog http://etbe.coker.com.au/
My Documents Blog http://doc.coker.com.au/

^ permalink raw reply [flat|nested] 12+ messages in thread
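The point about a fully checksummed tree in the message above can be illustrated with a small Python sketch. It is a deliberately simplified model and not the btrfs on-disk format: a "metadata block" here is just a list of data-block checksums, the "root" is a single checksum of that metadata block, SHA-256 stands in for the real checksum, and all names are hypothetical. It shows why a stale metadata block that points to an old data block with a matching checksum is caught only when the parent's checksum is checked as well.

```python
import hashlib


def h(blob: bytes) -> str:
    """Checksum helper (SHA-256 is an arbitrary choice for the sketch)."""
    return hashlib.sha256(blob).hexdigest()


def make_meta(data_blocks: list) -> bytes:
    """Toy metadata block: the checksums of the data blocks it references."""
    return "\n".join(h(b) for b in data_blocks).encode()


def verify(root_sum: str, meta: bytes, data_blocks: list,
           check_parent: bool = True) -> bool:
    """Verify data against metadata, and optionally metadata against root."""
    if check_parent and h(meta) != root_sum:
        return False
    sums = meta.decode().split("\n")
    return (len(sums) == len(data_blocks)
            and all(h(b) == s for b, s in zip(data_blocks, sums)))


old_blocks = [b"version 1 of block 0", b"block 1"]
new_blocks = [b"version 2 of block 0", b"block 1"]

old_meta = make_meta(old_blocks)
new_meta = make_meta(new_blocks)
root_sum = h(new_meta)  # the current root commits to the new metadata

# Stale metadata plus stale data are self-consistent on their own...
stale_ok_without_parent = verify(root_sum, old_meta, old_blocks,
                                 check_parent=False)
# ...but are rejected once the parent's checksum is part of the check.
stale_ok_with_parent = verify(root_sum, old_meta, old_blocks)
current_ok = verify(root_sum, new_meta, new_blocks)
```

Checksums that only point at data blocks correspond to the `check_parent=False` case, which is the situation the message says a filesystem would need to cater for.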
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-14 6:42 ` Russell Coker @ 2015-12-15 1:02 ` Christoph Anton Mitterer 0 siblings, 0 replies; 12+ messages in thread From: Christoph Anton Mitterer @ 2015-12-15 1:02 UTC (permalink / raw) To: Russell Coker; +Cc: linux-btrfs [-- Attachment #1: Type: text/plain, Size: 3002 bytes --] On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote: > My understanding of BTRFS is that the metadata referencing data > blocks has the > checksums for those blocks, then the blocks which link to that > metadata (EG > directory entries referencing file metadata) has checksums of those. You mean basically, that all metadata is chained, right? > For each > metadata block there is a new version that is eventually linked from > a new > version of the tree root. > > This means that the regular checksum mechanisms can't work with nocow > data. A > filesystem can have checksums just pointing to data blocks but you > need to > cater for the case where a corrupt metadata block points to an old > version of > a data block and matching checksum. The way that BTRFS works with an > entire > checksumed tree means that there's no possibility of pointing to an > old > version of a data block. Hmm I'm not sure whether I understand that (or better said, I'm probably sure I don't :D). AFAIU, the metadata is always CoWed, right? So when a nodatacow file is written, I'd assume it's mtime was update, which already leads to CoWing of metadata... just that now, the checksums should be written as well. If the metadata block is corrupt, then should that be noticed via the csums on that? And you said "The way that BTRFS works with an entire checksumed tree means that there's no possibility of pointing to an old version of a data block."... how would that work for nodatacow'ed blocks? If there is a crash it cannot know whether it was still the old block or the new one or any garbage in between?! 
> The NetApp published research into hard drive errors indicates that they are
> usually in small numbers and located in small areas of the disk. So if BTRFS
> had a nocow file with any storage method other than dup you would have
> metadata and file data far enough apart that they are not likely to be hit by
> the same corruption (and the same thing would apply with most Ext4 Inode
> tables and data blocks).
Well, put aside any such research (whose results aren't guaranteed to always hold)... that's just one of the reasons from my motivation for why I said checksums for no-CoWed files would be great (I used the multi-device example though, not DUP).

> I think that a file mode where there were checksums on data blocks with no
> checksums on the metadata tree would be useful. But it would require a
> moderate amount of coding
Do you mean in general, or having this as a mode for nodatacow'ed files?
Losing the metadata checksumming doesn't seem really much more appealing than not having data checksumming :-(

> and there's lots of other things that the developers are working on.
Sure, I just wanted to bring this to their attention... I already imagined that they wouldn't drop their current work to do that, just because of me whining for it ;-)

Thanks,
Chris.

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5313 bytes --]

^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-14 4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer 2015-12-14 6:42 ` Russell Coker @ 2015-12-14 14:16 ` Austin S. Hemmelgarn 2015-12-15 3:15 ` Christoph Anton Mitterer 1 sibling, 1 reply; 12+ messages in thread From: Austin S. Hemmelgarn @ 2015-12-14 14:16 UTC (permalink / raw) To: Christoph Anton Mitterer, linux-btrfs On 2015-12-13 23:59, Christoph Anton Mitterer wrote: > (consider that question being asked with that face on: http://goo.gl/LQaOuA) > > Hey. > > I've had some discussions on the list these days about not having > checksumming with nodatacow (mostly with Hugo and Duncan). > > They both basically told me it wouldn't be straight possible with CoW, > and Duncan thinks it may not be so much necessary, but none of them > could give me really hard arguments, why it cannot work (or perhaps I > was just too stupid to understand them ^^)... while at the same time I > think that it would be generally utmost important to have checksumming > (real world examples below). > > Also, I remember that in 2014, Ted Ts'o told me that there are some > plans ongoing to get data checksumming into ext4, with possibly even > some guy at RH actually doing it sooner or later. > > Since these threads were rather admin-work-centric, developers may have > skipped it, therefore, I decided to write down some thoughts&ideas > label them with a more attracting subject and give it some bigger > attention. > O:-) > > > > > 1) Motivation why, it makes sense to have checksumming (especially also > in the nodatacow case) > > > I think of all major btrfs features I know of (apart from the CoW > itself and having things like reflinks), checksumming is perhaps the > one that distinguishes it the most from traditional filesystems. > > Sure we have snapshots, multi-device support and compression - but we > could have had that as well with LVM and software/hardware RAID... 
(and > ntfs supported compression IIRC ;) ). > Of course, btrfs does all that in a much smarter way, I know, but it's > nothing generally new. > The *data* checksumming at filesystem level, to my knowledge, is > however. Especially that it's always verified. Awesome. :-) > > > When one starts to get a bit deeper into btrfs (from the admin/end-user > side) one sooner or later stumbles across the recommendation/need to > use nodatacow for certain types of data (DBs, VM images, etc.) and the > reason, AFAIU, being the inherent fragmentation that comes along with > the CoW, which is especially noticeable for those types of files with > lots of random internal writes. It is worth pointing out that in the case of DB's at least, this is because at least some of the do COW internally to provide the transactional semantics that are required for many workloads. > > Now duncan implied, that this could improve in the future, with the > auto-defragmentation getting (even) better, defrag becoming usable > again for those that do snapshots or reflinked copies and btrfs itself > generally maturing more and more. > But I kinda wonder to what extent one will be really able to solve > that, what seems to me a CoW-inherent "problem",... > Even *if* one can make the auto-defrag much smarter, it would still > mean that such files, like big DBs, VMs, or scientific datasets that > are internally rewritten, may get more or less constantly defragmented. > That may be quite undesired... > a) for performance reasons (when I consider our research software which > often has IO as the limiting factor and where we want as much IO being > used by actual programs as possible)... There are other things that can be done to improve this. 
I would assume of course that you're already doing some of them (stuff like using dedicated storage controller cards instead of the stuff on the motherboard), but some things often get overlooked, like actually taking the time to fine-tune the I/O scheduler for the workload (Linux has particularly brain-dead default settings for CFQ, and the deadline I/O scheduler is only good in hard-real-time usage or on small hard drives that actually use spinning disks).
> b) SSDs...
> Not really sure about that; btrfs seems to enable the autodefrag even
> when an SSD is detected,... what is it doing? Placing the block in a
> smart way on different chips so that accesses can be better
> parallelised by the controller?
This really isn't possible with an SSD. Except for NVMe and Open Channel SSDs, they use the same interfaces as a regular hard drive, which means you get absolutely no information about the data layout on the device. The big argument for defragmenting an SSD is that it means you need fewer I/O requests to the device to read a file, and in most cases, the device will outlive its usefulness because of performance long before it dies due to wearing out the flash storage.
> Anyway, (a) is could be already argument enough, not to run solve the
> problem by a smart-[auto-]defrag, should that actually be implemented.
>
> So I think having notdatacow is great and not just a workaround till
> everything else gets better to handle these cases.
> Thus, checksumming, which is such a vital feature, should also be
> possible for that.
The problem is not entirely the lack of COW semantics, it's also the fact that it's impossible to implement an atomic write on a hard disk.
If we could tell the disk 'ensure that this set of writes either all happen, or none of them happen', then we could do checksumming without using COW in the filesystem safely, except that that would require the disk to either do COW, or use the block level equivalent of a log structured filesystem, thus pushing the issue further down the storage stack. > > > Duncan also mention that in some of those cases, the integrity is > already protected by the application layer, making it less important to > have it at the fs-layer. > Well, this may be true for file-sharing protocols, but I wouldn't know > that relational DBs really do cheksuming of the data. All the ones I know of except GDBM and BerkDB do in fact provide the option of checksumming. It's pretty much mandatory if you want to be considered for usage in financial, military, or medical applications. > They have journals, of course, but these protect against crashes, not > against silent block errors and that like. > And I wouldn't know that VM hypervisors do checksuming (but perhaps > I've just missed that). > > Here I can give a real-world example, from the Tier-2 that I run for > LHC at work/university. > We have large amounts of storage (perhaps not as large as what Google > and Facebook have, or what the NSA stores about us)... but it's still > some ~ 2PiB, or a bit more. > That's managed with some special storage management software called > dCache. dCache even stores checksums, but per file, so that means for > normal reads, these cannot be verified (well technically it's > supported, but with our usual file sizes, this is not working) so what > remains are scrubs. > For The two PiB, we have some... roughly 50-60 nodes, each with > something between 12 and 24 disks, usually in either one or two RAID6 > volumes, all different kinds of hard disks. 
> And we do run these scrubs quite rarely, since it costs IO that could > be used for actual computing jobs (a problem that wouldn't be there > with how btrfs calculates the sums on read, the data is then read > anyway)... so likely there are even more errors that are just never > noticed, because the datasets are removed again, before being scrubbed. > > > Long story short, it does happen every now and then, that a scrub shows > file errors, for neither the RAID was broken, nor there were any block > errors reported by the disks, or anything suspicious in SMART. > In other words, silent block corruption. Or a transient error in system RAM that ECC didn't catch, or a undetected error in the physical link layer to the disks, or an error in the disk cache or controller, or any number of other things. BTRFS could only protect against some cases, not all (for example, if you have a big enough error in RAM that ECC doesn't catch it, you've got serious issues that just about nothing short of a cold reboot can save you from). > > One may rely on the applications to do integrity protection, but I > think that's not realistic, and perhaps that shouldn't be their task > anyway (at least not when it's about storage device block errors and > that like). That depends, if the application has data safety requirements above and beyond what the OS can provide, then it very much is their job to ensure those requirements are met. > > I don't think it's on the horizon that things like DBs or large > scientific data files do their own integrity protection (i.e. one that > protects against bad blocks, and not just journalling that preserves > consistency in case of crashes). Actually, a lot of them do in fact do this (or at least, many database systems do), precisely because most existing filesystems don't provide guarantees of data consistency without a ridiculous hit to performance. > And handling that on the fs level is anyway quite nice, I think. 
> It doesn't mean that countless applications need to handle this on the > application layer, making it configurable whether it should be enabled > (for integrity protection) or disabled (for more speed), each of them > writing a lot of code for that. > If we can control that on the fs layer, by setting datasum/nodatasum, > all needed is already there - except, that as of now, nodatacowed stuff > is excluded in btrfs. > > > > > > 2) Technical > > > Okay the following is obviously based on my naive view of how things > could work, which may not necessarily go well with how an actual fs > developer sees things ;-) > > As said in the introduction, I can't quite believe that data > checksumming should in principle be possible for ext4, but not for > btrfs non-CoWed parts. Except that for this to work safely, ext4 would have to add COW support, which I think they added for the in-line encryption stuff (in-line data transformations like encryption or compression have the exact same issues that data checksumming does when run on a non-COW filesystem). > > Duncan&Hugo said, the reason is basically it cannot do checksums with > no-CoW, because there's no guarantee that the fs doesn't end up > inconsistently... Exactly. > > But, AFAIU, not doing CoW, while not having a journal (or does it have > one for these cases???) almost certainly means that the data (not > necessarily the fs) will be inconsistent in case of a crash during a > no-CoWed write anyway, right? > Wouldn't it be basically like ext2? Kind of, but not quite. Even with nodatacow, metadata is still COW, which is functionally as safe as a traditional journaling filesystem like XFS or ext4. Absolute worst case scenario for both nodatacow on BTRFS, and a traditional journaling filesystem, the contents of the file are inconsistent. 
However, almost all of the things that are recommended use cases for nodatacow (primarily database files and VM images) have some internal method of detecting and dealing with corruption (because of the traditional filesystem semantics ensuring metadata consistency, but not data consistency). > > Or we have the case of multi-device, e.g. RAID1, multiple copies of the > same blocks, a crash has happened during writing such (no-CoWed and no- > checksummed)... > Again it's almost certainly that at least one (maybe even both) of the > blocks contains garbage and likely (at least a 50% chance) we get that > one when the actual read happens later (I was told btrfs would behave > in these cases like e.g MD RAID does,... deliver what the first > readable block said). > > If btrfs would calculate checksums and write them e.g. after or before > the actual data was written,... what would be the worst that could > happen (in my naive understanding of course ;-) ) at a crash? > - I'd say either one is lucky, and checksum and data matches. > Yay. > - Or it doesn't match, which could boil down to the following two > cases: > - the data wasn't written out correctly and is actually garbage > => then we can be happy, that the checksum wouldn't match and we'd > get an error > - the data was written out correctly, but before the csum was > written the system crashed, so the csum would now tell us that the > block is bad, while in reality it isn't. > or the other way round: > the csum was written out (completely)... and no data was written > at all before the system crashed (so the old block would be still > completely there) > => in both cases: so what? Having that particular case happening > is probably far less likely, than csumming actually detecting a > bad block, or not completely written data in case of a crash. > (Not to talk about all the cases where nothing crashes, and > where we simply would want to detect block errors, bus errors, > etc.) 
There is another case to consider, the data got written out, but the crash happened while writing the checksum (so the checksum was partially written, and is corrupt). This means we get a false positive on a disk error that isn't there, even when the data is correct, and that should be avoided if at all possible. Also, because of how disks work, and the internal layout of BTRFS, it's a lot more likely than you think that the data would be written but the checksum wouldn't. The checksum isn't part of the data block, nor is it stored with it, it's actually a part of the metadata block that stores the layout of the data for that file on disk. Because of the nature of the stuff that nodatacow is supposed to be used for, it's almost always better to return bad data than it is to return no data (if you can get any data, then it's usually possible to recover the database file or VM image, but if you get none, it's a lot harder to recover the file). > => Of course it wouldn't be as nice as in CoW, where it could > simply take the most recent consistent state of that block, but > still way better than: > - delivering bogus data to the application in n other cases > - not being able to decide which of m block copies is valid, if a > RAID is scrubbed This gets _really_ scarily dangerous for a RAID setup, because we _absolutely_ can't ensure consistency between disks without using COW. 
As of right now, we dispatch writes to disks one at a time (although this would still be just as dangerous even if we dispatched writes in parallel), so if we crash it's possible that one disk would hold the old data, one would hold the new data, and _both_ would have correct checksums. That means we would non-deterministically return one block or the other when an application tries to read it, and which block we return could change _each_ time the read is attempted, which absolutely breaks the semantics required of a filesystem on any modern OS (namely, the file won't change unless something writes to it).
>
> And as said before, AFAIU, nodatacow'ed files have no journal in btrfs
> as in ext3/4, so it's basically anyway that such files, when written
> during a crash, may end up in any state, right? Which makes not having
> a csum sound even worse, since nothing tells that this file is possibly
> bad.
As I stated above, most of the stuff that nodatacow is intended for already has its own built-in protection. No self-respecting RDBMS would be caught dead without internal consistency checks, and they all do COW internally anyway (because it's required for atomic transactions, which are an absolute requirement for database systems), and in fact that's part of why performance is so horrible for them on a COW filesystem. As far as VMs go, either the disk image should have its own internal consistency checks (for example, the qcow2 format used by QEMU, which also does COW internally), or the guest OS should have such checks.
>
> Not having checksumming seems to be especially bad in the multi-device
> case... what happens when one runs a scrub? AFAIU, it simply does what
> e.g. MD does: taking the first readable block, writing it to any
> others, thereby possibly destroying the actually good one?
AFAICT from the code, yes, that is the case.
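The RAID1 scenario described above (one mirror holding the old block, one holding the new block, both with valid checksums) can be sketched in Python as follows. This is a toy model, not btrfs code: the `Mirror` class and `write_raid1` helper are hypothetical, and the sketch assumes that each mirror updates its data and its checksum together, which is a simplification of where btrfs actually keeps checksums.

```python
import zlib


class Mirror:
    """One copy of a block together with its checksum (toy model)."""

    def __init__(self, data: bytes):
        self.write(data)

    def write(self, data: bytes) -> None:
        # Assumption: data and checksum are updated together on one mirror.
        self.data = data
        self.csum = zlib.crc32(data)

    def valid(self) -> bool:
        return zlib.crc32(self.data) == self.csum


def write_raid1(mirrors: list, data: bytes, crash_after=None) -> None:
    """Update the mirrors one at a time; stop early to imitate a crash."""
    for index, mirror in enumerate(mirrors):
        if crash_after is not None and index >= crash_after:
            return
        mirror.write(data)


mirrors = [Mirror(b"old!"), Mirror(b"old!")]
write_raid1(mirrors, b"new!", crash_after=1)  # crash after the first mirror

all_valid = all(m.valid() for m in mirrors)
contents = {m.data for m in mirrors}
```

After the simulated crash every mirror passes its own checksum, yet the mirrors disagree, so a checksum alone gives a reader or a scrub no way to decide which copy is the current one.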
> > Not sure about whether the following would make any practical sense: > If data checksumming would work for nodatacow, then maybe some people > may even choose to run btrfs in CoW1 mode,.. they still could have most > fancy features from btrfs (checksumming, snapshots, perhaps even > refcopy?) but unless snapshots or refcopies are explicitly made, btrfs > doesn't do CoW. That might have some use when people _really_ don't care about consistency across a crash (for example, when it's a filesystem that gets reinitialized every boot). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-14 14:16 ` Austin S. Hemmelgarn @ 2015-12-15 3:15 ` Christoph Anton Mitterer 2015-12-15 16:00 ` Austin S. Hemmelgarn 0 siblings, 1 reply; 12+ messages in thread From: Christoph Anton Mitterer @ 2015-12-15 3:15 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs [-- Attachment #1: Type: text/plain, Size: 15302 bytes --] On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote: > > When one starts to get a bit deeper into btrfs (from the admin/end- > > user > > side) one sooner or later stumbles across the recommendation/need > > to > > use nodatacow for certain types of data (DBs, VM images, etc.) and > > the > > reason, AFAIU, being the inherent fragmentation that comes along > > with > > the CoW, which is especially noticeable for those types of files > > with > > lots of random internal writes. > It is worth pointing out that in the case of DB's at least, this is > because at least some of the do COW internally to provide the > transactional semantics that are required for many workloads. Guess that also applies to some VM images then, IIRC qcow2 does CoW. > > a) for performance reasons (when I consider our research software > > which > > often has IO as the limiting factor and where we want as much IO > > being > > used by actual programs as possible)... > There are other things that can be done to improve this. I would > assume > of course that you're already doing some of them (stuff like using > dedicated storage controller cards instead of the stuff on the > motherboard), but some things often get overlooked, like actually > taking > the time to fine-tune the I/O scheduler for the workload (Linux has > particularly brain-dead default settings for CFQ, and the deadline > I/O > scheduler is only good in hard-real-time usage or on small hard > drives > that actually use spinning disks). 
Well sure, I think we've done most of this and have dedicated controllers, at least of a quality that funding allows us ;-) But regardless of how much one tunes, and how good the hardware is: if you'd then always lose a fraction of your overall IO, be it just 5%, to defragging these types of files, one may actually want to avoid this altogether, for which nodatacow seems *the* solution.

> The big argument for defragmenting a SSD is that it makes it such
> that
> you require fewer I/O requests to the device to read a file

I've read about that too, but since I haven't had much personal experience or measurements in that respect, I didn't list it :)

> The problem is not entirely the lack of COW semantics, it's also the
> fact that it's impossible to implement an atomic write on a hard
> disk.

Sure... but that's just the same for the nodatacow writes of data. (And the same, AFAIU, for CoW itself, just that we'd notice any corruption in case of a crash due to the CoWed nature of the fs and could go back to the last generation.)

> > but I wouldn't know that relational DBs really do cheksuming of the
> > data.
> All the ones I know of except GDBM and BerkDB do in fact provide the
> option of checksumming. It's pretty much mandatory if you want to be
> considered for usage in financial, military, or medical applications.

Hmm I see... PostgreSQL seems to have had it since 9.3... didn't know that... only crc16 but at least something.

> > Long story short, it does happen every now and then, that a scrub
> > shows
> > file errors, for neither the RAID was broken, nor there were any
> > block
> > errors reported by the disks, or anything suspicious in SMART.
> > In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or a
> undetected error in the physical link layer to the disks, or an error
> in
> the disk cache or controller, or any number of other things.

Well sure,...
I was referring to these particular cases, where silent block corruption was the most likely reason. Repeated reads reproducibly returned identical data, which probably rules out bad RAM or controller, etc.

> BTRFS
> could only protect against some cases, not all (for example, if you
> have
> a big enough error in RAM that ECC doesn't catch it, you've got
> serious
> issues that just about nothing short of a cold reboot can save you
> from).

Sure, I haven't claimed that checksumming for no-CoWed data is a solution for everything.

> > But, AFAIU, not doing CoW, while not having a journal (or does it
> > have
> > one for these cases???) almost certainly means that the data (not
> > necessarily the fs) will be inconsistent in case of a crash during
> > a
> > no-CoWed write anyway, right?
> > Wouldn't it be basically like ext2?
> Kind of, but not quite. Even with nodatacow, metadata is still COW,
> which is functionally as safe as a traditional journaling filesystem
> like XFS or ext4.

Sure, I was referring to the data part only, should have made that more clear.

> Absolute worst case scenario for both nodatacow on
> BTRFS, and a traditional journaling filesystem, the contents of the
> file
> are inconsistent. However, almost all of the things that are
> recommended use cases for nodatacow (primarily database files and VM
> images) have some internal method of detecting and dealing with
> corruption (because of the traditional filesystem semantics ensuring
> metadata consistency, but not data consistency).

What about VMs? At least a quick google search didn't give me any results on whether there would be e.g. checksumming support for qcow2. For raw images there surely is not.

And even if DBs do some checksumming now, it may just be a consequence of that being missing in the filesystems. As I've written somewhere else in the previous mail: it's IMHO much better if one system takes care of this, where the code is well tested, than each application doing its own thing.
> > - the data was written out correctly, but before the csum was > > written the system crashed, so the csum would now tell us that > > the > > block is bad, while in reality it isn't. > There is another case to consider, the data got written out, but the > crash happened while writing the checksum (so the checksum was > partially > written, and is corrupt). This means we get a false positive on a > disk > error that isn't there, even when the data is correct, and that > should > be avoided if at all possible. I've had that, and I've left it quoted above. But as I've said before: That's one case out of many? How likely is it that the crash happens exactly after a large data block has been written followed by a relatively tiny amount of checksum data. I'd assume it's far more likely that the crash happens during writing the data. And regarding "reporting data to be in error, which is actually correct"... isn't that what all journaling systems may do? And, AFAIU, isn't that also what can happen in btrfs? The data was already CoWed, but the metadata wasn't written out... so it would fall back somehow - here's where the unicorn[0] does it's job - to an older generation? So that would be nothing really new. > Also, because of how disks work, and the internal layout of BTRFS, > it's > a lot more likely than you think that the data would be written but > the > checksum wouldn't. The checksum isn't part of the data block, nor is > it > stored with it, it's actually a part of the metadata block that > stores > the layout of the data for that file on disk. Well it was clear to me, that data+csum isn't sequentially on disk are there any numbers from real studies how often it would happen that data is written correctly but not the metadata? And even if such study would show that - crash isn't the only problem we want to protect here (silent block errors, bus errors, etc). 
I don't want to say crashes never happen, but in my practical experience they don't happen that often either...

Losing a few blocks of valid data in the rare case of a crash seems to be a penalty worth paying when one gains confidence in data integrity in all other cases.

> Because of the nature of
> the stuff that nodatacow is supposed to be used for, it's almost
> always
> better to return bad data than it is to return no data (if you can
> get
> any data, then it's usually possible to recover the database file or
> VM
> image, but if you get none, it's a lot harder to recover the file).

No. Simply no! :D

Seriously: If you have bad data, for whichever reason (crash, silent block errors, etc.), it's always best to notice. *Then* you can decide what to do:
- Is there a backup, and does one want to get the data from that backup rather than continuing to use bad data, possibly even overwriting good backups one week later?
- Is there either no backup, or is the effort of restoring from it too big and the corruption doesn't matter enough (e.g. when you have large video files and there is a single bit flip... well, that may just mean that one colour looks a tiny bit different)?

But that's nothing the fs could or should decide for the user.

After I had sent the initial mail of this thread I remembered what I had forgotten to add: Is there a way in btrfs to give clearance to a file which it found to be in error based on checksums?

Cause *this* is IMHO the proper solution for your "it's almost always better to return bad data than it is to return no data".

When we at the Tier-2 detect a file error that we cannot correct by means of replicas, we determine the owner of that file, tell him about the issue, and if he wants to continue using the broken file, there's a way in the storage management system to rewrite the checksum.
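The policy argued for above (notice the mismatch first, then let the owner decide) can be sketched as a toy model. This is only an illustration under assumptions, not an existing btrfs interface: `Block`, `ChecksumError` and `accept_current_data` are hypothetical names, zlib's CRC-32 stands in for the block checksum, and the crash is simulated by a flag.

```python
import zlib


class ChecksumError(IOError):
    """Raised when stored data does not match its stored checksum."""


class Block:
    """Toy in-place block with a separately stored checksum (hypothetical)."""

    def __init__(self, data: bytes):
        self.data = data
        self.csum = zlib.crc32(data)

    def write(self, data: bytes, crash_before_csum: bool = False) -> None:
        self.data = data                  # step 1: overwrite data in place
        if crash_before_csum:
            return                        # simulated crash: checksum is now stale
        self.csum = zlib.crc32(data)      # step 2: update the checksum

    def read(self) -> bytes:
        if zlib.crc32(self.data) != self.csum:
            raise ChecksumError("data does not match stored checksum")
        return self.data

    def accept_current_data(self) -> None:
        """Explicit clearance: re-derive the checksum from what is on disk."""
        self.csum = zlib.crc32(self.data)


blk = Block(b"version 1")
blk.write(b"version 2", crash_before_csum=True)

try:
    blk.read()
    flagged = False
except ChecksumError:
    flagged = True            # the mismatch is noticed; policy is left to the user

blk.accept_current_data()     # the owner decides to keep using the file
print(flagged, blk.read())
```

The sketch covers the crash case where the data is fully written but the checksum is stale: the read is flagged even though the data is intact, and the explicit clearance step is what turns that false positive back into a readable file.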
> > => Of course it wouldn't be as nice as in CoW, where it could > > simply take the most recent consistent state of that block, but > > still way better than: > > - delivering bogus data to the application in n other cases > > - not being able to decide which of m block copies is valid, if > > a > > RAID is scrubbed > This gets _really_ scarily dangerous for a RAID setup, because we > _absolutely_ can't ensure consistency between disks without using > COW. Hmm now I just thought "damn he got me" ;-) > As of right now, we dispatch writes to disks one at a time (although > this would still be just as dangerous even if we dispatched writes in > parallel) Sure... > so if we crash it's possible that one disk would hold the old > data, one would hold the new data sure.. > and _both_ would have correct > checksums, which means that we would non-deterministically return one > block or the other when an application tries to read it, and which > block > we return could change _each_ time the read is attempted, which > absolutely breaks the semantics required of a filesystem on any > modern > OS (namely, the file won't change unless something writes to it). Here I do not longer follow you, so perhaps you (or someone else) can explain a bit further. :-) a) Are checksums really stored per device (and not just once in the metadata? At least from my naive understanding this would either mean that there's a waste of storage, or that the csums are made on data that could vary from device to device (e.g. the same data split up in different extents, or compression on one device but not on the other). but.. b) that problem (different data each with valid corresponding csums) should in principle exist for CoWed data as well, right? And there, I guess, it's solved by CoWing the metadata... (which would still be the case for no-dataCoWed files). Don't know what btrfs does in the CoWed case when such incident happens... 
how does it decide which of two such corresponding blocks would be the newer one? The generations?

Anyway, since metadata would still be CoWed, I think I may have gotten once again out of the tight spot - at least until you explain to me why my naive understanding, as laid out just above, doesn't work out O:-)

> As I stated above, most of the stuff that nodatacow is intended for
> already has it's own built-in protection. No self-respecting RDBMS
> would be caught dead without internal consistency checks, and they
> all
> do COW internally anyway (because it's required for atomic
> transactions,
> which are an absolute requirement for database systems), and in fact
> that's part of why performance is so horrible for them on a COW
> filesystem. As far as VM's go, either the disk image should have
> it's
> own internal consistency checks (for example, qcow2 format, used by
> QEMU, which also does COW internally), or the guest OS should have
> such
> checks.

Well, for PostgreSQL it's still fairly new (9.3, as I've said above, https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums), but it's not done per default (http://www.postgresql.org/docs/current/static/app-initdb.html), and they warn about a noticeable performance penalty (though I have of course no data on whether this would be better/similar/worse compared to what is implied by btrfs checksumming).

I've tried to find something for MySQL/MariaDB, but the only thing I could find there was: CHECKSUM TABLE
But that seems to be a SQL command, i.e. not on-read checksumming as we're talking about, but rather something the application/admin would need to do manually.

BDB seems to support it (https://docs.oracle.com/cd/E17076_04/html/api_reference/C/dbset_flags.html), but again not per default. (And yes, we have quite big ones of them ^^)

SQLite doesn't seem to do it, at least not per default?
(https://www.sqlite.org/fileformat.html)

I tried once again to find any reference that qcow2 (which alone I think would justify having csum support for nodatacow) supports checksumming. https://people.gnome.org/~markmc/qcow-image-format.html which seems to be the original definition, doesn't tell[1] anything about it. Raw images, of course, don't do any form of checksumming... I had a short glance at OVF, but nothing popped up immediately that would make me believe it supports checksumming. Well there's VDI and VHD left... but are these still used seriously? I guess KVM and Xen people mostly use raw or qcow2 these days, don't they?

So given all that, the picture looks a bit different again, I think. None of the major FLOSS DBs does any checksumming per default, and MySQL doesn't seem to support it at all, AFAICT. No VM image format seems to even support it.

And not to talk about the countless scientific data formats, which are mostly not widely known to the FLOSS world, but which are used with FLOSS software/Linux.

So AFAICT, the only thing left is torrent/edonkey files. And do these store the checksums along with the files? Or do they rather wait until a chunk has been received, verify that and then throw it away? In any case however, at least some of these file types eventually end up as raw files, without any checksum (as that's only used during download),... so when the files remain in the nodatacow area, they're again at risk (+ during the time after the P2P software has finally committed them to disk, and they'd be moved to CoWed and thus checksummed areas)

Cheers, Chris. :-)

[0] http://abstrusegoose.com/120
[1] admittedly I just cross read over it, and searched for the usual suspect strings (hash, crc, sum) ;)
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-15 3:15 ` Christoph Anton Mitterer @ 2015-12-15 16:00 ` Austin S. Hemmelgarn 2015-12-16 9:15 ` Duncan ` (2 more replies) 0 siblings, 3 replies; 12+ messages in thread From: Austin S. Hemmelgarn @ 2015-12-15 16:00 UTC (permalink / raw) To: Christoph Anton Mitterer, linux-btrfs On 2015-12-14 22:15, Christoph Anton Mitterer wrote: > On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote: >>> When one starts to get a bit deeper into btrfs (from the admin/end- >>> user >>> side) one sooner or later stumbles across the recommendation/need >>> to >>> use nodatacow for certain types of data (DBs, VM images, etc.) and >>> the >>> reason, AFAIU, being the inherent fragmentation that comes along >>> with >>> the CoW, which is especially noticeable for those types of files >>> with >>> lots of random internal writes. >> It is worth pointing out that in the case of DB's at least, this is >> because at least some of the do COW internally to provide the >> transactional semantics that are required for many workloads. > Guess that also applies to some VM images then, IIRC qcow2 does CoW. Yep, and I think that VMWare's image format does too. > > > >>> a) for performance reasons (when I consider our research software >>> which >>> often has IO as the limiting factor and where we want as much IO >>> being >>> used by actual programs as possible)... >> There are other things that can be done to improve this. I would >> assume >> of course that you're already doing some of them (stuff like using >> dedicated storage controller cards instead of the stuff on the >> motherboard), but some things often get overlooked, like actually >> taking >> the time to fine-tune the I/O scheduler for the workload (Linux has >> particularly brain-dead default settings for CFQ, and the deadline >> I/O >> scheduler is only good in hard-real-time usage or on small hard >> drives >> that actually use spinning disks). 
> Well sure, I think we'de done most of this and have dedicated > controllers, at least of a quality that funding allows us ;-) > But regardless how much one tunes, and how good the hardware is. If > you'd then loose always a fraction of your overall IO, and be it just > 5%, to defragging these types of files, one may actually want to avoid > this at all, for which nodatacow seems *the* solution. nodatacow only works for that if the file is pre-allocated, if it isn't, then it still ends up fragmented. > > >> The big argument for defragmenting a SSD is that it makes it such >> that >> you require fewer I/O requests to the device to read a file > I've had read about that too, but since I haven't had much personal > experience or measurements in that respect, I didn't list it :) I can't give any real numbers, but I've seen noticeable performance improvements on good SSD's (Intel, Samsung, and Crucial) when making sure that things are defragmented. > >> The problem is not entirely the lack of COW semantics, it's also the >> fact that it's impossible to implement an atomic write on a hard >> disk. > Sure... but that's just the same for the nodatacow writes of data. > (And the same, AFAIU, for CoW itself, just that we'd notice any > corruption in case of a crash due to the CoWed nature of the fs and > could go back to the last generation). Yes, but it's also the reason that using either COW or a log-structured filesystem (like NILFS2, LogFS, or I think F2FS) is important for consistency. > > >>> but I wouldn't know that relational DBs really do cheksuming of the >>> data. >> All the ones I know of except GDBM and BerkDB do in fact provide the >> option of checksumming. It's pretty much mandatory if you want to be >> considered for usage in financial, military, or medical applications. > Hmm I see... PostgreSQL seem to have it since 9.3 ... didn't know > that... only crc16 but at least something. 
> > >>> Long story short, it does happen every now and then, that a scrub >>> shows >>> file errors, for neither the RAID was broken, nor there were any >>> block >>> errors reported by the disks, or anything suspicious in SMART. >>> In other words, silent block corruption. >> Or a transient error in system RAM that ECC didn't catch, or a >> undetected error in the physical link layer to the disks, or an error >> in >> the disk cache or controller, or any number of other things. > Well sure,... I was referring to these particular cases, where silent > block corruption was the most likely reason. > The data was reproducibly read identical, which probably rules out bad > RAM or controller, etc. > > >> BTRFS >> could only protect against some cases, not all (for example, if you >> have >> a big enough error in RAM that ECC doesn't catch it, you've got >> serious >> issues that just about nothing short of a cold reboot can save you >> from). > Sure, I haven't claimed, that checksumming for no-CoWed data is a > solution for everything. > > >>> But, AFAIU, not doing CoW, while not having a journal (or does it >>> have >>> one for these cases???) almost certainly means that the data (not >>> necessarily the fs) will be inconsistent in case of a crash during >>> a >>> no-CoWed write anyway, right? >>> Wouldn't it be basically like ext2? >> Kind of, but not quite. Even with nodatacow, metadata is still COW, >> which is functionally as safe as a traditional journaling filesystem >> like XFS or ext4. > Sure, I was referring to the data part only, should have made that more > clear. > > >> Absolute worst case scenario for both nodatacow on >> BTRFS, and a traditional journaling filesystem, the contents of the >> file >> are inconsistent. 
However, almost all of the things that are >> recommended use cases for nodatacow (primarily database files and VM >> images) have some internal method of detecting and dealing with >> corruption (because of the traditional filesystem semantics ensuring >> metadata consistency, but not data consistency). > What about VMs? At least a quick google search didn't give me any > results on whether there would be e.g. checksumming support for qcow2. > For raw images there surely is not. I don't mean that the VMM does checksumming, I mean that the guest OS should be the one to handle the corruption. No sane OS doesn't run at least some form of consistency checks when mounting a filesystem. > > And even if DBs do some checksumming now, it may be just a consequence > of that missing in the filesystems. > As I've written somewhere else in the previous mail: it's IMHO much > better if one system takes care on this, where the code is well tested, > than each application doing it's own thing. That's really a subjective opinion. The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can. This is actually essentially the same reason that BTRFS and ZFS have multi-device support, the filesystem knows much better than the block device how it stores data, so it makes more sense to handle laying that data out across the disks in the filesystem. > > >>> - the data was written out correctly, but before the csum was >>> written the system crashed, so the csum would now tell us that >>> the >>> block is bad, while in reality it isn't. >> There is another case to consider, the data got written out, but the >> crash happened while writing the checksum (so the checksum was >> partially >> written, and is corrupt). This means we get a false positive on a >> disk >> error that isn't there, even when the data is correct, and that >> should >> be avoided if at all possible. 
> I've had that, and I've left it quoted above. > But as I've said before: That's one case out of many? How likely is it > that the crash happens exactly after a large data block has been > written followed by a relatively tiny amount of checksum data. > I'd assume it's far more likely that the crash happens during writing > the data. Except that the whole metadata block pointing to that data block gets rewritten, not just the checksum. > > And regarding "reporting data to be in error, which is actually > correct"... isn't that what all journaling systems may do? No, most of them don't actually do that. The general design of a journaling filesystem is that the journal is used as what's called a Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm going to write this data here in a little while.' so that when your system dies while writing that data, you can then finish writing it correctly when the system gets booted up again. And in particular, the only journaling filesystem that I know of that even allows the option of journaling the file contents instead of just metadata is ext4. > And, AFAIU, isn't that also what can happen in btrfs? The data was > already CoWed, but the metadata wasn't written out... so it would fall > back somehow - here's where the unicorn[0] does it's job - to an older > generation? Kind of, there are some really rare cases where it's possible if you get _really_ unlucky on a multi-device filesystem that things get corrupted such that the filesystem thinks that data that is perfectly correct is invalid, and thinks that the other copy which is corrupted is valid. (I've actually had this happen before, it was not fun trying to recover from it). > So that would be nothing really new. > > >> Also, because of how disks work, and the internal layout of BTRFS, >> it's >> a lot more likely than you think that the data would be written but >> the >> checksum wouldn't. 
The checksum isn't part of the data block, nor is >> it >> stored with it, it's actually a part of the metadata block that >> stores >> the layout of the data for that file on disk. > Well it was clear to me, that data+csum isn't sequentially on disk are > there any numbers from real studies how often it would happen that data > is written correctly but not the metadata? > And even if such study would show that - crash isn't the only problem > we want to protect here (silent block errors, bus errors, etc). > I don't want to say crashes never happen, but in my practical > experience they don't happen that often either,... > > Losing a few blocks of valid data in the rare case of crashes, seems to > be a penalty worth, when one gains confidence in data integrity in all > others. That _really_ depends on what the data is. If you made that argument to the IT department at a financial institution, they would probably fall over laughing at you. > > >> Because of the nature of >> the stuff that nodatacow is supposed to be used for, it's almost >> always >> better to return bad data than it is to return no data (if you can >> get >> any data, then it's usually possible to recover the database file or >> VM >> image, but if you get none, it's a lot harder to recover the file). > No. Simply no! :D > > Seriously: > If you have bad data, for whichever reason (crash, silent block errors, > etc.), it's always best to notice. > *Then* you can decide what to do: > - Is there a backup and does one want to get the data from that > backup, rather than continuing to use bad data, possibly even > overwriting good backups one week later > - Is there either no backup or the effort of recovering it is to big > and the corruption doesn't matter enough (e.g. when you have large > video files, and there is a sinlge bit flip... well that may just > mean that one colour looks a tiny bit different) > > But that's nothing the fs could or should decide for the user. 
OK, good point about this being policy. And in some cases (executables, configuration for administrative software, similar things), it is better to just return an error, but in many cases, that's not what most desktop users would want. Think document files, where a single byte error could easily be corrected by the user, or configuration files for sanely written apps (It's a lot nicer (and less confusing for someone without a lot of low-level computer background) to say 'Hey, your configuration file is messed up, here's how to fix it', than it is to say 'Hey, I couldn't read your configuration file'). And because BTRFS is supposed to be a general purpose filesystem, it has to account for the case of desktop users, and because server admins are supposed to be smart, the default should be for desktop usage. > > After I've had sent the initial mail from this thread I remembered what > I've had forgotten to add: > Is there a way in btrfs, to tell it that gives clearance to a file > which it found to be in error based on checksums? > > Cause *this* is IMHO the proper solution for your "it's almost always > better to return bad data than it is to return no data". > > When we at the Tier-2 detect a file error that we cannot correct by > means of replicas, we determine the owner of that file, tell him about > the issue, and if he wants to continue using the broken file, there's a > way in the storage management system to rewrite the checksum. > > >>> => Of course it wouldn't be as nice as in CoW, where it could >>> simply take the most recent consistent state of that block, but >>> still way better than: >>> - delivering bogus data to the application in n other cases >>> - not being able to decide which of m block copies is valid, if >>> a >>> RAID is scrubbed >> This gets _really_ scarily dangerous for a RAID setup, because we >> _absolutely_ can't ensure consistency between disks without using >> COW. 
> Hmm now I just thought "damn he got me" ;-) > >> As of right now, we dispatch writes to disks one at a time (although >> this would still be just as dangerous even if we dispatched writes in >> parallel) > Sure... > > >> so if we crash it's possible that one disk would hold the old >> data, one would hold the new data > sure.. > > >> and _both_ would have correct >> checksums, which means that we would non-deterministically return one >> block or the other when an application tries to read it, and which >> block >> we return could change _each_ time the read is attempted, which >> absolutely breaks the semantics required of a filesystem on any >> modern >> OS (namely, the file won't change unless something writes to it). > Here I do not longer follow you, so perhaps you (or someone else) can > explain a bit further. :-) > > a) Are checksums really stored per device (and not just once in the > metadata? At least from my naive understanding this would either mean > that there's a waste of storage, or that the csums are made on data > that could vary from device to device (e.g. the same data split up in > different extents, or compression on one device but not on the other). > but.. AFAIUI, checksums are stored per-instance for every block. This is important in a multi-device filesystem in case you lose a device, so that you still have a checksum for the block. There should be no difference between extent layout and compression between devices however. > > b) that problem (different data each with valid corresponding csums) > should in principle exist for CoWed data as well, right? And there, I > guess, it's solved by CoWing the metadata... (which would still be the > case for no-dataCoWed files). Yes. > Don't know what btrfs does in the CoWed case when such incident > happens... how does it decide which of two such corresponding blocks > would be the newer one? The generations? 
Usually, but like I mentioned above there are edge cases that can occur as a result of data corruption on disk or other really rare circumstances. In the particular case of multiple copies of a block with different data but valid checksums, I'm about 95% certain that it will non-deterministically return one block or the other on an arbitrary read when the read doesn't hit the VFS cache. This is a potential issue for COW as well, but much less likely because it can more easily detect the corruption and fix it.

> Anyway, since metadata would still be CoWed, I think I may have gotten
> once again out of the tight spot - at least until you explain me, why
> my naive understanding, as laid out just above, doesn't work out O:-)

Hmm, I had forgotten about the metadata being COW. That does avoid the situation above under the specified circumstances, but does not avoid it happening due to disk errors (although that's extremely unlikely, as it would require direct correlation of the errors in a way that is statistically impossible).

>> As I stated above, most of the stuff that nodatacow is intended for
>> already has its own built-in protection. No self-respecting RDBMS
>> would be caught dead without internal consistency checks, and they
>> all
>> do COW internally anyway (because it's required for atomic
>> transactions,
>> which are an absolute requirement for database systems), and in fact
>> that's part of why performance is so horrible for them on a COW
>> filesystem. As far as VMs go, either the disk image should have
>> its
>> own internal consistency checks (for example, qcow2 format, used by
>> QEMU, which also does COW internally), or the guest OS should have
>> such
>> checks.
> Well, for PostgreSQL it's still fairly new (9.3, as I've said above, > https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums), > but it's not done by default > (http://www.postgresql.org/docs/current/static/app-initdb.html), and they warn about a noticeable > performance penalty (though I have of course no data whether this would > be better/similar/worse compared to what is implied by btrfs checksumming). > > I've tried to find something for MySQL/MariaDB, but the only thing I > could find there was: CHECKSUM TABLE > But that seems to be a SQL command, i.e. not on-read checksumming as > we're talking about, but rather something the application/admin would > need to do manually. I actually had been referring to this, with the assumption that the application would use it to verify its own data. I hadn't realized PostgreSQL had in-line support for it. > > > BDB seems to support it > (https://docs.oracle.com/cd/E17076_04/html/api_reference/C/dbset_flags.html), > but again not by default. > (And yes, we have quite big ones of them ^^) > > SQLite doesn't seem to do it, at least not by default? > (https://www.sqlite.org/fileformat.html) > > > I tried once again to find any reference that qcow2 (which alone I > think would justify having csum support for nodatacow) supports > checksumming. > https://people.gnome.org/~markmc/qcow-image-format.html which seems to > be the original definition, doesn't say[1] anything about it. > Raw images do of course not do any form of checksumming... > I had a short glance at OVF, but nothing popped up immediately that > would make me believe it supports checksumming. > Well there's VDI and VHD left... but are these still used seriously? > I guess KVM and Xen people mostly use raw or qcow2 these days, don't > they? VDI is still widely used, because it's the default for VirtualBox when creating a VM.
VHD is way more widely used than it should be, solely because there are insane people out there using Windows as a virtualization host. You also forgot VMDK, which is what VMware uses almost exclusively, but I don't think it has built-in checksumming. As for Xen, the best current practice is to avoid using image files like the plague, and use disks directly instead (or more commonly, use either LVM, or ZFS with zvols). > > > So given all that, the picture looks a bit different again, I think. > None of the major FLOSS DBs does any checksumming by default, MySQL > doesn't seem to support it, AFAICT. No VM image format seems to even > support it. Again, most of my intent in referring to those was that the application or the guest OS would do the verification itself. > > And not to mention the countless scientific data formats, which are > mostly not widely known to the FLOSS world, but which are used with > FLOSS software/Linux. If the application doesn't have that type of thing built in, then that's not something the filesystem should be worrying about; that's the job of the application developers to deal with. The point of a filesystem is to store data within the integrity guarantees provided by the hardware, possibly with some additional protection, not to save the user or application from making stupid choices. > > So AFAICT, the only thing left is torrent/edonkey files. > And do these store the checksums along with the files? Or do they rather > wait until a chunk has been received, verify that and then throw it > away? > In any case however, at least some of these file types eventually end > up in the raw files, without any checksum (as that's only used during > download),...
so when the files remain in the nodatacow area, they're > again at risk (plus during the time after the P2P software has finally > committed them to disk and before they'd be moved to CoWed and thus > checksummed areas) In the case of stuff like torrents and such, all the good software for working with them has an option to verify the file after downloading. > > [0] http://abstrusegoose.com/120 > [1] admittedly I just skimmed it, and searched for the usual > suspect strings (hash, crc, sum) ;) >
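To make the "two copies, both with valid checksums" scenario from this exchange concrete, here is a minimal Python sketch of how a reader might choose between two mirrored copies of a block: trust a copy only if its checksum verifies, prefer the higher generation, and treat "same generation, both checksums valid, different data" as undecidable. The names Copy, make, csum_ok and pick_copy are hypothetical and exist only for this illustration, and zlib.crc32 merely stands in for the filesystem's checksum function. This is a toy model under those assumptions, not a description of how btrfs is implemented.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Copy:
    """One stored copy of a block (hypothetical model, not a btrfs structure)."""
    device: str
    generation: int
    data: bytes
    csum: int  # checksum recorded in metadata for this copy


def make(device: str, generation: int, data: bytes) -> Copy:
    """Build a copy whose recorded checksum matches its data."""
    return Copy(device, generation, data, zlib.crc32(data))


def csum_ok(copy: Copy) -> bool:
    # zlib.crc32 is only a stand-in for the real checksum function.
    return zlib.crc32(copy.data) == copy.csum


def pick_copy(a: Copy, b: Copy) -> Optional[Copy]:
    """Return the copy to trust, or None if no safe choice exists."""
    valid = [c for c in (a, b) if csum_ok(c)]
    if not valid:
        return None                      # both copies fail verification
    if len(valid) == 1:
        return valid[0]                  # the other copy could be repaired from it
    if a.generation != b.generation:
        return max(valid, key=lambda c: c.generation)  # newer generation wins
    if a.data == b.data:
        return a                         # identical copies, either will do
    return None                          # same generation, both valid, data differs


old = make("dev1", 10, b"old data")
new = make("dev2", 11, b"new data")
print(pick_copy(old, new).device)        # dev2: the higher generation is chosen

split_a = make("dev1", 12, b"version A")
split_b = make("dev2", 12, b"version B")
print(pick_copy(split_a, split_b))       # None: nothing to tell the copies apart
```

The last case is the one the thread worries about: when nothing distinguishes the copies, a checksum alone cannot say which one is current, so a real implementation has to either prevent that state or pick arbitrarily.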
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-15 16:00 ` Austin S. Hemmelgarn @ 2015-12-16 9:15 ` Duncan 2015-12-16 9:55 ` Duncan 2015-12-17 2:09 ` Christoph Anton Mitterer 2 siblings, 0 replies; 12+ messages in thread From: Duncan @ 2015-12-16 9:15 UTC (permalink / raw) To: linux-btrfs Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as excerpted: > And in particular, the only > journaling filesystem that I know of that even allows the option of > journaling the file contents instead of just metadata is ext4. IIRC, ext3 was the first to have it in Linux mainline, with data=writeback for the speed freaks that don't care about data loss, data=ordered as the default normal option (except for that infamous period when Linus lost his head and let people talk him into switching to data=writeback, despite the risks... he later came back to his senses and reverted that), and data=journal for the folks that were willing to trade a bit of speed for better data protection (tho it was famous for surprising everybody, in that in certain use-cases it was extremely fast, faster than data=writeback, something I don't think was ever fully explained). To my knowledge ext3 still has that, tho I haven't used it in probably a decade. Reiserfs has all three data= options as well, with data=ordered the default, tho it only had data=writeback initially. While I've used reiserfs for years, it has always been with the default data=ordered since that was introduced, and I'd be surprised if data=journal had the same use-case speed advantage that it did on ext3, as it's too different. Meanwhile, that early data=writeback default is where reiserfs got its ill repute for data loss, but it had long switched to data=ordered by default by the time Linus lost his senses and tried data=writeback by default on ext3.
Because I was on reiserfs from data=writeback era, I was rather glad most kernel hackers didn't want to touch it by the time Linus let them talk him into data=writeback on ext3, and thus left reiserfs (which again had long been data=ordered by default by then) well enough alone. But I did help a few people running ext3 trace down their new ext3 stability issues to that bad data=writeback experiment, and persuaded them to specify data=ordered, which solved their problems, so indeed they /were/ data=writeback related. And happily, Linus did eventually regain his senses and return ext3 to data=ordered by default once again. And based on what you said, ext4 still has all three data= options, including data=journal. But I wasn't sure on that myself (tho I would have assumed it inherited it from ext3) and thus am /definitely/ not sure whether it inherits ext3's data=journal speed advantages in certain corner-cases. I have no idea whether other journaled filesystems allow choosing the journal level or not, tho. I only know of those three. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-15 16:00 ` Austin S. Hemmelgarn 2015-12-16 9:15 ` Duncan @ 2015-12-16 9:55 ` Duncan 2015-12-17 2:09 ` Christoph Anton Mitterer 2 siblings, 0 replies; 12+ messages in thread From: Duncan @ 2015-12-16 9:55 UTC (permalink / raw) To: linux-btrfs Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as excerpted: > AFAIUI, checksums are stored per-instance for every block. This is > important in a multi-device filesystem in case you lose a device, so > that you still have a checksum for the block. There should be no > difference between extent layout and compression between devices > however. I don't believe that's quite correct. What is correct, to the best of my knowledge, is that checksums are metadata, and thus have whatever duplication/parity level metadata is assigned. For single devices, that is of course by default dup, 2X the metadata and thus 2X the checksums, both on the single data (as effectively the only choice on a single device, at least thru 4.3, tho there's a patch adding dup data as an option that I think should be in 4.4) when covering data, dup metadata when covering it. For multiple devices, it's default raid1 metadata, default single data, so the picture doesn't differ much by default from the single-device default picture. It's also possible to do single metadata, raidN data, which really doesn't make sense except for raid0 data, and thus I believe there's a warning about that sort of layout in newer mkfs.btrfs, or when lowering the metadata redundancy using balance filters. But of course it's possible to do raid1 data and metadata, which would be two copies of each, regardless of the number of devices (except that it's 2+, of course). But the copies aren't 1:1 assigned. That is, if they're equal generation, btrfs can read either checksum and apply it to either data/metadata block. 
(Of course if they're not equal generation, btrfs will choose the higher one, thus covering the case of writing at the time of a crash, since either they will both be the same generation if the root block wasn't updated to the new one on either one yet, or one will be a higher/newer generation than the other, if it had already finished writing one but not the other at the time of the crash.) This is why it's an extremely good idea if you have a pair of devices in raid1, and you mount one of them degraded/writable with the other unavailable for some reason, that you don't also mount the other one writable and then try to recombine them. Chances are the generations wouldn't match and it'd pick the one with the higher generation, but if they did for some reason match, and both checksums were valid on their data, but the data differed... either one could be chosen, and a scrub might choose either one to fix the other, as well, which could in theory result in a file with intermixed blocks from the two different versions! Just ensure that if one is mounted writable, it's the only one mounted writable if there's a chance of recombining, and you'll be fine, as it'll be the only one with advancing generations. And if by some accident both are mounted writable separately, the best bet is to be sure and wipe one of them, then add it as a new device, if you're going to reintroduce it to the same filesystem. Of course this gets a bit more complicated with 3+ device raid1, since currently, there are still only two copies of each block and two copies of the checksum, meaning there's at least one device without a copy of each block, and if the filesystem is mounted degraded writable repeatedly with a random device missing... Similarly, the permutations can be calculated for the other raid types, and for mixed raid types like raid6 data (specified) and raid1 metadata (unspecified so the default used), but I won't attempt that here. -- Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-15 16:00 ` Austin S. Hemmelgarn 2015-12-16 9:15 ` Duncan 2015-12-16 9:55 ` Duncan @ 2015-12-17 2:09 ` Christoph Anton Mitterer 2015-12-21 13:36 ` Austin S. Hemmelgarn 2 siblings, 1 reply; 12+ messages in thread From: Christoph Anton Mitterer @ 2015-12-17 2:09 UTC (permalink / raw) To: Austin S. Hemmelgarn, linux-btrfs On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote: > > Well sure, I think we've done most of this and have dedicated > > controllers, at least of a quality that funding allows us ;-) > > But regardless how much one tunes, and how good the hardware is. If > > you'd then always lose a fraction of your overall IO, and be it > > just > > 5%, to defragging these types of files, one may actually want to > > avoid > > this at all, for which nodatacow seems *the* solution. > nodatacow only works for that if the file is pre-allocated, if it > isn't, > then it still ends up fragmented. Hmm is that "it may end up fragmented" or "it will definitely"? Cause I'd have hoped that, if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks. > > > The problem is not entirely the lack of COW semantics, it's also > > > the > > > fact that it's impossible to implement an atomic write on a hard > > > disk. > > Sure... but that's just the same for the nodatacow writes of data. > > (And the same, AFAIU, for CoW itself, just that we'd notice any > > corruption in case of a crash due to the CoWed nature of the fs and > > could go back to the last generation). > Yes, but it's also the reason that using either COW or a log- > structured > filesystem (like NILFS2, LogFS, or I think F2FS) is important for > consistency. So then there's no reason why it shouldn't work.
The meta-data is CoWed, so any incomplete writes of checksum data in it (be it for CoWed data or no-CoWed data, should the latter be implemented) would be protected at that level. Currently, the no-CoWed data is, AFAIU, completely at risk of being corrupted (no checksums, no journal). Checksums on no-CoWed data would just improve that. > > What about VMs? At least a quick google search didn't give me any > > results on whether there would be e.g. checksumming support for > > qcow2. > > For raw images there surely is not. > I don't mean that the VMM does checksumming, I mean that the guest OS > should be the one to handle the corruption. No sane OS doesn't run > at > least some form of consistency checks when mounting a filesystem. Well but we're not talking about having a filesystem that "looks clean" here. For this alone we wouldn't need any checksumming at all. We talk about data integrity protection, i.e. all files and their contents. Nothing which a fsck inside a guest VM would ever notice, if there are just some bit flips or things like that. > > > > And even if DBs do some checksumming now, it may be just a > > consequence > > of that missing in the filesystems. > > As I've written somewhere else in the previous mail: it's IMHO much > > better if one system takes care of this, where the code is well > > tested, > > than each application doing its own thing. > That's really a subjective opinion. The application knows better > than > we do what type of data integrity it needs, and can almost certainly > do > a better job of providing it than we can. Hmm I don't see that. When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid. What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it.
And in practice it seems far more likely that, if countless applications have to do such a task on their own, it's either more error prone (that's why we have libraries for all kinds of code, trying to reuse code, minimising the possibility of errors in countless home-brew solutions), or not done at all. > > > > - the data was written out correctly, but before the csum > > > > was > > > > written the system crashed, so the csum would now tell us > > > > that > > > > the > > > > block is bad, while in reality it isn't. > > > There is another case to consider, the data got written out, but > > > the > > > crash happened while writing the checksum (so the checksum was > > > partially > > > written, and is corrupt). This means we get a false positive on > > > a > > > disk > > > error that isn't there, even when the data is correct, and that > > > should > > > be avoided if at all possible. > > I've had that, and I've left it quoted above. > > But as I've said before: That's one case out of many? How likely is > > it > > that the crash happens exactly after a large data block has been > > written followed by a relatively tiny amount of checksum data. > > I'd assume it's far more likely that the crash happens during > > writing > > the data. > Except that the whole metadata block pointing to that data block gets > rewritten, not just the checksum. But that's the case anyway, isn't it? With or without checksums. > > And regarding "reporting data to be in error, which is actually > > correct"... isn't that what all journaling systems may do? > No, most of them don't actually do that. The general design of a > journaling filesystem is that the journal is used as what's called a > Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm > going > to write this data here in a little while.' so that when your system > dies while writing that data, you can then finish writing it > correctly > when the system gets booted up again.
And in particular, the only > journaling filesystem that I know of that even allows the option of > journaling the file contents instead of just metadata is ext4. Well but that's just what I say... the system crashes,... the journal tells about anything that's not for sure cleanly on disk, even though it may have actually made it. Nothing more than what would happen in our case. > > And, AFAIU, isn't that also what can happen in btrfs? The data was > > already CoWed, but the metadata wasn't written out... so it would > > fall > > back somehow - here's where the unicorn[0] does its job - to an > > older > > generation? > Kind of, there are some really rare cases where it's possible if you > get > _really_ unlucky on a multi-device filesystem that things get > corrupted > such that the filesystem thinks that data that is perfectly correct > is > invalid, and thinks that the other copy which is corrupted is valid. > (I've actually had this happen before, it was not fun trying to > recover > from it). Doesn't really speak against nodatacow checksumming, AFAICS. > > Well it was clear to me that data+csum aren't sequentially on disk; > > are > > there any numbers from real studies how often it would happen that > > data > > is written correctly but not the metadata? > > And even if such a study would show that - crash isn't the only > > problem > > we want to protect here (silent block errors, bus errors, etc). > > I don't want to say crashes never happen, but in my practical > > experience they don't happen that often either,... > > > > Losing a few blocks of valid data in the rare case of crashes > > seems to > > be a penalty worth paying, when one gains confidence in data integrity in > > all > > others. > That _really_ depends on what the data is. If you made that argument > to > the IT department at a financial institution, they would probably > fall > over laughing at you.
Well, but your point is moot, because someone who cares that much about their data wouldn't use nodatacow while btrfs has no journal and the data could end up in any state in case of a crash. And I'm quite certain that every financial institution would rather get a clear error message (i.e. because the checksums don't verify), after which they can restore from a backup, than have corrupt data taken for valid and the debts of their customers zeroed. It's kinda strange how you argue against better integrity protection ;-) > > But that's nothing the fs could or should decide for the user. > OK, good point about this being policy. And in some cases > (executables, > configuration for administrative software, similar things), it is > better > to just return an error, but in many cases, that's not what most > desktop > users would want. Think document files, where a single byte error > could > easily be corrected by the user, or configuration files for sanely > written apps (It's a lot nicer (and less confusing for someone > without a > lot of low-level computer background) to say 'Hey, your configuration > file is messed up, here's how to fix it', than it is to say 'Hey, I > couldn't read your configuration file'). And because BTRFS is > supposed > to be a general purpose filesystem, it has to account for the case of > desktop users, and because server admins are supposed to be smart, > the > default should be for desktop usage. Well but that's just the point I've made. The fs cannot decide what's better or not. Your document could be an important config file that allows/disallows remote users access to resources. The single byte error could turn a 0 into a 1, allowing worldwide access. It could be your thesis' data, or part of the document file, changing some numbers, which you won't easily notice but which makes everything bogus when examined. I had brought up the example with the video file, where it may not matter. But in any case it's nothing the fs can decide.
The best it can do is give an error on read, and provide the tools to give clearance to such files (when they could not be auto-recovered from e.g. other copies). All this is however only possible with checksumming. > > a) Are checksums really stored per device (and not just once in the > > metadata)? At least from my naive understanding this would either > > mean > > that there's a waste of storage, or that the csums are made on data > > that could vary from device to device (e.g. the same data split up > > in > > different extents, or compression on one device but not on the > > other). > > but.. > AFAIUI, checksums are stored per-instance for every block. This is > important in a multi-device filesystem in case you lose a device, so > that you still have a checksum for the block. There should be no > difference between extent layout and compression between devices > however. Hmm, but if that's the case, especially the latter, that the extents are the same on all devices,... then there's IMHO no need for checksums being stored per-instance (I guess you mean per device instance?) for every block. The meta-data would have e.g. DUP anyway, so even if one device fails, metadata would hopefully still be there. And if metadata is completely lost, the fs is lost anyway, and csums don't matter anymore. > > b) that problem (different data each with valid corresponding > > csums) > > should in principle exist for CoWed data as well, right? And there, > > I > > guess, it's solved by CoWing the metadata... (which would still be > > the > > case for no-dataCoWed files). > Yes. > > Don't know what btrfs does in the CoWed case when such incident > > happens... how does it decide which of two such corresponding > > blocks > > would be the newer one? The generations? > Usually, but like I mentioned above there are edge cases that can > occur > as a result of data corruption on disk or other really rare > circumstances.
In the particular case of multiple copies of a block > with different data but valid checksums, I'm about 95% certain that > it > will non-deterministically return one block or the other on an > arbitrary > read when the read doesn't hit the VFS cache. Hmm, it would be quite worrisome if that could happen, especially also in the CoW case. > This is a potential issue > for COW as well, but much less likely because it can more easily > detect > the corruption and fix it. But then again, there should be no difference to checksumming the no-CoWed data - the checksums would be CoWed again, so if btrfs can detect it there, it should be fine. > > > > Anyway, since metadata would still be CoWed, I think I may have > > gotten > > once again out of the tight spot - at least until you explain me, > > why > > my naive understanding, as laid out just above, doesn't work out > > O:-) > Hmm, I had forgotten about the metadata being COW, that does avoid > the > situation above under the specified circumstances, but does not avoid > it > happening due to disk errors (although that's extremely unlikely, as > it > would require direct correlation of the errors in a way that is > statistically impossible). Ah... here we go :-) What exactly do you mean by disk errors here? IOW, what scenario do you have in mind in which checksumming no-CoWed data could lead to any more corruption than can occur without checksumming as well, or where any inconsistencies could get into the filesystem's meta-data that couldn't already come in for checksummed+CoWed data and/or non-checksummed+CoWed data?
> > Well, for PostgreSQL it's still fairly new (9.3, as I've said > > above, > > https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums), > > but it's not done by default > > (http://www.postgresql.org/docs/current/static/app-initdb.html), and they warn about a noticeable > > performance penalty (though I have of course no data whether this > > would > > be better/similar/worse compared to what is implied by btrfs checksumming). > > > > I've tried to find something for MySQL/MariaDB, but the only thing > > I > > could find there was: CHECKSUM TABLE > > But that seems to be a SQL command, i.e. not on-read checksumming > > as > > we're talking about, but rather something the application/admin > > would > > need to do manually. > I actually had been referring to this, with the assumption that the > application would use it to verify its own data. I hadn't realized > PostgreSQL had in-line support for it. Well but the (fairly new) in-line support is the only thing that we can really count here. What MySQL does is require the app to do it. a) It's likely that there are many apps which don't use this (maybe simply because they don't know it) and it's unlikely they'll all change. While what we can do at the btrfs level (or what PostgreSQL would do) works out of the box for everything. b) I may simply not understand CHECKSUM TABLE very well, but to me it doesn't seem useful for providing data integrity in the sense we're talking about here (i.e. silent block errors, bus errors, etc.) Why? First, it seems to checksum the whole data of the whole table, and AFAICS, it uses only CRC32... given that such tables may easily be GiB in size, CRC32 is IMHO simply not good enough. PostgreSQL/btrfs in turn do the checksums on much smaller amounts of data. Second, verification seems to only take place when that command is called.
I'm not sure whether it implies locking the table in memory then (didn't dig too deep), but I can't believe it would; which system could keep a 100 GiB table in memory? So it seems to be basically a one-shot verification, not covering any corruptions that happen in between. In fact, the documentation of the function even says that this is for backups/rollbacks/etc. only... so it's absolutely not the kind of data integrity protection we're talking about (and even for that purpose, CRC32 seems to be a poor choice). > VDI is still widely used, because it's the default for Virtual Box > when > creating a VM. Guess I just disbelieved that VirtualBox is still widely used O;-) > VHD is way more widely used than it should be, solely > because there are insane people out there using Windows as a > virtualization host. You also forgot VMDK, which is what VMWare uses > almost exclusively, but I don't think it has built-in checksumming. > > As for Xen, the BCP are to avoid using image files like the plague, > and > use disks directly instead (or more commonly, use either LVM, or ZFS > with zvols). Anyway... what it comes down to: None of the VM image formats seem to support checksumming. > > So given all that, the picture looks a bit different again, I > > think. > > None of the major FLOSS DBs does any checksumming by default, > > MySQL > > doesn't seem to support it, AFAICT. No VM image format seems to > > even > > support it. > Again, most of my intent in referring to those was that the > application > or the Guest OS would do the verification itself. I've answered that above already, IIRC (our mails get too lengthy O:-) ). The guest OS doesn't verify more than what our typical host OS (Linux) does. And that (except when btrfs with CoWed data is used ;-) ) is filesystem integrity verification - which is however not data integrity verification. btw: That makes me think about something interesting: If btrfs will ever support checksumming on no-CoWed data...
then the documentation should describe that, depending on the actual scenario, it may make sense for btrfs filesystems inside the guest to generally run with nodatasum. The idea being: why verify twice? The constraints being, AFAICS, the following: - If the VM image is ever to be moved off the host's btrfs (which would have checksumming enabled) to a fs without checksumming, or if it is ever copied remotely, then not having the "internal" checksumming (i.e. from the btrfs filesystems inside the guest) would make one lose the integrity protection - It would further only work if the hypervisor itself properly passes on any IO errors it gets when reading from the image files in the host's btrfs, as block IO errors to the guest. If it didn't, then the guest (with disabled "internal" checksumming) probably wouldn't notice any data integrity errors, which it would if the "internal" checksumming wasn't turned off. > If the application doesn't have that type of thing built in, then > that's > not something the filesystem should be worrying about, that's the job > of > the application developers to deal with. No. If you see it like that, you could as well drop data checksumming in btrfs completely. You'd argue anyway that it would be the application's duty to do that. The same way you could argue that your MP4, JPG, ISO image or whatever you downloaded via bittorrent needs to contain checksum data, and that the actual application (which is not bittorrent, but e.g. mplayer, imagemagick or wodim) needs to verify these. However this is not the case. In fact, it's quite usual and proper to have the lower layer handle stuff which the higher layers don't have any real direct interest in (except for the case that the lower layer doesn't do it).
> The point of a filesystem is > to store data within the integrity guarantees provided by the > hardware, > possibly with some additional protection If you're convinced of that, you should probably propose that btrfs removes data checksumming altogether. I guess you won't make many friends with that idea ;) > In the case of stuff like torrents and such, all the good software > for > working with them has an option to verify the file after downloading. Not sure what you mean by "and such"; if it's again VMs and DBs (IMHO the actually more important use case than file sharing), I showed you in the last mail that none of these do any verification by default, and half of the important ones don't even support it, with nothing on the horizon suggesting that this would change (probably because they argue just the other way round than you do: the fs should handle data integrity, and ZFS and btrfs prove them partially right ;-) ). And I'm not sure about torrents, but I'd have suspected that once the file is downloaded completely, any checksumming data is no longer kept. If my guess is correct, even the torrent software doesn't really do overall data integrity protection, but only until the download is finished; at least this used to be the case with the other P2P network software. Thanks for the discussion so far :) It actually made me just more confident that no-CoWed data checksumming should work and is actually needed ;) Cheers, Chris.
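The crash window argued about in this message can be made concrete with a small Python sketch. It models data blocks that are overwritten in place (as with nodatacow) while their checksums are committed in a separate step, and shows that a crash between the two steps gets reported exactly like a real bit flip. ToyVolume, write_block, crash_before_commit and the 4-byte block size are hypothetical choices made for this illustration, and zlib.crc32 is only a stand-in checksum. It is a toy model under those assumptions and says nothing about how btrfs would actually implement checksums for nodatacow files.

```python
import zlib
from typing import List

BLOCK = 4  # tiny block size, only to keep the example readable


class ToyVolume:
    """Hypothetical model: data overwritten in place, checksums committed separately."""

    def __init__(self, data: bytes) -> None:
        self.blocks = [bytearray(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]
        self.csums = [zlib.crc32(bytes(b)) for b in self.blocks]

    def write_block(self, idx: int, new: bytes, crash_before_commit: bool = False) -> None:
        self.blocks[idx][:] = new              # in-place data write (no CoW)
        if crash_before_commit:
            return                             # the checksum update never reaches disk
        self.csums[idx] = zlib.crc32(new)      # checksum (metadata) commit

    def verify(self) -> List[int]:
        """Return the indices of blocks whose data does not match the stored checksum."""
        return [i for i, b in enumerate(self.blocks)
                if zlib.crc32(bytes(b)) != self.csums[i]]


vol = ToyVolume(b"AAAABBBBCCCC")
vol.write_block(1, b"bbbb")
print(vol.verify())    # [] : normal write, data and checksum agree

vol.write_block(2, b"cccc", crash_before_commit=True)
print(vol.verify())    # [2] : the new data is intact, but it is flagged anyway

vol.blocks[0][0] ^= 0x01   # simulate a silent single-bit error
print(vol.verify())    # [0, 2] : the real corruption is flagged in the same way
```

In this model the verifier cannot tell the interrupted write from the bit flip; both are simply reported as mismatches. Whether an occasional false alarm after a crash is an acceptable price for catching silent corruption is the policy question the two sides of the thread disagree on.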
* Re: dear developers, can we have notdatacow + checksumming, plz? 2015-12-17 2:09 ` Christoph Anton Mitterer @ 2015-12-21 13:36 ` Austin S. Hemmelgarn 2015-12-22 9:12 ` Duncan 0 siblings, 1 reply; 12+ messages in thread From: Austin S. Hemmelgarn @ 2015-12-21 13:36 UTC (permalink / raw) To: Christoph Anton Mitterer, linux-btrfs On 2015-12-16 21:09, Christoph Anton Mitterer wrote: > On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote: >>> Well sure, I think we've done most of this and have dedicated >>> controllers, at least of a quality that funding allows us ;-) >>> But regardless how much one tunes, and how good the hardware is. If >>> you'd then always lose a fraction of your overall IO, and be it >>> just >>> 5%, to defragging these types of files, one may actually want to >>> avoid >>> this at all, for which nodatacow seems *the* solution. >> nodatacow only works for that if the file is pre-allocated, if it >> isn't, >> then it still ends up fragmented. > Hmm is that "it may end up fragmented" or "it will definitely"? > Cause I'd have hoped, that if nothing else had been written in the > meantime, btrfs would perhaps try to write next to the already > allocated blocks. If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated. > > >>>> The problem is not entirely the lack of COW semantics, it's also >>>> the >>>> fact that it's impossible to implement an atomic write on a hard >>>> disk. >>> Sure... but that's just the same for the nodatacow writes of data. >>> (And the same, AFAIU, for CoW itself, just that we'd notice any >>> corruption in case of a crash due to the CoWed nature of the fs and >>> could go back to the last generation). >> Yes, but it's also the reason that using either COW or a log- >> structured >> filesystem (like NILFS2, LogFS, or I think F2FS) is important for >> consistency.
> So then it's no reason why it shouldn't work. > The meta-data is CoWed, any incomplete writes of checksumdata in that > (be it for CoWed data or no-CoWed data, should the later be > implemented), would be protected at that level. > > Currently, the no-CoWed data is, AFAIU completely at risk of being > corrupted (no checksums, no journal). > > Checksums on no-CoWed data would just improve that. Except that without COW semantics on the data blocks, you can't be sure whether the checksum is for the data that is there, the data that was going to be written there, or data that had been there previously. This will significantly increase the chances of having false positives, which really isn't a viable tradeoff. > > >>> What about VMs? At least a quick google search didn't give me any >>> results on whether there would be e.g. checksumming support for >>> qcow2. >>> For raw images there surely is not. >> I don't mean that the VMM does checksumming, I mean that the guest OS >> should be the one to handle the corruption. No sane OS doesn't run >> at >> least some form of consistency checks when mounting a filesystem. > Well but we're not talking about having a filesystem that "looks clear" > here. For this alone we wouldn't need any checksumming at all. > > We talk about data integrity protection, i.e. all files and their > contents. Nothing which a fsck inside a guest VM would ever notice (I > mean by a fsck), if there are just some bit flips or things like that. That really depends on what is being done inside the VM. If you're using BTRFS or even dm-verity, you should have no issues detecting the corruption. > > >>> >>> And even if DBs do some checksumming now, it may be just a >>> consequence >>> of that missing in the filesystems. >>> As I've written somewhere else in the previous mail: it's IMHO much >>> better if one system takes care on this, where the code is well >>> tested, >>> than each application doing it's own thing. >> That's really a subjective opinion. 
>> The application knows better than we do what type of data integrity it needs, and can almost certainly do a better job of providing it than we can.
> Hmm I don't see that.
> When we, at the filesystem level, provide data integrity, then all data is guaranteed to be valid.
> What more should an application be able to provide? At best they can do the same thing faster, but even for that I see no immediate reason to believe it.
Any number of things. As of right now, there are no local filesystems on Linux that provide:
1. Cryptographic verification of the file data (technically possible with IMA and EVM, or with dm-verity if the data is supposed to be read-only, but those require extra setup and aren't part of the FS).
2. Erasure coding other than what is provided by RAID5/6 (at least one distributed cluster filesystem, Ceph, provides this, but running such a FS on a single node is impractical).
3. Efficient transactional logging (for example, the type that is needed by most RDBMS software).
4. Easy selective protections (some applications need only part of their data protected).

Item 1 can't really be provided by BTRFS under its current design; it would require at least implementing support for cryptographically secure hashes in place of CRC32c (and each attempt to do that has been pretty much shot down).
Item 2 is possible, and is something I would love to see support for, but it would require a significant amount of coding, and almost certainly wouldn't be anywhere near as flexible as letting the application do it itself.
Item 3 can't be done without making the filesystem application-specific, because you need to know enough about the data being logged to do it efficiently (see the original Oracle Cluster Filesystem, not OCFS2, for an example: it was designed solely for Oracle's database software).
Item 4 is technically possible, but not all that practical, as the amount of metadata required to track different levels of protection within a file would be prohibitive.

> And in practise it seems far more likely that if countless applications should do such a task on their own, that it's more error prone (that's why we have libraries for all kinds of code, trying to reuse code, minimising the possibility of errors in countless home-brew solutions), or not done at all.
Yes, and _all_ the libraries are in userspace, which is even more of an argument for the protection being done there.

>>>>> - the data was written out correctly, but before the csum was written the system crashed, so the csum would now tell us that the block is bad, while in reality it isn't.
>>>> There is another case to consider, the data got written out, but the crash happened while writing the checksum (so the checksum was partially written, and is corrupt). This means we get a false positive on a disk error that isn't there, even when the data is correct, and that should be avoided if at all possible.
>>> I've had that, and I've left it quoted above.
>>> But as I've said before: That's one case out of many? How likely is it that the crash happens exactly after a large data block has been written followed by a relatively tiny amount of checksum data.
>>> I'd assume it's far more likely that the crash happens during writing the data.
>> Except that the whole metadata block pointing to that data block gets rewritten, not just the checksum.
> But that's the case anyway, isn't it? With or without checksums.
Yes, and it's also one of the less well documented failure modes for nodatacow. If the data is COW, then BTRFS doesn't even look at the new data, because the only metadata block that points to it is invalid, so you see old data, but you are also guaranteed to see verified data.
> > > >>> And regarding "reporting data to be in error, which is actually >>> correct"... isn't that what all journaling systems may do? >> No, most of them don't actually do that. The general design of a >> journaling filesystem is that the journal is used as what's called a >> Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm >> going >> to write this data here in a little while.' so that when your system >> dies while writing that data, you can then finish writing it >> correctly >> when the system gets booted up again. And in particular, the only >> journaling filesystem that I know of that even allows the option of >> journaling the file contents instead of just metadata is ext4. > Well but that's just what I say... the system crashes,... the journal > tells about anything that's not for sure cleanly on disk, even though > it may have actually made it it. Except, like I said, it doesn't track data, only metadata, so only stuff for which allocations changed would be covered by the journal. > > Nothing more than what would happen in our case. > > >>> And, AFAIU, isn't that also what can happen in btrfs? The data was >>> already CoWed, but the metadata wasn't written out... so it would >>> fall >>> back somehow - here's where the unicorn[0] does it's job - to an >>> older >>> generation? >> Kind of, there are some really rare cases where it's possible if you >> get >> _really_ unlucky on a multi-device filesystem that things get >> corrupted >> such that the filesystem thinks that data that is perfectly correct >> is >> invalid, and thinks that the other copy which is corrupted is valid. >> (I've actually had this happen before, it was not fun trying to >> recover >> from it). > Doesn't really speak against nodatacow checksumming, AFAICS. You're right, it was more meant to point out that even with COW, stuff can get confused if you're really unlucky. 
> > >>> Well it was clear to me, that data+csum isn't sequentially on disk >>> are >>> there any numbers from real studies how often it would happen that >>> data >>> is written correctly but not the metadata? >>> And even if such study would show that - crash isn't the only >>> problem >>> we want to protect here (silent block errors, bus errors, etc). >>> I don't want to say crashes never happen, but in my practical >>> experience they don't happen that often either,... >>> >>> Losing a few blocks of valid data in the rare case of crashes, >>> seems to >>> be a penalty worth, when one gains confidence in data integrity in >>> all >>> others. >> That _really_ depends on what the data is. If you made that argument >> to >> the IT department at a financial institution, they would probably >> fall >> over laughing at you. > Well but your point is completely moot, because for someone who cares > so much in data, they wouldn't use nodatacow when btrfs has no journal > and the data could end up in any state in case of crash. > > And I'm quite certain that each financial institution rather clearly > gets an error message (i.e. because the checksums don't very), after > which they can get a backup, than having corrupt data taking for valid, > and the debts of their customers being zeroed. That all assumes that the administrators in question are smart. This is _never_ a safe assumption unless you have personally verified it, and even then it's still not a particularly safe assumption. > > It's kinda strange how you argue against better integrity protection > ;-) The point was that your argument that 'losing a few blocks of valid data on a crash is worth it for better integrity' was pretty far fetched. For almost all applications out there, losing known good data or getting false errors is never something that should happen. > > >>> But that's nothing the fs could or should decide for the user. >> OK, good point about this being policy. 
>> And in some cases (executables, configuration for administrative software, similar things), it is better to just return an error, but in many cases, that's not what most desktop users would want. Think document files, where a single byte error could easily be corrected by the user, or configuration files for sanely written apps (It's a lot nicer (and less confusing for someone without a lot of low-level computer background) to say 'Hey, your configuration file is messed up, here's how to fix it', than it is to say 'Hey, I couldn't read your configuration file'). And because BTRFS is supposed to be a general purpose filesystem, it has to account for the case of desktop users, and because server admins are supposed to be smart, the default should be for desktop usage.
> Well but that's just the point I've made. The fs cannot decide what's better or not.
> Your document could be an important config file that allows/disallows remote users access to resources. The single byte error could make a 0 to a 1, allowing world wide access.
That's not something that falls within actual 'Desktop' usage, that's server usage.
> It could be your thesis' data, or part of the document file, changing some numbers, which you won't easily notice but which makes everything bogus when examined.
And if you're writing a thesis, or some other research paper, you'd darn well better be verifying your data multiple times before you publish it.
> I had brought the example with the video file, where it may not matter.
It really doesn't in the case of a video file, or most audio files, or even some image files. If you take almost any arbitrary video file, and change any one bit outside of the header, then unless it's very poor quality to begin with, it's almost certain that nobody will notice (and in the case of the good formats, it'll just result in a dropped frame, because they have built-in verification).
> > But in any case it's nothing what the fs can decide. The best it can do > is give an error on read, and the tools to give clearance to such files > (when they could not be auto-recovered by e.g. other copies). > > All this is however only possible with checksumming. Or properly educating users so they don't use nodatacow on everything. It's just like journal=writeback on ext4, it improves performance for some things, but can result in really weird inconsistencies when the system crashes. > > >>> a) Are checksums really stored per device (and not just once in the >>> metadata? At least from my naive understanding this would either >>> mean >>> that there's a waste of storage, or that the csums are made on data >>> that could vary from device to device (e.g. the same data split up >>> in >>> different extents, or compression on one device but not on the >>> other). >>> but.. >> AFAIUI, checksums are stored per-instance for every block. This is >> important in a multi-device filesystem in case you lose a device, so >> that you still have a checksum for the block. There should be no >> difference between extent layout and compression between devices >> however. > hmm but if that's the case, especially the later, that the extents are > the same on all devices,... then there's IMHO no need for data being > stored per-instance (I guess you mean per device instance?) for every > block. > The meta-data would have e.g. DUP anyway, so even if one device fails > metadata would hopefully be still there. > And if metadata is coompletely lost, the fs is lost anyway, and csums > don't matter anymore. OK, as Duncan pointed out in one of his replies, I was only correct by coincidence. checksums are stored based on metadata redundancy, so if metadata is raid1 or dup, you have two copies of each checksum. > > >>> b) that problem (different data each with valid corresponding >>> csums) >>> should in principle exist for CoWed data as well, right? 
>>> And there, I guess, it's solved by CoWing the metadata... (which would still be the case for no-dataCoWed files).
>> Yes.
>>> Don't know what btrfs does in the CoWed case when such incident happens... how does it decide which of two such corresponding blocks would be the newer one? The generations?
>> Usually, but like I mentioned above there are edge cases that can occur as a result of data corruption on disk or other really rare circumstances. In the particular case of multiple copies of a block with different data but valid checksums, I'm about 95% certain that it will non-deterministically return one block or the other on an arbitrary read when the read doesn't hit the VFS cache.
> Hmm would be quite worrisome if that could happen, especially also in the CoW case.
The thing is, this can't be fully protected against, except by verifying the blocks against each other when you read them, which will absolutely kill performance. The chance of this happening (without actively malicious intent) with COW on everything is extremely small (it requires a very large number of highly correlated errors), but having nodatacow enabled makes it slightly higher. In both cases it's statistically impossible, but that just means it's something that almost certainly won't happen, and thus we shouldn't worry about dealing with it until we have everything else covered.

>> This is a potential issue for COW as well, but much less likely because it can more easily detect the corruption and fix it.
> But then again, there should be no difference to checksumming the no-CoWed data - the checksums would be CoWed again, if btrfs can detect it there, it should be fine.
> > > > >>> >>> Anyway, since metadata would still be CoWed, I think I may have >>> gotten >>> once again out of the tight spot - at least until you explain me, >>> why >>> my naive understanding, as laid out just above, doesn't work out >>> O:-) >> Hmm, I had forgotten about the metadata being COW, that does avoid >> the >> situation above under the specified circumstances, but does not avoid >> it >> happening due to disk errors (although that's extremely unlikely,a s >> it >> would require direct correlation of the errors in a way that is >> statistically impossible). > Ah... here we go :-) > > What exactly do you mean with disk errors here? IOW, what scenario do > you think of, in which checksumming no-CoWed data could lead to any > more corruptions than it can to without checksumming as well, or where > any inconsistencies could get into the filesystem's meta-data, that > couldn't already come in for checksummed+CoWed data and/or non- > checksummed+CoWed data? It can't (AFAICS) lead to any more _actual_ corruption, but it very much can lead to more false positives in the error detection, which is by definition a regression. > >>> Well, for PostgreSQL it's still fairly new (9.3, as I've said >>> above, ht >>> tps://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_ >>> Chec >>> ksums), but it's not done per default (http://www.postgresql.org/do >>> cs/c >>> urrent/static/app-initdb.html), and they warn about a noticable >>> performance benefit (though I have of course no data whether this >>> would >>> be better/similar/worse to what is implied by btrfs checksumming). >>> >>> I've tried to find something for MySQL/MariaDB, but the only thing >>> I >>> could find there was: CHECKSUM TABLE >>> But that seems to be a SQL command, i.e. not on-read checksumming >>> as >>> we're talking about, but rather something the application/admin >>> would >>> need to do manually. 
>> I actually had been referring to this, with the assumption that the >> application would use it to verify it's own data. I hadn't realized >> PostgreSQL had in-line support for it. > Well but the (fairly new) in-line support, is the only thing that we > can really count here. > > What mysql does is that it requires the app to do it. > a) It's likely that there are many apps which don't use this (maybe > simply because they don't know it) and it's unlikely they'll all > change. > While what we can do at the btrfs level (or what postgresql would do) > works out of the box for everything. It works out of the box for everything, but it's also sub-optimal protection for almost everything that actually requires data integrity. > > b) I may simply not very well understand the CHECKSUM TABLE, but to me > it doesn't seem useful to provide data integrity in the sense we're > talking here about (i.e. silent block errors, bus errors, etc.) > Why? > First, it seems to checksum the whole data of the whole table, and > AFAICS, it uses only CRC32... given that such tables may be easily GiB > in size, CRC32 is IMHO simply not good enough. > Postgresql/btrfs in turn do the checksums on much smaller amounts of > data. > Second, verification seems to only take place when that command is > called. I'm not sure whether it implies locking the table in memory > then (didn't dig too deep), but I can't believe it would, which system > could keep a 100 GiB table in mem? > So it seems to be basically a one-shot verification, not covering any > corruptions that happen in between. > In fact, the documentation of the function even tells that this is for > backups/rollbacks/etc. only... so it's absolutely not that kind of data > integrity protection we're talking about (and even for that purpose, > CRC32 seems to be a poor choice). > > >> VDI is still widely used, because it's the default for Virtual Box >> when >> creating a VM. 
> Guess I just disbelieved that VirtualBox is still widely used O;-)
On a commercial level, it really isn't (I don't even think that Oracle uses it internally any more). On a personal level, it very much is, because too many people are too stupid because of Windows to learn to use stuff like QEMU or Xen.

>> VHD is way more widely used than it should be, solely because there are insane people out there using Windows as a virtualization host. You also forgot VMDK, which is what VMWare uses almost exclusively, but I don't think it has built-in checksumming.
>>
>> As for Xen, the BCP are to avoid using image files like the plague, and use disks directly instead (or more commonly, use either LVM, or ZFS with zvols).
> Anyway... what it comes down to: None of the VM image formats seem to support checksumming.

>>> So given all that, the picture looks a bit different again, I think.
>>> None of the major FLOSS DBs does any checksumming by default, MySQL doesn't seem to support it, AFAICT. No VM image format seems to even support it.
>> Again, most of my intent in referring to those was that the application or the Guest OS would do the verification itself.
> I've answered that above already, IIRC (our mails get too lengthy O:-) ).
> The guest OS doesn't verify more than what our typical host OS (Linux) does.
> And that (except when btrfs with CoWed data is used ;-) does filesystem integrity verification - which is however not data integrity verification.
>
> btw: That makes me think about something interesting:
> If btrfs will ever support checksumming on no-CoWed data... then the documentation should describe, that depending on the actual scenario, it may make sense that btrfs filesystems inside the guest run generally with nodatasum.
> The idea being: Why verify twice?
That gets to be a particularly dangerous recommendation, because lots of people (who arguably shouldn't be messing around with such stuff to begin with) will likely think it means that they can turn it off unconditionally in the guest system, which really isn't safe for anything that might be moved to some other FS. > > The constraints being, AFAICS, the following: > - If the VM image is ever to be moved off the host's btrfs image > (which > would have checksumming enabled) to a fs without checksumming or if > it would be ever copied remotely, than not having the "internal" > checksumming (i.e. from the btrfs filesystems inside the gues), > would > make one loose the integrity protection > - It further would only work, if the hypervisor itself, would properly > pass on any IO errors when it reads from the image files in the > host's btrfs, to block IO errors for the guest. If it wouldn't, then > the guest (with disabled "internal" checksumming) wouldn't probably > notice any data integrity errors, which he would, if the "internal" > checksumming wasn't turned off. > > >> If the application doesn't have that type of thing built in, then >> that's >> not something the filesystem should be worrying about, that's the job >> of >> the application developers to deal with. > No. > > If you see it like that, you could as well drop data checksumming in > btrfs completely. > You'd anyway argue that it would be the applications duty to do that. No, BTRFS's job is to verify that the data it returns matches what it was given in the first place. That is not reliably possible without having COW semantics on data blocks. > > The same way you could argue, that your MP4, JPG, ISO image or whatever > you downloaded via bittorrent needs to contain checksum data, and that > the actual application (which is not bittorrent, but e.g. mplayer, > imagemagick or wodim) need to verify these. > However this is not the case. 
ISO 9660 includes built-in ECC, it would be impractical for usage on removable optical media otherwise. JPEG and MP4 are irrelevant because in both cases, the average person can't detect corruption caused by single bit errors. Bittorrent itself properly verifies the downloads like it should. > In fact, it's quite usual and proper to have the lower layer handle > stuff which the higher layer don't have any real direct interest in > (except for the case, that the lower layer doesn't do it). Except that data integrity is obviously something the higher layers _do_ have interest in. > > >> The point of a filesystem is >> to store data within the integrity guarantees provided by the >> hardware, >> possibly with some additional protection > If you're convinced by that, you should probably propose that btrfs > removes data checksumming altogether. > I guess you won't make much friends with that idea ;) I think you missed the bit about 'possibly with some additional protection'. I really could have worded that better, but that's somewhat irrelevant. > > > >> In the case of stuff like torrents and such, all the good software >> for >> working with them has an option to verify the file after downloading. > Not sure what you mean with "and such", if it's again VMs, and DBs (the > IMHO actually more important use case than file sharing), I showed you > in the last mail, that non of these do any verification by default, and > half of the important ones don't even support it, with nothing on the > horizon that this would change (probably because they argue just the > other way round than you do: the fs should handle data integrity, and > ZFS and btrfs give them partially right ;-) ). > And I'm not sure with torrents, but I'd have suspected once the file's > downloaded completely, any checksumming data is no longer kept. If you keep the torrent file (even if it's kept loaded in the software), you have the checksum, as that's part of the identifier that is used to fetch the file. 
> If my guess is correct, even the torrent software doesn't really do overall data integrity protection, but just until the download is finished; at least this used to be the case with the other P2P network softwares.
Any good torrent software will let you verify the download after the fact, assuming you still have the torrent running (because if the verification fails, it will then go and re-download the failed blocks).

> Thanks for the discussion so far :)
> It actually made me just more confident that no-CoWed data checksumming should work and is actually needed ;)
If you're so convinced it's necessary, C is not hard to learn, and patches would go a long way towards getting this in the kernel. Whether I agree with it or not, if patches get posted, I'll provide the same degree of review as I would for any other feature (and even give my Tested-by, assuming it passes xfstests and the various edge cases not in xfstests that I throw at anything I test).
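
The false-positive concern argued over in this message can be made concrete with a toy model. The following is a minimal sketch and not btrfs code: Disk, write_in_place and write_cow are hypothetical names, the "crash" is simulated by returning before the checksum or pointer update, and zlib.crc32 (plain CRC-32) stands in for the CRC32C that btrfs uses. It also ignores torn writes and the fact that btrfs metadata is itself CoW, so it illustrates only the ordering argument.

```python
import zlib


class Disk:
    """Toy model: block storage plus metadata (live-block pointer and checksum)."""

    def __init__(self, data: bytes):
        self.blocks = {0: data}        # physical location -> contents
        self.pointer = 0               # metadata: where the live data lives
        self.csum = zlib.crc32(data)   # metadata: checksum of the live data

    def read(self):
        """Return the live data and whether it matches the stored checksum."""
        data = self.blocks[self.pointer]
        return data, zlib.crc32(data) == self.csum


def write_in_place(disk: Disk, new: bytes, crash_before_csum: bool = False) -> None:
    """nodatacow-style update: overwrite the data, then update the checksum."""
    disk.blocks[disk.pointer] = new
    if crash_before_csum:
        return                         # simulated crash: checksum is now stale
    disk.csum = zlib.crc32(new)


def write_cow(disk: Disk, new: bytes, crash_before_commit: bool = False) -> None:
    """CoW-style update: write elsewhere, then switch pointer and checksum together."""
    new_loc = max(disk.blocks) + 1
    disk.blocks[new_loc] = new
    if crash_before_commit:
        return                         # simulated crash: old pointer and checksum remain
    disk.pointer, disk.csum = new_loc, zlib.crc32(new)


# In-place write interrupted before the checksum update:
nocow = Disk(b"old")
write_in_place(nocow, b"new", crash_before_csum=True)
nocow_data, nocow_ok = nocow.read()

# CoW write interrupted before the commit:
cow = Disk(b"old")
write_cow(cow, b"new", crash_before_commit=True)
cow_data, cow_ok = cow.read()
```

In this model the interrupted in-place write leaves intact new data that fails verification, while the interrupted CoW write leaves old data that still verifies. How often the first case would occur in practice is exactly the point the thread disagrees on, and the sketch says nothing about that.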
* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-21 13:36 ` Austin S. Hemmelgarn
@ 2015-12-22  9:12 ` Duncan
  2015-12-22 12:16 ` Austin S. Hemmelgarn
From: Duncan @ 2015-12-22 9:12 UTC
To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as excerpted:

> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>> nodatacow only [avoids fragmentation] if the file is pre-allocated, if it isn't, then it still ends up fragmented.
>> Hmm is that "it may end up fragmented" or an "it will definitely"? Cause I'd have hoped, that if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.
> If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated.

Does the 30-second-by-default commit window (and the similarly 30-second-default dirty-flush time at the VFS level) modify this at all? It has been my assumption that same-file writes accumulated during this time should merge, increasing efficiency and decreasing fragmentation (both with and without nocow), tho of course further writes outside this 30-second window will likely trigger it, if other files have been written in parallel or in the meantime.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-22  9:12 ` Duncan
@ 2015-12-22 12:16 ` Austin S. Hemmelgarn
From: Austin S. Hemmelgarn @ 2015-12-22 12:16 UTC
To: Duncan, linux-btrfs

On 2015-12-22 04:12, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as excerpted:
>
>> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>>> nodatacow only [avoids fragmentation] if the file is pre-allocated, if it isn't, then it still ends up fragmented.
>>> Hmm is that "it may end up fragmented" or a "it will definitely? Cause I'd have hoped, that if nothing else had been written in the meantime, btrfs would perhaps try to write next to the already allocated blocks.
>> If there are multiple files being written, then there is a relatively high probability that they will end up fragmented if they are more than about 64k and aren't pre-allocated.
>
> Does the 30-second-by-default commit window (and similarly 30-second-default dirty-flush-time at the VFS level) modify this at all? It has been my assumption that same-file writes accumulated during this time should merge, increasing efficiency and decreasing fragmentation (both with and without nocow), tho of course further writes outside this 30-second window will likely trigger it, if other files have been written in parallel or in the mean time.

I think it does, but not much, and it depends on the workload. I do notice less fragmentation on the filesystems where I increase the commit window, and more on the ones where I decrease it, but the difference is pretty small as long as you use something reasonable (I've never tested anything higher than 300 seconds, and I rarely go above 60).

My guess, based on what the commit window is for (namely, it's the amount of time the log tree gets updated before forcing a transaction to be committed), would be that it has less effect if stuff is regularly calling fsync().
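
As a small illustration of the pre-allocation point made earlier in the thread, a file's space can be reserved up front before any data is written. This is a minimal sketch, assuming a Linux/POSIX system where os.posix_fallocate is available. The helper name preallocate, the file name image.raw and the 1 MiB size are hypothetical choices for illustration, and whether the resulting extents are actually contiguous depends on the filesystem and its free space.

```python
import os
import tempfile


def preallocate(path: str, size: int) -> None:
    """Reserve `size` bytes for `path` before any data is written to it."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        # Allocate the byte range [0, size); the file grows to `size` if shorter.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


# Hypothetical example: reserve 1 MiB for a VM image in a scratch directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "image.raw")
preallocate(demo_path, 1024 * 1024)
demo_size = os.path.getsize(demo_path)
```

The sketch covers only the allocation step. Marking the file nodatacow (for instance with chattr +C while it is still empty) is a separate step that is not shown here.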
Thread overview: 12+ messages
2015-12-14  4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer
2015-12-14  6:42 ` Russell Coker
2015-12-15  1:02 ` Christoph Anton Mitterer
2015-12-14 14:16 ` Austin S. Hemmelgarn
2015-12-15  3:15 ` Christoph Anton Mitterer
2015-12-15 16:00 ` Austin S. Hemmelgarn
2015-12-16  9:15 ` Duncan
2015-12-16  9:55 ` Duncan
2015-12-17  2:09 ` Christoph Anton Mitterer
2015-12-21 13:36 ` Austin S. Hemmelgarn
2015-12-22  9:12 ` Duncan
2015-12-22 12:16 ` Austin S. Hemmelgarn