dear developers, can we have notdatacow + checksumming, plz?

Linux Btrfs filesystem development

* dear developers, can we have notdatacow + checksumming, plz?
@ 2015-12-14  4:59 Christoph Anton Mitterer
  2015-12-14  6:42 ` Russell Coker
  2015-12-14 14:16 ` Austin S. Hemmelgarn
  0 siblings, 2 replies; 12+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-14  4:59 UTC (permalink / raw)
  To: linux-btrfs

(consider that question being asked with that face on: http://goo.gl/LQaOuA)

Hey.

I've had some discussions on the list these days about not having
checksumming with nodatacow (mostly with Hugo and Duncan).

They both basically told me it wouldn't be straightforwardly possible
without CoW, and Duncan thinks it may not be all that necessary, but
neither of them could give me really hard arguments why it cannot work
(or perhaps I was just too stupid to understand them ^^)... while at the
same time I think that it would generally be of utmost importance to
have checksumming (real-world examples below).

Also, I remember that in 2014, Ted Ts'o told me that there are some
plans ongoing to get data checksumming into ext4, with possibly even
some guy at RH actually doing it sooner or later.

Since these threads were rather admin-work-centric, developers may have
skipped them; therefore, I decided to write down some thoughts & ideas,
label them with a more eye-catching subject, and give them some more
attention.
O:-)

1) Motivation: why it makes sense to have checksumming (especially
in the nodatacow case)

I think of all major btrfs features I know of (apart from the CoW
itself and having things like reflinks), checksumming is perhaps the
one that distinguishes it the most from traditional filesystems.

Sure we have snapshots, multi-device support and compression - but we
could have had that as well with LVM and software/hardware RAID... (and
ntfs supported compression IIRC ;) ).
Of course, btrfs does all that in a much smarter way, I know, but it's
nothing generally new.
The *data* checksumming at filesystem level, to my knowledge, is
however. Especially that it's always verified. Awesome. :-)

When one starts to get a bit deeper into btrfs (from the admin/end-user 
side) one sooner or later stumbles across the recommendation/need to
use nodatacow for certain types of data (DBs, VM images, etc.) and the
reason, AFAIU, being the inherent fragmentation that comes along with
the CoW, which is especially noticeable for those types of files with
lots of random internal writes.

Now Duncan implied that this could improve in the future, with the
auto-defragmentation getting (even) better, defrag becoming usable
again for those that do snapshots or reflinked copies, and btrfs itself
generally maturing more and more.
But I kinda wonder to what extent one will really be able to solve
what seems to me a CoW-inherent "problem"...
Even *if* one can make the auto-defrag much smarter, it would still
mean that such files, like big DBs, VMs, or scientific datasets that
are internally rewritten, may get more or less constantly defragmented.
That may be quite undesired...
a) for performance reasons (when I consider our research software which
often has IO as the limiting factor and where we want as much IO being
used by actual programs as possible)...
b) SSDs...
Not really sure about that; btrfs seems to enable the autodefrag even
when an SSD is detected,... what is it doing? Placing the block in a
smart way on different chips so that accesses can be better
parallelised by the controller?
Anyway, (a) could already be argument enough not to solve the
problem by a smart [auto-]defrag, should that actually be implemented.

So I think having nodatacow is great and not just a workaround till
everything else gets better at handling these cases.
Thus, checksumming, which is such a vital feature, should also be
possible for that.

Duncan also mentioned that in some of those cases, the integrity is
already protected by the application layer, making it less important to
have it at the fs-layer.
Well, this may be true for file-sharing protocols, but I'm not aware
that relational DBs really do checksumming of the data.
They have journals, of course, but these protect against crashes, not
against silent block errors and the like.
And I'm not aware that VM hypervisors do checksumming (but perhaps
I've just missed that).

Here I can give a real-world example, from the Tier-2 that I run for
LHC at work/university.
We have large amounts of storage (perhaps not as large as what Google
and Facebook have, or what the NSA stores about us)... but it's still
some ~ 2PiB, or a bit more.
That's managed with some special storage management software called
dCache. dCache even stores checksums, but per file, so that means for
normal reads, these cannot be verified (well technically it's
supported, but with our usual file sizes, this is not working) so what
remains are scrubs.
For the two PiB, we have some... roughly 50-60 nodes, each with
something between 12 and 24 disks, usually in either one or two RAID6
volumes, all different kinds of hard disks.
And we do run these scrubs quite rarely, since it costs IO that could
be used for actual computing jobs (a problem that wouldn't be there
with how btrfs calculates the sums on read, the data is then read
anyway)... so likely there are even more errors that are just never
noticed, because the datasets are removed again, before being scrubbed.

Long story short, it does happen every now and then that a scrub shows
file errors where neither the RAID was broken, nor were there any block
errors reported by the disks, nor anything suspicious in SMART.
In other words, silent block corruption.

One may rely on the applications to do integrity protection, but I
think that's not realistic, and perhaps that shouldn't be their task
anyway (at least not when it's about storage device block errors and
the like).

I don't think it's on the horizon that things like DBs or large
scientific data files do their own integrity protection (i.e. one that
protects against bad blocks, and not just journalling that preserves
consistency in case of crashes).
And handling that on the fs level is anyway quite nice, I think.
It doesn't mean that countless applications need to handle this on the
application layer, making it configurable whether it should be enabled
(for integrity protection) or disabled (for more speed), each of them
writing a lot of code for that.
If we can control that on the fs layer, by setting datasum/nodatasum,
everything needed is already there - except that, as of now, nodatacowed
stuff is excluded in btrfs.

2) Technical

Okay the following is obviously based on my naive view of how things
could work, which may not necessarily go well with how an actual fs
developer sees things ;-)

As said in the introduction, I can't quite believe that data
checksumming should in principle be possible for ext4, but not for
btrfs non-CoWed parts.

Duncan & Hugo said the reason is basically that it cannot do checksums
with no-CoW, because there's no guarantee that the fs doesn't end up
inconsistent...

But, AFAIU, not doing CoW, while not having a journal (or does it have
one for these cases???) almost certainly means that the data (not
necessarily the fs) will be inconsistent in case of a crash during a
no-CoWed write anyway, right?
Wouldn't it be basically like ext2?

Or we have the case of multi-device, e.g. RAID1, multiple copies of the
same blocks, a crash has happened during writing such (no-CoWed and no-
checksummed)...
Again, it's almost certain that at least one (maybe even both) of the
blocks contains garbage and likely (at least a 50% chance) we get that
one when the actual read happens later (I was told btrfs would behave
in these cases like e.g. MD RAID does... deliver what the first
readable block said).

If btrfs would calculate checksums and write them e.g. after or before
the actual data was written,... what would be the worst that could
happen (in my naive understanding of course ;-) ) at a crash?
- I'd say either one is lucky, and checksum and data match.
  Yay.
- Or it doesn't match, which could boil down to the following two
  cases:
  - the data wasn't written out correctly and is actually garbage
    => then we can be happy, that the checksum wouldn't match and we'd 
       get an error
  - the data was written out correctly, but before the csum was
    written the system crashed, so the csum would now tell us that the
    block is bad, while in reality it isn't.
    or the other way round:
    the csum was written out (completely)... and no data was written
    at all before the system crashed (so the old block would be still 
    completely there)
    => in both cases: so what? Having that particular case happening
       is probably far less likely, than csumming actually detecting a
       bad block, or not completely written data in case of a crash.
       (Not to talk about all the cases where nothing crashes, and
       where we simply would want to detect block errors, bus errors,
       etc.)
=> Of course it wouldn't be as nice as in CoW, where it could
   simply take the most recent consistent state of that block, but
   still way better than:
   - delivering bogus data to the application in n other cases
   - not being able to decide which of m block copies is valid, if a
     RAID is scrubbed

And as said before, AFAIU, nodatacow'ed files have no journal in btrfs
as in ext3/4, so it's basically the case anyway that such files, when
written during a crash, may end up in any state, right? Which makes not
having a csum sound even worse, since nothing tells us that this file is
possibly bad.

Not having checksumming seems to be especially bad in the multi-device
case... what happens when one runs a scrub? AFAIU, it simply does what
e.g. MD does: taking the first readable block, writing it to any
others, thereby possibly destroying the actually good one?
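
The difference a checksum would make to such a scrub can be sketched as
follows. This is a hypothetical model, not btrfs or MD code: the
function names are invented, blocks are 4 bytes, and zlib.crc32 stands
in for the real checksum. Whether scrub really behaves like the first
function for unchecksummed data is exactly the question asked above.

```python
import zlib

def scrub_without_csum(copies):
    # No way to tell copies apart: the first readable one wins and is
    # written over all others (assumed behaviour, as described above).
    first = copies[0]
    return [first for _ in copies]

def scrub_with_csum(copies, expected):
    # Pick the first copy whose checksum matches and repair the rest.
    good = next((c for c in copies if zlib.crc32(c) == expected), None)
    if good is None:
        raise OSError("no copy matches the checksum")
    return [good for _ in copies]

good = b"GOOD"
bad = b"G0OD"                  # silently corrupted copy
expected = zlib.crc32(good)    # checksum stored when the block was written
copies = [bad, good]           # the corrupted copy happens to come first

print(scrub_without_csum(copies))         # both copies end up corrupted
print(scrub_with_csum(copies, expected))  # both copies end up good
```
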

Not sure about whether the following would make any practical sense:
If data checksumming would work for nodatacow, then maybe some people
may even choose to run btrfs in CoW1 mode,.. they still could have most
fancy features from btrfs (checksumming, snapshots, perhaps even
refcopy?) but unless snapshots or refcopies are explicitly made, btrfs
doesn't do CoW.

Well, thanks for spending (hopefully not wasting ;-) ) your time on
reading my X-Mas wish ;)

Cheers,
Chris.


* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-14  4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer
@ 2015-12-14  6:42 ` Russell Coker
  2015-12-15  1:02   ` Christoph Anton Mitterer
  2015-12-14 14:16 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 12+ messages in thread
From: Russell Coker @ 2015-12-14  6:42 UTC (permalink / raw)
  To: Christoph Anton Mitterer; +Cc: linux-btrfs

On Mon, 14 Dec 2015 03:59:18 PM Christoph Anton Mitterer wrote:
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
> 
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).

My understanding of BTRFS is that the metadata referencing data blocks has the 
checksums for those blocks, then the blocks which link to that metadata (EG 
directory entries referencing file metadata) have checksums of those.  For each 
metadata block there is a new version that is eventually linked from a new 
version of the tree root.

This means that the regular checksum mechanisms can't work with nocow data.  A 
filesystem can have checksums just pointing to data blocks but you need to 
cater for the case where a corrupt metadata block points to an old version of 
a data block and matching checksum.  The way that BTRFS works with an entire 
checksummed tree means that there's no possibility of pointing to an old 
version of a data block.
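
The chaining described above can be illustrated with a toy model. It is
a sketch under loud assumptions: one metadata block holding the
checksums of two data blocks, one root holding the checksum of that
metadata block, and SHA-256 as a stand-in hash. The helper names are
invented and this is not the BTRFS on-disk format.

```python
import hashlib

def h(b: bytes) -> str:
    # Stand-in checksum (assumption).
    return hashlib.sha256(b).hexdigest()

def build_root(data_blocks) -> str:
    # The metadata block stores one checksum per data block...
    metadata_block = "".join(h(b) for b in data_blocks).encode()
    # ...and the root stores the checksum of the metadata block.
    return h(metadata_block)

v1 = [b"block0-v1", b"block1-v1"]
root_v1 = build_root(v1)

# CoW update: block0 gets a new version, so a new metadata block and a
# new root are written as well.
v2 = [b"block0-v2", b"block1-v1"]
root_v2 = build_root(v2)

# A tree that still points at the old version of block0 cannot verify
# against the new root; it only matches the old one.
stale = [b"block0-v1", b"block1-v1"]
stale_matches_new = build_root(stale) == root_v2
print(stale_matches_new)
```

With nocow data the block is overwritten in place, so there is no
separate "old version" for the tree to distinguish.
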

The NetApp published research into hard drive errors indicates that they are 
usually in small numbers and located in small areas of the disk.  So if BTRFS 
had a nocow file with any storage method other than dup you would have metadata 
and file data far enough apart that they are not likely to be hit by the same 
corruption (and the same thing would apply with most Ext4 Inode tables and 
data blocks).  I think that a file mode where there were checksums on data 
blocks with no checksums on the metadata tree would be useful.  But it would 
require a moderate amount of coding and there's lots of other things that the 
developers are working on.

-- 
My Main Blog         http://etbe.coker.com.au/
My Documents Blog    http://doc.coker.com.au/


* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-14  6:42 ` Russell Coker
@ 2015-12-15  1:02   ` Christoph Anton Mitterer
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-15  1:02 UTC (permalink / raw)
  To: Russell Coker; +Cc: linux-btrfs


On Mon, 2015-12-14 at 17:42 +1100, Russell Coker wrote:
> My understanding of BTRFS is that the metadata referencing data
> blocks has the 
> checksums for those blocks, then the blocks which link to that
> metadata (EG 
> directory entries referencing file metadata) has checksums of those.
You mean basically, that all metadata is chained, right?

> For each 
> metadata block there is a new version that is eventually linked from
> a new 
> version of the tree root.
> 
> This means that the regular checksum mechanisms can't work with nocow
> data.  A 
> filesystem can have checksums just pointing to data blocks but you
> need to 
> cater for the case where a corrupt metadata block points to an old
> version of 
> a data block and matching checksum.  The way that BTRFS works with an
> entire 
> checksumed tree means that there's no possibility of pointing to an
> old 
> version of a data block.
Hmm I'm not sure whether I understand that (or better said, I'm
probably sure I don't :D).

AFAIU, the metadata is always CoWed, right? So when a nodatacow file is
written, I'd assume its mtime was updated, which already leads to
CoWing of metadata... just that now, the checksums should be written as
well.

If the metadata block is corrupt, then shouldn't that be noticed via the
csums on it?

And you said "The way that BTRFS works with an entire checksumed tree
means that there's no possibility of pointing to an old version of a
data block."... how would that work for nodatacow'ed blocks? If there
is a crash it cannot know whether it was still the old block or the new
one or any garbage in between?!


> The NetApp published research into hard drive errors indicates that
> they are 
> usually in small numbers and located in small areas of the disk.  So
> if BTRFS 
> had a nocow file with any storage method other than dup you would
> have metadata 
> and file data far enough apart that they are not likely to be hit by
> the same 
> corruption (and the same thing would apply with most Ext4 Inode
> tables and 
> data blocks).
Well put aside any such research (whose results aren't guaranteed to be
always the case)... but that's just one reason from my motivation why
I've said checksums for no-CoWed files would be great (I used the
multi-device example though, not DUP).


> I think that a file mode where there were checksums on data 
> blocks with no checksums on the metadata tree would be useful.  But
> it would 
> require a moderate amount of coding
Do you mean in general, or having this as a mode for nodatacow'ed
files?
Losing the metadata checksumming doesn't really seem much more
appealing than not having data checksumming :-(


> and there's lots of other things that the 
> developers are working on.
Sure, I just wanted to bring this to their attention... I already
imagined that they wouldn't drop their current work to do that, just
because of me whining for it ;-)


Thanks,
Chris.


* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-14  4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer
  2015-12-14  6:42 ` Russell Coker
@ 2015-12-14 14:16 ` Austin S. Hemmelgarn
  2015-12-15  3:15   ` Christoph Anton Mitterer
  1 sibling, 1 reply; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-14 14:16 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2015-12-13 23:59, Christoph Anton Mitterer wrote:
> (consider that question being asked with that face on: http://goo.gl/LQaOuA)
>
> Hey.
>
> I've had some discussions on the list these days about not having
> checksumming with nodatacow (mostly with Hugo and Duncan).
>
> They both basically told me it wouldn't be straight possible with CoW,
> and Duncan thinks it may not be so much necessary, but none of them
> could give me really hard arguments, why it cannot work (or perhaps I
> was just too stupid to understand them ^^)... while at the same time I
> think that it would be generally utmost important to have checksumming
> (real world examples below).
>
> Also, I remember that in 2014, Ted Ts'o told me that there are some
> plans ongoing to get data checksumming into ext4, with possibly even
> some guy at RH actually doing it sooner or later.
>
> Since these threads were rather admin-work-centric, developers may have
> skipped it, therefore, I decided to write down some thoughts&ideas
> label them with a more attracting subject and give it some bigger
> attention.
> O:-)
>
>
>
>
> 1) Motivation why, it makes sense to have checksumming (especially also
> in the nodatacow case)
>
>
> I think of all major btrfs features I know of (apart from the CoW
> itself and having things like reflinks), checksumming is perhaps the
> one that distinguishes it the most from traditional filesystems.
>
> Sure we have snapshots, multi-device support and compression - but we
> could have had that as well with LVM and software/hardware RAID... (and
> ntfs supported compression IIRC ;) ).
> Of course, btrfs does all that in a much smarter way, I know, but it's
> nothing generally new.
> The *data* checksumming at filesystem level, to my knowledge, is
> however. Especially that it's always verified. Awesome. :-)
>
>
> When one starts to get a bit deeper into btrfs (from the admin/end-user
> side) one sooner or later stumbles across the recommendation/need to
> use nodatacow for certain types of data (DBs, VM images, etc.) and the
> reason, AFAIU, being the inherent fragmentation that comes along with
> the CoW, which is especially noticeable for those types of files with
> lots of random internal writes.
It is worth pointing out that in the case of DB's at least, this is 
because at least some of them do COW internally to provide the 
transactional semantics that are required for many workloads.
>
> Now duncan implied, that this could improve in the future, with the
> auto-defragmentation getting (even) better, defrag becoming usable
> again for those that do snapshots or reflinked copies and btrfs itself
> generally maturing more and more.
> But I kinda wonder to what extent one will be really able to solve
> that, what seems to me a CoW-inherent "problem",...
> Even *if* one can make the auto-defrag much smarter, it would still
> mean that such files, like big DBs, VMs, or scientific datasets that
> are internally rewritten, may get more or less constantly defragmented.
> That may be quite undesired...
> a) for performance reasons (when I consider our research software which
> often has IO as the limiting factor and where we want as much IO being
> used by actual programs as possible)...
There are other things that can be done to improve this.  I would assume 
of course that you're already doing some of them (stuff like using 
dedicated storage controller cards instead of the stuff on the 
motherboard), but some things often get overlooked, like actually taking 
the time to fine-tune the I/O scheduler for the workload (Linux has 
particularly brain-dead default settings for CFQ, and the deadline I/O 
scheduler is only good in hard-real-time usage or on small hard drives 
that actually use spinning disks).
> b) SSDs...
> Not really sure about that; btrfs seems to enable the autodefrag even
> when an SSD is detected,... what is it doing? Placing the block in a
> smart way on different chips so that accesses can be better
> parallelised by the controller?
This really isn't possible with an SSD.  Except for NVMe and Open 
Channel SSD's, they use the same interfaces as a regular hard drive, 
which means you get absolutely no information about the data layout on 
the device.

The big argument for defragmenting a SSD is that it makes it such that 
you require fewer I/O requests to the device to read a file, and in most 
cases, the device will outlive its usefulness because of performance 
long before it dies due to wearing out the flash storage.
> Anyway, (a) is could be already argument enough, not to run solve the
> problem by a smart-[auto-]defrag, should that actually be implemented.
>
> So I think having notdatacow is great and not just a workaround till
> everything else gets better to handle these cases.
> Thus, checksumming, which is such a vital feature, should also be
> possible for that.
The problem is not entirely the lack of COW semantics, it's also the 
fact that it's impossible to implement an atomic write on a hard disk. 
If we could tell the disk 'ensure that this set of writes either all 
happen, or none of them happen', then we could do checksumming without 
using COW in the filesystem safely, except that that would require the 
disk to either do COW, or use the block level equivalent of a log 
structured filesystem, thus pushing the issue further down the storage 
stack.
>
>
> Duncan also mention that in some of those cases, the integrity is
> already protected by the application layer, making it less important to
> have it at the fs-layer.
> Well, this may be true for file-sharing protocols, but I wouldn't know
> that relational DBs really do cheksuming of the data.
All the ones I know of except GDBM and BerkDB do in fact provide the 
option of checksumming.  It's pretty much mandatory if you want to be 
considered for usage in financial, military, or medical applications.
> They have journals, of course, but these protect against crashes, not
> against silent block errors and that like.
> And I wouldn't know that VM hypervisors do checksuming (but perhaps
> I've just missed that).
>
> Here I can give a real-world example, from the Tier-2 that I run for
> LHC at work/university.
> We have large amounts of storage (perhaps not as large as what Google
> and Facebook have, or what the NSA stores about us)... but it's still
> some ~ 2PiB, or a bit more.
> That's managed with some special storage management software called
> dCache. dCache even stores checksums, but per file, so that means for
> normal reads, these cannot be verified (well technically it's
> supported, but with our usual file sizes, this is not working) so what
> remains are scrubs.
> For The two PiB, we have some... roughly 50-60 nodes, each with
> something between 12 and 24 disks, usually in either one or two RAID6
> volumes, all different kinds of hard disks.
> And we do run these scrubs quite rarely, since it costs IO that could
> be used for actual computing jobs (a problem that wouldn't be there
> with how btrfs calculates the sums on read, the data is then read
> anyway)... so likely there are even more errors that are just never
> noticed, because the datasets are removed again, before being scrubbed.
>
>
> Long story short, it does happen every now and then, that a scrub shows
> file errors, for neither the RAID was broken, nor there were any block
> errors reported by the disks, or anything suspicious in SMART.
> In other words, silent block corruption.
Or a transient error in system RAM that ECC didn't catch, or an 
undetected error in the physical link layer to the disks, or an error in 
the disk cache or controller, or any number of other things.  BTRFS 
could only protect against some cases, not all (for example, if you have 
a big enough error in RAM that ECC doesn't catch it, you've got serious 
issues that just about nothing short of a cold reboot can save you from).
>
> One may rely on the applications to do integrity protection, but I
> think that's not realistic, and perhaps that shouldn't be their task
> anyway (at least not when it's about storage device block errors and
> that like).
That depends, if the application has data safety requirements above and 
beyond what the OS can provide, then it very much is their job to ensure 
those requirements are met.
>
> I don't think it's on the horizon that things like DBs or large
> scientific data files do their own integrity protection (i.e. one that
> protects against bad blocks, and not just journalling that preserves
> consistency in case of crashes).
Actually, a lot of them do in fact do this (or at least, many database 
systems do), precisely because most existing filesystems don't provide 
guarantees of data consistency without a ridiculous hit to performance.
> And handling that on the fs level is anyway quite nice, I think.
> It doesn't mean that countless applications need to handle this on the
> application layer, making it configurable whether it should be enabled
> (for integrity protection) or disabled (for more speed), each of them
> writing a lot of code for that.
> If we can control that on the fs layer, by setting datasum/nodatasum,
> all needed is already there - except, that as of now, nodatacowed stuff
> is excluded in btrfs.
>
>
>
>
>
> 2) Technical
>
>
> Okay the following is obviously based on my naive view of how things
> could work, which may not necessarily go well with how an actual fs
> developer sees things ;-)
>
> As said in the introduction, I can't quite believe that data
> checksumming should in principle be possible for ext4, but not for
> btrfs non-CoWed parts.
Except that for this to work safely, ext4 would have to add COW support, 
which I think they added for the in-line encryption stuff (in-line data 
transformations like encryption or compression have the exact same 
issues that data checksumming does when run on a non-COW filesystem).
>
> Duncan&Hugo said, the reason is basically it cannot do checksums with
> no-CoW, because there's no guarantee that the fs doesn't end up
> inconsistently...
Exactly.
>
> But, AFAIU, not doing CoW, while not having a journal (or does it have
> one for these cases???) almost certainly means that the data (not
> necessarily the fs) will be inconsistent in case of a crash during a
> no-CoWed write anyway, right?
> Wouldn't it be basically like ext2?
Kind of, but not quite.  Even with nodatacow, metadata is still COW, 
which is functionally as safe as a traditional journaling filesystem 
like XFS or ext4.  Absolute worst case scenario for both nodatacow on 
BTRFS, and a traditional journaling filesystem, the contents of the file 
are inconsistent.  However, almost all of the things that are 
recommended use cases for nodatacow (primarily database files and VM 
images) have some internal method of detecting and dealing with 
corruption (because of the traditional filesystem semantics ensuring 
metadata consistency, but not data consistency).
>
> Or we have the case of multi-device, e.g. RAID1, multiple copies of the
> same blocks, a crash has happened during writing such (no-CoWed and no-
> checksummed)...
> Again it's almost certainly that at least one (maybe even both) of the
> blocks contains garbage and likely (at least a 50% chance) we get that
> one when the actual read happens later (I was told btrfs would behave
> in these cases like e.g MD RAID does,... deliver what the first
> readable block said).
>
> If btrfs would calculate checksums and write them e.g. after or before
> the actual data was written,... what would be the worst that could
> happen (in my naive understanding of course ;-) ) at a crash?
> - I'd say either one is lucky, and checksum and data matches.
>    Yay.
> - Or it doesn't match, which could boil down to the following two
>    cases:
>    - the data wasn't written out correctly and is actually garbage
>      => then we can be happy, that the checksum wouldn't match and we'd
>         get an error
>    - the data was written out correctly, but before the csum was
>      written the system crashed, so the csum would now tell us that the
>      block is bad, while in reality it isn't.
>      or the other way round:
>      the csum was written out (completely)... and no data was written
>      at all before the system crashed (so the old block would be still
>      completely there)
>      => in both cases: so what? Having that particular case happening
>         is probably far less likely, than csumming actually detecting a
>         bad block, or not completely written data in case of a crash.
>         (Not to talk about all the cases where nothing crashes, and
>         where we simply would want to detect block errors, bus errors,
>         etc.)
There is another case to consider, the data got written out, but the 
crash happened while writing the checksum (so the checksum was partially 
written, and is corrupt).  This means we get a false positive on a disk 
error that isn't there, even when the data is correct, and that should 
be avoided if at all possible.

Also, because of how disks work, and the internal layout of BTRFS, it's 
a lot more likely than you think that the data would be written but the 
checksum wouldn't.  The checksum isn't part of the data block, nor is it 
stored with it, it's actually a part of the metadata block that stores 
the layout of the data for that file on disk.  Because of the nature of 
the stuff that nodatacow is supposed to be used for, it's almost always 
better to return bad data than it is to return no data (if you can get 
any data, then it's usually possible to recover the database file or VM 
image, but if you get none, it's a lot harder to recover the file).
> => Of course it wouldn't be as nice as in CoW, where it could
>     simply take the most recent consistent state of that block, but
>     still way better than:
>     - delivering bogus data to the application in n other cases
>     - not being able to decide which of m block copies is valid, if a
>       RAID is scrubbed
This gets _really_ scarily dangerous for a RAID setup, because we 
_absolutely_ can't ensure consistency between disks without using COW. 
As of right now, we dispatch writes to disks one at a time (although 
this would still be just as dangerous even if we dispatched writes in 
parallel), so if we crash it's possible that one disk would hold the old 
data, one would hold the new data, and _both_ would have correct 
checksums, which means that we would non-deterministically return one 
block or the other when an application tries to read it, and which block 
we return could change _each_ time the read is attempted, which 
absolutely breaks the semantics required of a filesystem on any modern 
OS (namely, the file won't change unless something writes to it).
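
A minimal sketch of that failure mode, under generous assumptions: each
mirror updates its data and checksum together atomically (which real
disks do not guarantee), writes go to the mirrors one after another,
and zlib.crc32 stands in for the checksum. The Mirror class and
write_raid1 helper are hypothetical names, not BTRFS code.

```python
import zlib

class Mirror:
    def __init__(self, data: bytes):
        self.data = data
        self.csum = zlib.crc32(data)

    def write(self, data: bytes) -> None:
        # Assumption: data and checksum change together on one device.
        self.data = data
        self.csum = zlib.crc32(data)

    def valid(self) -> bool:
        return zlib.crc32(self.data) == self.csum

def write_raid1(mirrors, data, crash_after=None):
    # Dispatch the in-place write to one mirror at a time; a crash
    # stops the loop before the remaining mirrors are updated.
    for i, m in enumerate(mirrors):
        if crash_after is not None and i >= crash_after:
            return
        m.write(data)

mirrors = [Mirror(b"AAAA"), Mirror(b"AAAA")]
write_raid1(mirrors, b"BBBB", crash_after=1)   # crash after first mirror

all_valid = all(m.valid() for m in mirrors)
contents = {m.data for m in mirrors}
print(all_valid, contents)
```

Both copies verify, yet they differ, so the checksum alone cannot say
which one a read should return.
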
>
> And as said before, AFAIU, nodatacow'ed files have no journal in btrfs
> as in ext3/4, so it's basically anyway that such files, when written
> during a crash, may end up in any state, right? Which makes not having
> a csum sound even worse, since nothing tells that this file is possibly
> bad.
As I stated above, most of the stuff that nodatacow is intended for 
already has its own built-in protection.  No self-respecting RDBMS 
would be caught dead without internal consistency checks, and they all 
do COW internally anyway (because it's required for atomic transactions, 
which are an absolute requirement for database systems), and in fact 
that's part of why performance is so horrible for them on a COW 
filesystem.  As far as VMs go, either the disk image should have its 
own internal consistency checks (for example, qcow2 format, used by 
QEMU, which also does COW internally), or the guest OS should have such 
checks.
>
> Not having checksumming seems to be especially bad in the multi-device
> case... what happens when one runs a scrub? AFAIU, it simply does what
> e.g. MD does: taking the first readable block, writing it to any
> others, thereby possibly destroying the actually good one?
AFAICT from the code, yes, that is the case.
>
> Not sure about whether the following would make any practical sense:
> If data checksumming would work for nodatacow, then maybe some people
> may even choose to run btrfs in CoW1 mode,.. they still could have most
> fancy features from btrfs (checksumming, snapshots, perhaps even
> refcopy?) but unless snapshots or refcopies are explicitly made, btrfs
> doesn't do CoW.
That might have some use when people _really_ don't care about 
consistency across a crash (for example, when it's a filesystem that 
gets reinitialized every boot).



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-14 14:16 ` Austin S. Hemmelgarn
@ 2015-12-15  3:15   ` Christoph Anton Mitterer
  2015-12-15 16:00     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-15  3:15 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs


On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
> > When one starts to get a bit deeper into btrfs (from the admin/end-
> > user
> > side) one sooner or later stumbles across the recommendation/need
> > to
> > use nodatacow for certain types of data (DBs, VM images, etc.) and
> > the
> > reason, AFAIU, being the inherent fragmentation that comes along
> > with
> > the CoW, which is especially noticeable for those types of files
> > with
> > lots of random internal writes.
> It is worth pointing out that in the case of DB's at least, this is 
> because at least some of them do COW internally to provide the 
> transactional semantics that are required for many workloads.
Guess that also applies to some VM images then, IIRC qcow2 does CoW.



> > a) for performance reasons (when I consider our research software
> > which
> > often has IO as the limiting factor and where we want as much IO
> > being
> > used by actual programs as possible)...
> There are other things that can be done to improve this.  I would
> assume 
> of course that you're already doing some of them (stuff like using 
> dedicated storage controller cards instead of the stuff on the 
> motherboard), but some things often get overlooked, like actually
> taking 
> the time to fine-tune the I/O scheduler for the workload (Linux has 
> particularly brain-dead default settings for CFQ, and the deadline
> I/O 
> scheduler is only good in hard-real-time usage or on small hard
> drives 
> that actually use spinning disks).
Well sure, I think we've done most of this and have dedicated
controllers, at least of a quality that funding allows us ;-)
But regardless of how much one tunes and how good the hardware is: if
you'd then always lose a fraction of your overall IO, be it just 5%,
to defragging these types of files, one may actually want to avoid
this altogether, for which nodatacow seems *the* solution.


> The big argument for defragmenting a SSD is that it makes it such
> that 
> you require fewer I/O requests to the device to read a file
I had read about that too, but since I haven't had much personal
experience or measurements in that respect, I didn't list it :)


> The problem is not entirely the lack of COW semantics, it's also the
> fact that it's impossible to implement an atomic write on a hard
> disk. 
Sure... but that's just the same for the nodatacow writes of data.
(And the same, AFAIU, for CoW itself, just that we'd notice any
corruption in case of a crash due to the CoWed nature of the fs and
could go back to the last generation).


> > but I wouldn't know that relational DBs really do cheksuming of the
> > data.
> All the ones I know of except GDBM and BerkDB do in fact provide the 
> option of checksumming.  It's pretty much mandatory if you want to be
> considered for usage in financial, military, or medical applications.
Hmm I see... PostgreSQL seems to have it since 9.3 ... didn't know
that... only a 16-bit checksum but at least something.


> > Long story short, it does happen every now and then, that a scrub
> > shows
> > file errors, for neither the RAID was broken, nor there were any
> > block
> > errors reported by the disks, or anything suspicious in SMART.
> > In other words, silent block corruption.
> Or a transient error in system RAM that ECC didn't catch, or a 
> undetected error in the physical link layer to the disks, or an error
> in 
> the disk cache or controller, or any number of other things.
Well sure,... I was referring to these particular cases, where silent
block corruption was the most likely reason.
The data was reproducibly read identical, which probably rules out bad
RAM or controller, etc.


>   BTRFS 
> could only protect against some cases, not all (for example, if you
> have 
> a big enough error in RAM that ECC doesn't catch it, you've got
> serious 
> issues that just about nothing short of a cold reboot can save you
> from).
Sure, I haven't claimed, that checksumming for no-CoWed data is a
solution for everything.


> > But, AFAIU, not doing CoW, while not having a journal (or does it
> > have
> > one for these cases???) almost certainly means that the data (not
> > necessarily the fs) will be inconsistent in case of a crash during
> > a
> > no-CoWed write anyway, right?
> > Wouldn't it be basically like ext2?
> Kind of, but not quite.  Even with nodatacow, metadata is still COW, 
> which is functionally as safe as a traditional journaling filesystem 
> like XFS or ext4.
Sure, I was referring to the data part only, should have made that more
clear.


> Absolute worst case scenario for both nodatacow on 
> BTRFS, and a traditional journaling filesystem, the contents of the
> file 
> are inconsistent.  However, almost all of the things that are 
> recommended use cases for nodatacow (primarily database files and VM 
> images) have some internal method of detecting and dealing with 
> corruption (because of the traditional filesystem semantics ensuring 
> metadata consistency, but not data consistency).
What about VMs? At least a quick google search didn't give me any
results on whether there would be e.g. checksumming support for qcow2.
For raw images there surely is not.

And even if DBs do some checksumming now, it may be just a consequence
of that being missing in the filesystems.
As I've written somewhere else in the previous mail: it's IMHO much
better if one system takes care of this, where the code is well tested,
than each application doing its own thing.


> >    - the data was written out correctly, but before the csum was
> >      written the system crashed, so the csum would now tell us that
> > the
> >      block is bad, while in reality it isn't.
> There is another case to consider, the data got written out, but the
> crash happened while writing the checksum (so the checksum was
> partially 
> written, and is corrupt).  This means we get a false positive on a
> disk 
> error that isn't there, even when the data is correct, and that
> should 
> be avoided if at all possible.
I had covered that, and I've left it quoted above.
But as I've said before: that's one case out of many. How likely is it
that the crash happens exactly after a large data block has been
written, followed by a relatively tiny amount of checksum data?
I'd assume it's far more likely that the crash happens while writing
the data.

And regarding "reporting data to be in error, which is actually
correct"... isn't that what all journaling systems may do?
And, AFAIU, isn't that also what can happen in btrfs? The data was
already CoWed, but the metadata wasn't written out... so it would fall
back somehow - here's where the unicorn[0] does its job - to an older
generation?
So that would be nothing really new.


> Also, because of how disks work, and the internal layout of BTRFS,
> it's 
> a lot more likely than you think that the data would be written but
> the 
> checksum wouldn't.  The checksum isn't part of the data block, nor is
> it 
> stored with it, it's actually a part of the metadata block that
> stores 
> the layout of the data for that file on disk.
Well, it was clear to me that data+csum aren't stored sequentially on
disk. Are there any numbers from real studies on how often it would
happen that data is written correctly but not the metadata?
And even if such a study would show that - crashes aren't the only
problem we want to protect against here (silent block errors, bus
errors, etc).
I don't want to say crashes never happen, but in my practical
experience they don't happen that often either,...

Losing a few blocks of valid data in the rare case of crashes seems to
be a penalty worth paying, when one gains confidence in data integrity
in all other cases.


> Because of the nature of 
> the stuff that nodatacow is supposed to be used for, it's almost
> always 
> better to return bad data than it is to return no data (if you can
> get 
> any data, then it's usually possible to recover the database file or
> VM 
> image, but if you get none, it's a lot harder to recover the file).
No. Simply no! :D

Seriously:
If you have bad data, for whichever reason (crash, silent block errors,
etc.), it's always best to notice.
*Then* you can decide what to do:
- Is there a backup and does one want to get the data from that
  backup, rather than continuing to use bad data, possibly even
  overwriting good backups one week later
- Is there either no backup or the effort of recovering it is too big
  and the corruption doesn't matter enough (e.g. when you have large
  video files, and there is a single bit flip... well that may just
  mean that one colour looks a tiny bit different)

But that's nothing the fs could or should decide for the user.

After I had sent the initial mail of this thread I remembered what
I had forgotten to add:
Is there a way in btrfs to tell it to give clearance to a file
which it found to be in error based on checksums?

Cause *this* is IMHO the proper solution for your "it's almost always
better to return bad data than it is to return no data".

When we at the Tier-2 detect a file error that we cannot correct by
means of replicas, we determine the owner of that file, tell him about
the issue, and if he wants to continue using the broken file, there's a
way in the storage management system to rewrite the checksum.
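
The "clearance" workflow can be illustrated with a minimal Python
sketch. This is not our actual storage management system, and it is not
a btrfs interface either: the ChecksumCatalog class, its method names
and the use of SHA-256 are all assumptions made purely for illustration.

```python
import hashlib


def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


class ChecksumCatalog:
    """Hypothetical catalog mapping file names to their expected digests."""

    def __init__(self):
        self._expected = {}

    def register(self, name, content):
        self._expected[name] = digest(content)

    def verify(self, name, content):
        return self._expected[name] == digest(content)

    def accept_current_content(self, name, content, approved_by):
        # The policy decision is made by a human; the catalog only
        # records it by replacing the stored checksum.
        if not approved_by:
            raise ValueError("clearance needs an explicit approver")
        self._expected[name] = digest(content)


catalog = ChecksumCatalog()
catalog.register("run42.dat", b"good data")

damaged = b"good dat\x00"  # what is on disk now
flagged = not catalog.verify("run42.dat", damaged)  # the error is noticed

# The owner decides to keep using the file as it is.
catalog.accept_current_content("run42.dat", damaged, approved_by="file owner")
cleared = catalog.verify("run42.dat", damaged)
```

The point of the design is the ordering: the error is reported first,
and only an explicit decision makes the damaged content readable again.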


> > => Of course it wouldn't be as nice as in CoW, where it could
> >     simply take the most recent consistent state of that block, but
> >     still way better than:
> >     - delivering bogus data to the application in n other cases
> >     - not being able to decide which of m block copies is valid, if
> > a
> >       RAID is scrubbed
> This gets _really_ scarily dangerous for a RAID setup, because we 
> _absolutely_ can't ensure consistency between disks without using
> COW. 
Hmm now I just thought "damn he got me" ;-)

> As of right now, we dispatch writes to disks one at a time (although 
> this would still be just as dangerous even if we dispatched writes in
> parallel)
Sure...


> so if we crash it's possible that one disk would hold the old 
> data, one would hold the new data
sure..


> and _both_ would have correct 
> checksums, which means that we would non-deterministically return one
> block or the other when an application tries to read it, and which
> block 
> we return could change _each_ time the read is attempted, which 
> absolutely breaks the semantics required of a filesystem on any
> modern 
> OS (namely, the file won't change unless something writes to it).
Here I no longer follow you, so perhaps you (or someone else) can
explain a bit further. :-)

a) Are checksums really stored per device (and not just once in the
metadata)? At least from my naive understanding this would either mean
that there's a waste of storage, or that the csums are made on data
that could vary from device to device (e.g. the same data split up in
different extents, or compression on one device but not on the other).
but..

b) that problem (different data each with valid corresponding csums)
should in principle exist for CoWed data as well, right? And there, I
guess, it's solved by CoWing the metadata... (which would still be the
case for no-dataCoWed files).
Don't know what btrfs does in the CoWed case when such an incident
happens... how does it decide which of two such corresponding blocks
would be the newer one? The generations?

Anyway, since metadata would still be CoWed, I think I may have gotten
once again out of the tight spot - at least until you explain to me why
my naive understanding, as laid out just above, doesn't work out O:-)



> As I stated above, most of the stuff that nodatacow is intended for 
> already has its own built-in protection.  No self-respecting RDBMS 
> would be caught dead without internal consistency checks, and they
> all 
> do COW internally anyway (because it's required for atomic
> transactions, 
> which are an absolute requirement for database systems), and in fact 
> that's part of why performance is so horrible for them on a COW 
> filesystem.  As far as VMs go, either the disk image should have
> its 
> own internal consistency checks (for example, qcow2 format, used by 
> QEMU, which also does COW internally), or the guest OS should have
> such 
> checks.
Well, for PostgreSQL it's still fairly new (9.3, as I've said above,
https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums),
but it's not done per default
(http://www.postgresql.org/docs/current/static/app-initdb.html), and
they warn about a noticeable performance penalty (though I have of
course no data whether this would be better/similar/worse to what is
implied by btrfs checksumming).

I've tried to find something for MySQL/MariaDB, but the only thing I
could find there was: CHECKSUM TABLE
But that seems to be a SQL command, i.e. not on-read checksumming as
we're talking about, but rather something the application/admin would
need to do manually.


BDB seems to support it
(https://docs.oracle.com/cd/E17076_04/html/api_reference/C/dbset_flags.html),
but again not per default.
(And yes, we have quite big ones of them ^^)

SQLite doesn't seem to do it, at least not per default?
(https://www.sqlite.org/fileformat.html)


I tried once again to find any reference that qcow2 (which alone I
think would justify having csum support for nodatacow) supports
checksumming.
https://people.gnome.org/~markmc/qcow-image-format.html which seems to
be the original definition, doesn't say[1] anything about it.
Raw images do of course not do any form of checksumming...
I had a short glance at OVF, but nothing popped up immediately that
would make me believe it supports checksumming.
Well there's VDI and VHD left... but are these still used seriously?
I guess KVM and Xen people mostly use raw or qcow2 these days, don't
they?


So given all that, the picture looks a bit different again, I think.
None of the major FLOSS DBs does any checksumming per default, MySQL
doesn't seem to support it at all, AFAICT, and no VM image format seems
to support it either.

Not to mention the countless scientific data formats, which are
mostly not widely known to the FLOSS world, but which are used with
FLOSS software/Linux.


So AFAICT, the only thing left is torrent/edonkey files.
And do these store the checksums along the files? Or do they rather
wait until a chunk has been received, verify that and then throw it
away?
In any case, however, at least some of these file types eventually end
up as raw files without any checksum (as that's only used during
download),... so as long as the files remain in the nodatacow area,
they're again at risk (including the time between the P2P software
finally committing them to disk and their being moved to CoWed, and
thus checksummed, areas)


Cheers,
Chris. :-)


[0] http://abstrusegoose.com/120
[1] admittedly I just cross read over it, and searched for the usual
suspect strings (hash, crc, sum) ;)



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-15  3:15   ` Christoph Anton Mitterer
@ 2015-12-15 16:00     ` Austin S. Hemmelgarn
  2015-12-16  9:15       ` Duncan
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-15 16:00 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2015-12-14 22:15, Christoph Anton Mitterer wrote:
> On Mon, 2015-12-14 at 09:16 -0500, Austin S. Hemmelgarn wrote:
>>> When one starts to get a bit deeper into btrfs (from the admin/end-
>>> user
>>> side) one sooner or later stumbles across the recommendation/need
>>> to
>>> use nodatacow for certain types of data (DBs, VM images, etc.) and
>>> the
>>> reason, AFAIU, being the inherent fragmentation that comes along
>>> with
>>> the CoW, which is especially noticeable for those types of files
>>> with
>>> lots of random internal writes.
>> It is worth pointing out that in the case of DB's at least, this is
>> because at least some of them do COW internally to provide the
>> transactional semantics that are required for many workloads.
> Guess that also applies to some VM images then, IIRC qcow2 does CoW.
Yep, and I think that VMWare's image format does too.
>
>
>
>>> a) for performance reasons (when I consider our research software
>>> which
>>> often has IO as the limiting factor and where we want as much IO
>>> being
>>> used by actual programs as possible)...
>> There are other things that can be done to improve this.  I would
>> assume
>> of course that you're already doing some of them (stuff like using
>> dedicated storage controller cards instead of the stuff on the
>> motherboard), but some things often get overlooked, like actually
>> taking
>> the time to fine-tune the I/O scheduler for the workload (Linux has
>> particularly brain-dead default settings for CFQ, and the deadline
>> I/O
>> scheduler is only good in hard-real-time usage or on small hard
>> drives
>> that actually use spinning disks).
> Well sure, I think we've done most of this and have dedicated
> controllers, at least of a quality that funding allows us ;-)
> But regardless of how much one tunes and how good the hardware is: if
> you'd then always lose a fraction of your overall IO, be it just 5%,
> to defragging these types of files, one may actually want to avoid
> this altogether, for which nodatacow seems *the* solution.
nodatacow only works for that if the file is pre-allocated; if it isn't, 
then it still ends up fragmented.
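
As a rough illustration of "mark it nodatacow and pre-allocate it", here
is a Python sketch for Linux. The ioctl numbers and FS_NOCOW_FL are the
values from <linux/fs.h> on 64-bit Linux (FS_NOCOW_FL is the flag behind
`chattr +C`); the function name is made up, and setting the flag while
the file is still empty is an assumption based on how chattr documents
the C attribute for btrfs. On filesystems without such a flag the
function simply reports that it was not applied.

```python
import fcntl
import os
import struct
import tempfile

# From <linux/fs.h>, 64-bit Linux; treat as assumptions elsewhere.
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOW_FL = 0x00800000  # the attribute set by `chattr +C`


def get_flags(fd):
    raw = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


def create_preallocated_nocow(path, size):
    """Create an empty file, try to mark it NOCOW, then pre-allocate it.

    Returns True only if the flag is actually set afterwards.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        nocow = False
        try:
            # Set the flag while the file is still empty.
            flags = get_flags(fd)
            fcntl.ioctl(fd, FS_IOC_SETFLAGS,
                        struct.pack("i", flags | FS_NOCOW_FL))
            nocow = bool(get_flags(fd) & FS_NOCOW_FL)
        except OSError:
            pass  # filesystem does not support these flags
        # Reserve the whole file up front, before any data is written.
        os.posix_fallocate(fd, 0, size)
        return nocow
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vm.img")
    nocow_applied = create_preallocated_nocow(path, 1 << 20)
    allocated_size = os.path.getsize(path)
```

Whether the pre-allocated extents really stay in place afterwards is up
to the filesystem; the sketch only shows the order of the two steps.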
>
>
>> The big argument for defragmenting a SSD is that it makes it such
>> that
>> you require fewer I/O requests to the device to read a file
> I've had read about that too, but since I haven't had much personal
> experience or measurements in that respect, I didn't list it :)
I can't give any real numbers, but I've seen noticeable performance 
improvements on good SSD's (Intel, Samsung, and Crucial) when making 
sure that things are defragmented.
>
>> The problem is not entirely the lack of COW semantics, it's also the
>> fact that it's impossible to implement an atomic write on a hard
>> disk.
> Sure... but that's just the same for the nodatacow writes of data.
> (And the same, AFAIU, for CoW itself, just that we'd notice any
> corruption in case of a crash due to the CoWed nature of the fs and
> could go back to the last generation).
Yes, but it's also the reason that using either COW or a log-structured 
filesystem (like NILFS2, LogFS, or I think F2FS) is important for 
consistency.
>
>
>>> but I wouldn't know that relational DBs really do cheksuming of the
>>> data.
>> All the ones I know of except GDBM and BerkDB do in fact provide the
>> option of checksumming.  It's pretty much mandatory if you want to be
>> considered for usage in financial, military, or medical applications.
> Hmm I see... PostgreSQL seems to have it since 9.3 ... didn't know
> that... only a 16-bit checksum but at least something.
>
>
>>> Long story short, it does happen every now and then, that a scrub
>>> shows
>>> file errors, for neither the RAID was broken, nor there were any
>>> block
>>> errors reported by the disks, or anything suspicious in SMART.
>>> In other words, silent block corruption.
>> Or a transient error in system RAM that ECC didn't catch, or a
>> undetected error in the physical link layer to the disks, or an error
>> in
>> the disk cache or controller, or any number of other things.
> Well sure,... I was referring to these particular cases, where silent
> block corruption was the most likely reason.
> The data was reproducibly read identical, which probably rules out bad
> RAM or controller, etc.
>
>
>>    BTRFS
>> could only protect against some cases, not all (for example, if you
>> have
>> a big enough error in RAM that ECC doesn't catch it, you've got
>> serious
>> issues that just about nothing short of a cold reboot can save you
>> from).
> Sure, I haven't claimed, that checksumming for no-CoWed data is a
> solution for everything.
>
>
>>> But, AFAIU, not doing CoW, while not having a journal (or does it
>>> have
>>> one for these cases???) almost certainly means that the data (not
>>> necessarily the fs) will be inconsistent in case of a crash during
>>> a
>>> no-CoWed write anyway, right?
>>> Wouldn't it be basically like ext2?
>> Kind of, but not quite.  Even with nodatacow, metadata is still COW,
>> which is functionally as safe as a traditional journaling filesystem
>> like XFS or ext4.
> Sure, I was referring to the data part only, should have made that more
> clear.
>
>
>> Absolute worst case scenario for both nodatacow on
>> BTRFS, and a traditional journaling filesystem, the contents of the
>> file
>> are inconsistent.  However, almost all of the things that are
>> recommended use cases for nodatacow (primarily database files and VM
>> images) have some internal method of detecting and dealing with
>> corruption (because of the traditional filesystem semantics ensuring
>> metadata consistency, but not data consistency).
> What about VMs? At least a quick google search didn't give me any
> results on whether there would be e.g. checksumming support for qcow2.
> For raw images there surely is not.
I don't mean that the VMM does checksumming, I mean that the guest OS 
should be the one to handle the corruption.  Every sane OS runs at 
least some form of consistency check when mounting a filesystem.
>
> And even if DBs do some checksumming now, it may be just a consequence
> of that missing in the filesystems.
> As I've written somewhere else in the previous mail: it's IMHO much
> better if one system takes care of this, where the code is well tested,
> than each application doing its own thing.
That's really a subjective opinion.  The application knows better than 
we do what type of data integrity it needs, and can almost certainly do 
a better job of providing it than we can.  This is actually essentially 
the same reason that BTRFS and ZFS have multi-device support: the 
filesystem knows much better than the block device how it stores data, 
so it makes more sense to handle laying that data out across the disks 
in the filesystem.
>
>
>>>     - the data was written out correctly, but before the csum was
>>>       written the system crashed, so the csum would now tell us that
>>> the
>>>       block is bad, while in reality it isn't.
>> There is another case to consider, the data got written out, but the
>> crash happened while writing the checksum (so the checksum was
>> partially
>> written, and is corrupt).  This means we get a false positive on a
>> disk
>> error that isn't there, even when the data is correct, and that
>> should
>> be avoided if at all possible.
> I've had that, and I've left it quoted above.
> But as I've said before: That's one case out of many? How likely is it
> that the crash happens exactly after a large data block has been
> written followed by a relatively tiny amount of checksum data.
> I'd assume it's far more likely that the crash happens during writing
> the data.
Except that the whole metadata block pointing to that data block gets 
rewritten, not just the checksum.
>
> And regarding "reporting data to be in error, which is actually
> correct"... isn't that what all journaling systems may do?
No, most of them don't actually do that.  The general design of a 
journaling filesystem is that the journal is used as what's called a 
Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm going 
to write this data here in a little while.' so that when your system 
dies while writing that data, you can then finish writing it correctly 
when the system gets booted up again.  And in particular, the only 
journaling filesystem that I know of that even allows the option of 
journaling the file contents instead of just metadata is ext4.
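
The write-intent idea can be shown with a small Python toy model. It is
only a sketch of the general principle described above: the
JournaledStore class is hypothetical, the journal here keeps the full
new contents of each block, and real journals additionally need commit
records, ordering barriers and so on.

```python
class JournaledStore:
    """Toy block store with a write-intent journal."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks  # the "real" on-disk locations
        self.journal = []              # pending (index, payload) intents

    def write(self, idx, payload, crash_during_apply=False):
        # 1. Record the intent before touching the real location.
        self.journal.append((idx, payload))
        if crash_during_apply:
            # Simulated crash half way through the in-place write.
            self.blocks[idx] = payload[: len(payload) // 2]
            return
        # 2. Write in place.
        self.blocks[idx] = payload
        # 3. The intent is fulfilled, so retire the journal entry.
        self.journal.pop()

    def recover(self):
        """Replay every pending intent after a crash."""
        for idx, payload in self.journal:
            self.blocks[idx] = payload
        self.journal.clear()


store = JournaledStore(2)
store.write(0, b"metadata v2", crash_during_apply=True)
torn = store.blocks[0]          # half-written block left by the crash

store.recover()                 # next boot: finish the interrupted write
recovered = store.blocks[0]
```

After replay the block holds the complete new contents, so nothing needs
to be reported as being in error.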
> And, AFAIU, isn't that also what can happen in btrfs? The data was
> already CoWed, but the metadata wasn't written out... so it would fall
> back somehow - here's where the unicorn[0] does its job - to an older
> generation?
Kind of.  There are some really rare cases on a multi-device filesystem 
where, if you get _really_ unlucky, things get corrupted such that the 
filesystem thinks that perfectly correct data is invalid, and that the 
other copy, which is corrupted, is valid.  (I've actually had this 
happen before; it was not fun trying to recover from it.)
> So that would be nothing really new.
>
>
>> Also, because of how disks work, and the internal layout of BTRFS,
>> it's
>> a lot more likely than you think that the data would be written but
>> the
>> checksum wouldn't.  The checksum isn't part of the data block, nor is
>> it
>> stored with it, it's actually a part of the metadata block that
>> stores
>> the layout of the data for that file on disk.
> Well it was clear to me, that data+csum isn't sequentially on disk are
> there any numbers from real studies how often it would happen that data
> is written correctly but not the metadata?
> And even if such study would show that - crash isn't the only problem
> we want to protect here (silent block errors, bus errors, etc).
> I don't want to say crashes never happen, but in my practical
> experience they don't happen that often either,...
>
> Losing a few blocks of valid data in the rare case of crashes, seems to
> be a penalty worth, when one gains confidence in data integrity in all
> others.
That _really_ depends on what the data is.  If you made that argument to 
the IT department at a financial institution, they would probably fall 
over laughing at you.
>
>
>> Because of the nature of
>> the stuff that nodatacow is supposed to be used for, it's almost
>> always
>> better to return bad data than it is to return no data (if you can
>> get
>> any data, then it's usually possible to recover the database file or
>> VM
>> image, but if you get none, it's a lot harder to recover the file).
> No. Simply no! :D
>
> Seriously:
> If you have bad data, for whichever reason (crash, silent block errors,
> etc.), it's always best to notice.
> *Then* you can decide what to do:
> - Is there a backup and does one want to get the data from that
>    backup, rather than continuing to use bad data, possibly even
>    overwriting good backups one week later
> - Is there either no backup or the effort of recovering it is too big
>    and the corruption doesn't matter enough (e.g. when you have large
>    video files, and there is a single bit flip... well that may just
>    mean that one colour looks a tiny bit different)
>
> But that's nothing the fs could or should decide for the user.
OK, good point about this being policy.  And in some cases (executables, 
configuration for administrative software, similar things), it is better 
to just return an error, but in many cases, that's not what most desktop 
users would want.  Think document files, where a single byte error could 
easily be corrected by the user, or configuration files for sanely 
written apps (It's a lot nicer (and less confusing for someone without a 
lot of low-level computer background) to say 'Hey, your configuration 
file is messed up, here's how to fix it', than it is to say 'Hey, I 
couldn't read your configuration file').  And because BTRFS is supposed 
to be a general purpose filesystem, it has to account for the case of 
desktop users, and because server admins are supposed to be smart, the 
default should be for desktop usage.
>
> After I've had sent the initial mail from this thread I remembered what
> I've had forgotten to add:
> Is there a way in btrfs, to tell it that gives clearance to a file
> which it found to be in error based on checksums?
>
> Cause *this* is IMHO the proper solution for your "it's almost always
> better to return bad data than it is to return no data".
>
> When we at the Tier-2 detect a file error that we cannot correct by
> means of replicas, we determine the owner of that file, tell him about
> the issue, and if he wants to continue using the broken file, there's a
> way in the storage management system to rewrite the checksum.
>
>
>>> => Of course it wouldn't be as nice as in CoW, where it could
>>>      simply take the most recent consistent state of that block, but
>>>      still way better than:
>>>      - delivering bogus data to the application in n other cases
>>>      - not being able to decide which of m block copies is valid, if
>>> a
>>>        RAID is scrubbed
>> This gets _really_ scarily dangerous for a RAID setup, because we
>> _absolutely_ can't ensure consistency between disks without using
>> COW.
> Hmm now I just thought "damn he got me" ;-)
>
>> As of right now, we dispatch writes to disks one at a time (although
>> this would still be just as dangerous even if we dispatched writes in
>> parallel)
> Sure...
>
>
>> so if we crash it's possible that one disk would hold the old
>> data, one would hold the new data
> sure..
>
>
>> and _both_ would have correct
>> checksums, which means that we would non-deterministically return one
>> block or the other when an application tries to read it, and which
>> block
>> we return could change _each_ time the read is attempted, which
>> absolutely breaks the semantics required of a filesystem on any
>> modern
>> OS (namely, the file won't change unless something writes to it).
> Here I do not longer follow you, so perhaps you (or someone else) can
> explain a bit further. :-)
>
> a) Are checksums really stored per device (and not just once in the
> metadata? At least from my naive understanding this would either mean
> that there's a waste of storage, or that the csums are made on data
> that could vary from device to device (e.g. the same data split up in
> different extents, or compression on one device but not on the other).
> but..
AFAIUI, checksums are stored per-instance for every block.  This is 
important in a multi-device filesystem in case you lose a device, so 
that you still have a checksum for the block.  There should be no 
difference between extent layout and compression between devices however.
>
> b) that problem (different data each with valid corresponding csums)
> should in principle exist for CoWed data as well, right? And there, I
> guess, it's solved by CoWing the metadata... (which would still be the
> case for no-dataCoWed files).
Yes.
> Don't know what btrfs does in the CoWed case when such incident
> happens... how does it decide which of two such corresponding blocks
> would be the newer one? The generations?
Usually, but like I mentioned above there are edge cases that can occur 
as a result of data corruption on disk or other really rare 
circumstances.  In the particular case of multiple copies of a block 
with different data but valid checksums, I'm about 95% certain that it 
will non-deterministically return one block or the other on an arbitrary 
read when the read doesn't hit the VFS cache.  This is a potential issue 
for COW as well, but much less likely because it can more easily detect 
the corruption and fix it.
>
> Anyway, since metadata would still be CoWed, I think I may have gotten
> once again out of the tight spot - at least until you explain me, why
> my naive understanding, as laid out just above, doesn't work out O:-)
Hmm, I had forgotten about the metadata being COW, that does avoid the 
situation above under the specified circumstances, but does not avoid it 
happening due to disk errors (although that's extremely unlikely, as it 
would require direct correlation of the errors in a way that is 
statistically impossible).
>
>
>> As I stated above, most of the stuff that nodatacow is intended for
>> already has its own built-in protection.  No self-respecting RDBMS
>> would be caught dead without internal consistency checks, and they
>> all
>> do COW internally anyway (because it's required for atomic
>> transactions,
>> which are an absolute requirement for database systems), and in fact
>> that's part of why performance is so horrible for them on a COW
>> filesystem.  As far as VM's go, either the disk image should have
>> it's
>> own internal consistency checks (for example, qcow2 format, used by
>> QEMU, which also does COW internally), or the guest OS should have
>> such
>> checks.
> Well, for PostgreSQL it's still fairly new (9.3, as I've said above,
> https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums),
> but it's not done per default
> (http://www.postgresql.org/docs/current/static/app-initdb.html), and they
> warn about a noticeable performance penalty (though I have of course no
> data whether this would be better/similar/worse to what is implied by
> btrfs checksumming).
>
> I've tried to find something for MySQL/MariaDB, but the only thing I
> could find there was: CHECKSUM TABLE
> But that seems to be a SQL command, i.e. not on-read checksumming as
> we're talking about, but rather something the application/admin would
> need to do manually.
I actually had been referring to this, with the assumption that the 
application would use it to verify its own data.  I hadn't realized 
PostgreSQL had in-line support for it.
>
>
> BDB seems to support it
> (https://docs.oracle.com/cd/E17076_04/html/api_reference/C/dbset_flags.html),
> but again not per default.
> (And yes, we have quite big ones of them ^^)
>
> SQLite doesn't seem to do it, at least not per default?
> (https://www.sqlite.org/fileformat.html)
>
>
> I tried once again to find any reference that qcow2 (which alone I
> think would justify having csum support for nodatacow) supports
> checksumming.
> https://people.gnome.org/~markmc/qcow-image-format.html which seems to
> be the original definition, doesn't tell[1] anything about it.
> Raw images do, of course, not do any form of checksumming...
> I had a short glance at OVF, but nothing popped up immediately that
> would make me believe it supports checksumming.
> Well there's VDI and VHD left... but are these still used seriously?
> I guess KVM and Xen people mostly use raw or qcow2 these days, don't
> they?
VDI is still widely used, because it's the default for Virtual Box when 
creating a VM.  VHD is way more widely used than it should be, solely 
because there are insane people out there using Windows as a 
virtualization host.  You also forgot VMDK, which is what VMWare uses 
almost exclusively, but I don't think it has built-in checksumming.

As for Xen, the BCP are to avoid using image files like the plague, and 
use disks directly instead (or more commonly, use either LVM, or ZFS 
with zvols).
>
>
> So given all that, the picture looks a bit different again, I think.
> None of the major FLOSS DBs does any checksumming per default, MySQL
> doesn't seem to support it, AFAICT. No VM image format seems to even
> support it.
Again, most of my intent in referring to those was that the application 
or the Guest OS would do the verification itself.
>
> And not to talk about countless of scientific data formats, which are
> mostly not widely known to the FLOSS world, but which are used with
> FLOSS software/Linux.
If the application doesn't have that type of thing built in, then that's 
not something the filesystem should be worrying about, that's the job of 
the application developers to deal with.  The point of a filesystem is 
to store data within the integrity guarantees provided by the hardware, 
possibly with some additional protection, not to save the user or 
application from making stupid choices.
>
> So AFAICT, the only thing left is torrent/edonkey files.
> And do these store the checksums along the files? Or do they rather
> wait until a chunk has been received, verify that and then throw it
> away?
> In any case however, at least some of these files types eventually end
> up in the raw files, without any checksum (as that's only used during
> download),... so when the files remain in the nodatacow area, they're
> again at risk (+ during the time after the P2P software has finally
> committed them to disk, and they'd be moved to CoWed and thus
> checksummed areas)
In the case of stuff like torrents and such, all the good software for 
working with them has an option to verify the file after downloading.

>
> [0] http://abstrusegoose.com/120
> [1] admittedly I just cross read over it, and searched for the usual
> suspect strings (hash, crc, sum) ;)
>



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-15 16:00     ` Austin S. Hemmelgarn
@ 2015-12-16  9:15       ` Duncan
  2015-12-16  9:55       ` Duncan
  2015-12-17  2:09       ` Christoph Anton Mitterer
  2 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2015-12-16  9:15 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> And in particular, the only
> journaling filesystem that I know of that even allows the option of
> journaling the file contents instead of just metadata is ext4.

IIRC, ext3 was the first to have it in Linux mainline, with data=writeback 
for the speed freaks that don't care about data loss, data=ordered as the 
default normal option (except for that infamous period when Linus lost 
his head and let people talk him into switching to data=writeback, 
despite the risks... he later came back to his senses and reverted that), 
and data=journal for the folks that were willing to trade a bit of 
speed for better data protection (tho it was famous for surprising 
everybody, in that in certain use-cases it was extremely fast, faster 
than data=writeback, something I don't think was ever fully explained).

To my knowledge ext3 still has that, tho I haven't used it in probably a 
decade.

Reiserfs has all three data= options as well, with data=ordered the 
default, tho it only had data=writeback initially.  While I've used 
reiserfs for years, it has always been with the default data=ordered 
since that was introduced, and I'd be surprised if data=journal had the 
same use-case speed advantage that it did on ext3, as it's too 
different.  Meanwhile, that early data=writeback default is where 
reiserfs got its ill repute for data loss, but it had long switched to 
data=ordered by default by the time Linus lost his senses and tried 
data=writeback by default on ext3.  Because I was on reiserfs from 
data=writeback era, I was rather glad most kernel hackers didn't want to 
touch it by the time Linus let them talk him into data=writeback on ext3, 
and thus left reiserfs (which again had long been data=ordered by default 
by then) well enough alone.

But I did help a few people running ext3 trace down their new ext3 
stability issues to that bad data=writeback experiment, and persuaded 
them to specify data=ordered, which solved their problems, so indeed 
they /were/ data=writeback related.  And happily, Linus did eventually 
regain his senses and return ext3 to data=ordered by default once again.

And based on what you said, ext4 still has all three data= options, 
including data=journal.  But I wasn't sure on that myself (tho I would 
have assumed it inherited it from ext3) and thus am /definitely/ not sure 
whether it inherits ext3's data=journal speed advantages in certain 
corner-cases.

I have no idea whether other journaled filesystems allow choosing the 
journal level or not, tho.  I only know of those three.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-15 16:00     ` Austin S. Hemmelgarn
  2015-12-16  9:15       ` Duncan
@ 2015-12-16  9:55       ` Duncan
  2015-12-17  2:09       ` Christoph Anton Mitterer
  2 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2015-12-16  9:55 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 15 Dec 2015 11:00:40 -0500 as
excerpted:

> AFAIUI, checksums are stored per-instance for every block.  This is
> important in a multi-device filesystem in case you lose a device, so
> that you still have a checksum for the block.  There should be no
> difference between extent layout and compression between devices
> however.

I don't believe that's quite correct.

What is correct, to the best of my knowledge, is that checksums are 
metadata, and thus have whatever duplication/parity level metadata is 
assigned.

For single devices, that is of course by default dup, 2X the metadata and 
thus 2X the checksums, both on the single data (as effectively the only 
choice on a single device, at least thru 4.3, tho there's a patch adding 
dup data as an option that I think should be in 4.4) when covering data, 
dup metadata when covering it.

For multiple devices, it's default raid1 metadata, default single data, 
so the picture doesn't differ much by default from the single-device 
default picture.  It's also possible to do single metadata, raidN data, 
which really doesn't make sense except for raid0 data, and thus I believe 
there's a warning about that sort of layout in newer mkfs.btrfs, or when 
lowering the metadata redundancy using balance filters.

But of course it's possible to do raid1 data and metadata, which would be 
two copies of each, regardless of the number of devices (except that it's 
2+, of course).  But the copies aren't 1:1 assigned.  That is, if they're 
equal generation, btrfs can read either checksum and apply it to either 
data/metadata block.  (Of course if they're not equal generation, btrfs 
will choose the higher one, thus covering the case of writing at the time 
of a crash, since either they will both be the same generation if the 
root block wasn't updated to the new one on either one yet, or one will 
be a higher/newer generation than the other, if it had already finished 
writing one but not the other at the time of the crash.)
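
The selection rule described above can be written down as a toy model. This is only a sketch of the reasoning in this mail, not btrfs code: the Copy record, its field names and pick_copy() are made up for illustration, and zlib's plain CRC-32 stands in for the CRC32C that btrfs actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Copy:
    device: str      # hypothetical device label
    generation: int  # generation the copy was written in
    data: bytes      # block contents as found on that device
    csum: int        # checksum recorded in metadata for this copy


def checksum(data: bytes) -> int:
    # Stand-in only: btrfs uses CRC32C, zlib offers plain CRC-32.
    return zlib.crc32(data)


def pick_copy(copies):
    """Return (copy, ambiguous) for the newest copy with a valid checksum.

    ambiguous is True when several valid copies share the newest
    generation but hold different data, i.e. nothing tells them apart.
    Returns (None, False) if no copy verifies.
    """
    valid = [c for c in copies if checksum(c.data) == c.csum]
    if not valid:
        return None, False
    newest = max(c.generation for c in valid)
    candidates = [c for c in valid if c.generation == newest]
    ambiguous = len({c.data for c in candidates}) > 1
    # This sketch simply takes the first candidate; a real reader could
    # just as well hit the other one.
    return candidates[0], ambiguous
```

With one copy at generation 41 and one at 42, the newer one wins. With two valid copies at the same generation but different contents, the sketch flags the result as ambiguous, which is the situation the following paragraphs warn about.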

This is why it's an extremely good idea if you have a pair of devices in 
raid1, and you mount one of them degraded/writable with the other 
unavailable for some reason, that you don't also mount the other one 
writable and then try to recombine them.  Chances are the generations 
wouldn't match and it'd pick the one with the higher generation, but if 
they did for some reason match, and both checksums were valid on their 
data, but the data differed... either one could be chosen, and a scrub 
might choose either one to fix the other, as well, which could in theory 
result in a file with intermixed blocks from the two different versions!

Just ensure that if one is mounted writable, it's the only one mounted 
writable if there's a chance of recombining, and you'll be fine, as it'll 
be the only one with advancing generations.  And if by some accident both 
are mounted writable separately, the best bet is to be sure and wipe the 
one, then add it as a new device, if you're going to reintroduce it to 
the same filesystem.

Of course this gets a bit more complicated with 3+ device raid1, since 
currently, there's still only two copies of each block and two copies of 
the checksum, meaning there's at least one device without a copy of each 
block, and if the filesystem is mounted degraded writable repeatedly with 
a random device missing...

Similarly, the permutations can be calculated for the other raid types, 
and for mixed raid types like raid6 data (specified) and raid1 metadata 
(unspecified so the default used), but I won't attempt that here.



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-15 16:00     ` Austin S. Hemmelgarn
  2015-12-16  9:15       ` Duncan
  2015-12-16  9:55       ` Duncan
@ 2015-12-17  2:09       ` Christoph Anton Mitterer
  2015-12-21 13:36         ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 12+ messages in thread
From: Christoph Anton Mitterer @ 2015-12-17  2:09 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs


On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
> > Well sure, I think we've done most of this and have dedicated
> > controllers, at least of a quality that funding allows us ;-)
> > But regardless how much one tunes, and how good the hardware is. If
> > you'd then always lose a fraction of your overall IO, and be it
> > just
> > 5%, to defragging these types of files, one may actually want to
> > avoid
> > this at all, for which nodatacow seems *the* solution.
> nodatacow only works for that if the file is pre-allocated, if it
> isn't, 
> then it still ends up fragmented.
Hmm is that "it may end up fragmented" or a "it will definitely"?
Cause I'd have hoped, that if nothing else had been written in the
meantime, btrfs would perhaps try to write next to the already
allocated blocks.
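
For reference, the pre-allocation being discussed could look roughly like the sketch below. It assumes the chattr tool from e2fsprogs is installed and that the target directory is on btrfs; on other filesystems the +C call simply fails and the function reports that. The function name and its parameters are made up for this example, and whether the preallocated space ends up contiguous is up to the allocator.

```python
import os
import subprocess


def create_preallocated_nocow(path, size_bytes, set_nocow=True):
    """Create an empty file, try to mark it NOCOW, then preallocate it.

    Returns True if chattr reported success, False otherwise.
    """
    # The C attribute has to be set while the file is still empty,
    # so create it first and only then reserve the space.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        nocow_ok = False
        if set_nocow:
            try:
                result = subprocess.run(["chattr", "+C", path],
                                        capture_output=True)
                nocow_ok = result.returncode == 0
            except FileNotFoundError:
                nocow_ok = False  # chattr is not installed
        # Reserve the full size up front instead of growing the file
        # piece by piece with later writes.
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
    return nocow_ok
```

Something like create_preallocated_nocow("/var/lib/images/vm.raw", 20 * 2**30) would then be called once before the image is first used; the path is only an example.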


> > > The problem is not entirely the lack of COW semantics, it's also
> > > the
> > > fact that it's impossible to implement an atomic write on a hard
> > > disk.
> > Sure... but that's just the same for the nodatacow writes of data.
> > (And the same, AFAIU, for CoW itself, just that we'd notice any
> > corruption in case of a crash due to the CoWed nature of the fs and
> > could go back to the last generation).
> Yes, but it's also the reason that using either COW or a log-
> structured 
> filesystem (like NILFS2, LogFS, or I think F2FS) is important for 
> consistency.
So then it's no reason why it shouldn't work.
The meta-data is CoWed, any incomplete writes of checksumdata in that
(be it for CoWed data or no-CoWed data, should the latter be
implemented), would be protected at that level.

Currently, the no-CoWed data is, AFAIU completely at risk of being
corrupted (no checksums, no journal).

Checksums on no-CoWed data would just improve that.


> > What about VMs? At least a quick google search didn't give me any
> > results on whether there would be e.g. checksumming support for
> > qcow2.
> > For raw images there surely is not.
> I don't mean that the VMM does checksumming, I mean that the guest OS
> should be the one to handle the corruption.  No sane OS doesn't run
> at 
> least some form of consistency checks when mounting a filesystem.
Well but we're not talking about having a filesystem that "looks clear"
here. For this alone we wouldn't need any checksumming at all.

We talk about data integrity protection, i.e. all files and their
contents. Nothing which a fsck inside a guest VM would ever notice (I
mean by a fsck), if there are just some bit flips or things like that.


> > 
> > And even if DBs do some checksumming now, it may be just a
> > consequence
> > of that missing in the filesystems.
> > As I've written somewhere else in the previous mail: it's IMHO much
> > better if one system takes care on this, where the code is well
> > tested,
> > than each application doing its own thing.
> That's really a subjective opinion.  The application knows better
> than 
> we do what type of data integrity it needs, and can almost certainly
> do 
> a better job of providing it than we can.
Hmm I don't see that.
When we, at the filesystem level, provide data integrity, then all data
is guaranteed to be valid.
What more should an application be able to provide? At best they can do
the same thing faster, but even for that I see no immediate reason to
believe it.

And in practice it seems far more likely that if countless applications
should do such a task on their own, that it's more error prone (that's why
we have libraries for all kinds of code, trying to reuse code,
minimising the possibility of errors in countless home-brew solutions),
or not done at all.


> > > >     - the data was written out correctly, but before the csum
> > > > was
> > > >       written the system crashed, so the csum would now tell us
> > > > that
> > > > the
> > > >       block is bad, while in reality it isn't.
> > > There is another case to consider, the data got written out, but
> > > the
> > > crash happened while writing the checksum (so the checksum was
> > > partially
> > > written, and is corrupt).  This means we get a false positive on
> > > a
> > > disk
> > > error that isn't there, even when the data is correct, and that
> > > should
> > > be avoided if at all possible.
> > I've had that, and I've left it quoted above.
> > But as I've said before: That's one case out of many? How likely is
> > it
> > that the crash happens exactly after a large data block has been
> > written followed by a relatively tiny amount of checksum data.
> > I'd assume it's far more likely that the crash happens during
> > writing
> > the data.
> Except that the whole metadata block pointing to that data block gets
> rewritten, not just the checksum.
But that's the case anyway, isn't it? With or without checksums.
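
To make the crash cases quoted above concrete, here is a small simulation. It is a deliberately simplified model and not how btrfs orders its writes: it assumes the data block is overwritten in place first and the checksum is committed afterwards in one atomic (CoWed) metadata update. simulate() and its crash_point names are invented for the example, and zlib's CRC-32 stands in for CRC32C.

```python
import zlib


def simulate(old: bytes, new: bytes, crash_point: str):
    """Return (data_on_disk, csum_in_metadata, verdict) after a crash.

    crash_point is one of:
      "during_data"  only the first half of the new data reached the disk
      "before_csum"  all new data is on disk, the checksum commit is not
      "none"         both the data and the checksum commit completed
    """
    if len(old) != len(new):
        raise ValueError("in-place overwrite needs equal-sized blocks")
    csum = zlib.crc32(old)  # metadata still describes the old contents
    if crash_point == "during_data":
        half = len(new) // 2
        data = new[:half] + old[half:]  # torn write
    elif crash_point == "before_csum":
        data = new
    elif crash_point == "none":
        data = new
        csum = zlib.crc32(new)  # atomic metadata commit happened
    else:
        raise ValueError(crash_point)
    verdict = "ok" if zlib.crc32(data) == csum else "csum mismatch"
    return data, csum, verdict
```

In this model both "during_data" and "before_csum" end in a mismatch. The first is a real torn write that nodatacow without checksums would hand to the application silently; the second is the false positive under discussion, where intact new data is reported as bad.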



> > And regarding "reporting data to be in error, which is actually
> > correct"... isn't that what all journaling systems may do?
> No, most of them don't actually do that.  The general design of a 
> journaling filesystem is that the journal is used as what's called a 
> Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm
> going 
> to write this data here in a little while.' so that when your system 
> dies while writing that data, you can then finish writing it
> correctly 
> when the system gets booted up again.  And in particular, the only 
> journaling filesystem that I know of that even allows the option of 
> journaling the file contents instead of just metadata is ext4.
Well but that's just what I say... the system crashes,... the journal
tells about anything that's not for sure cleanly on disk, even though
it may have actually made it.

Nothing more than what would happen in our case.


> > And, AFAIU, isn't that also what can happen in btrfs? The data was
> > already CoWed, but the metadata wasn't written out... so it would
> > fall
> > back somehow - here's where the unicorn[0] does it's job - to an
> > older
> > generation?
> Kind of, there are some really rare cases where it's possible if you
> get 
> _really_ unlucky on a multi-device filesystem that things get
> corrupted 
> such that the filesystem thinks that data that is perfectly correct
> is 
> invalid, and thinks that the other copy which is corrupted is valid. 
> (I've actually had this happen before, it was not fun trying to
> recover 
> from it).
Doesn't really speak against nodatacow checksumming, AFAICS.


> > Well it was clear to me, that data+csum isn't sequentially on disk
> > are
> > there any numbers from real studies how often it would happen that
> > data
> > is written correctly but not the metadata?
> > And even if such study would show that - crash isn't the only
> > problem
> > we want to protect here (silent block errors, bus errors, etc).
> > I don't want to say crashes never happen, but in my practical
> > experience they don't happen that often either,...
> > 
> > Losing a few blocks of valid data in the rare case of crashes,
> > seems to
> > be a penalty worth, when one gains confidence in data integrity in
> > all
> > others.
> That _really_ depends on what the data is.  If you made that argument
> to 
> the IT department at a financial institution, they would probably
> fall 
> over laughing at you.
Well but your point is completely moot, because someone who cares
so much about data wouldn't use nodatacow when btrfs has no journal
and the data could end up in any state in case of crash.

And I'm quite certain that any financial institution would rather get a
clear error message (i.e. because the checksums don't verify), after
which they can get a backup, than have corrupt data taken for valid,
and the debts of their customers being zeroed.

It's kinda strange how you argue against better integrity protection
;-)


> > But that's nothing the fs could or should decide for the user.
> OK, good point about this being policy.  And in some cases
> (executables, 
> configuration for administrative software, similar things), it is
> better 
> to just return an error, but in many cases, that's not what most
> desktop 
> users would want.  Think document files, where a single byte error
> could 
> easily be corrected by the user, or configuration files for sanely 
> written apps (It's a lot nicer (and less confusing for someone
> without a 
> lot of low-level computer background) to say 'Hey, your configuration
> file is messed up, here's how to fix it', than it is to say 'Hey, I 
> couldn't read your configuration file').  And because BTRFS is
> supposed 
> to be a general purpose filesystem, it has to account for the case of
> desktop users, and because server admins are supposed to be smart,
> the 
> default should be for desktop usage.
Well but that's just the point I've made. The fs cannot decide what's
better or not.
Your document could be an important config file that allows/disallows
remote users access to resources. The single byte error could make a 0
to a 1, allowing world wide access.
It could be your thesis' data, or part of the document file, changing
some numbers, which you won't easily notice but which makes everything
bogus when examined.
I had brought the example with the video file, where it may not matter.

But in any case it's nothing what the fs can decide. The best it can do
is give an error on read, and the tools to give clearance to such files
(when they could not be auto-recovered by e.g. other copies).

All this is however only possible with checksumming.
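
The policy argued for here (fail the read, but give the owner a way to accept the damaged file) can be sketched as a toy block store. Nothing in it corresponds to an existing btrfs interface; the class and method names are hypothetical, and CRC-32 from zlib is used only because it ships with Python.

```python
import zlib


class ChecksummedStore:
    """Toy store: verify on read, fail loudly, allow explicit clearance."""

    def __init__(self):
        self._data = {}
        self._csum = {}

    def write(self, key, data: bytes):
        self._data[key] = data
        self._csum[key] = zlib.crc32(data)

    def corrupt(self, key, data: bytes):
        # Test hook: change the stored bytes behind the checksum's back,
        # the way a silent block or bus error would.
        self._data[key] = data

    def read(self, key) -> bytes:
        data = self._data[key]
        if zlib.crc32(data) != self._csum[key]:
            # The store does not guess whether the damage matters.
            raise OSError(f"checksum mismatch for {key!r}")
        return data

    def accept_damaged(self, key) -> bytes:
        """Owner's decision: keep the current bytes, re-stamp the checksum."""
        data = self._data[key]
        self._csum[key] = zlib.crc32(data)
        return data
```

The point of the split is that read() never has to decide whether a flipped bit is harmless; that decision is made once, explicitly, by whoever calls accept_damaged().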


> > a) Are checksums really stored per device (and not just once in the
> > metadata? At least from my naive understanding this would either
> > mean
> > that there's a waste of storage, or that the csums are made on data
> > that could vary from device to device (e.g. the same data split up
> > in
> > different extents, or compression on one device but not on the
> > other).
> > but..
> AFAIUI, checksums are stored per-instance for every block.  This is 
> important in a multi-device filesystem in case you lose a device, so 
> that you still have a checksum for the block.  There should be no 
> difference between extent layout and compression between devices
> however.
hmm but if that's the case, especially the latter, that the extents are
the same on all devices,... then there's IMHO no need for data being
stored per-instance (I guess you mean per device instance?) for every
block.
The meta-data would have e.g. DUP anyway, so even if one device fails
metadata would hopefully be still there.
And if metadata is completely lost, the fs is lost anyway, and csums
don't matter anymore.


> > b) that problem (different data each with valid corresponding
> > csums)
> > should in principle exist for CoWed data as well, right? And there,
> > I
> > guess, it's solved by CoWing the metadata... (which would still be
> > the
> > case for no-dataCoWed files).
> Yes.
> > Don't know what btrfs does in the CoWed case when such incident
> > happens... how does it decide which of two such corresponding
> > blocks
> > would be the newer one? The generations?
> Usually, but like I mentioned above there are edge cases that can
> occur 
> as a result of data corruption on disk or other really rare 
> circumstances.  In the particular case of multiple copies of a block 
> with different data but valid checksums, I'm about 95% certain that
> it 
> will non-deterministically return one block or the other on an
> arbitrary 
> read when the read doesn't hit the VFS cache.
Hmm would be quite worrisome if that could happen, especially also in
the CoW case.

>   This is a potential issue 
> for COW as well, but much less likely because it can more easily
> detect 
> the corruption and fix it.
But then again, there should be no difference to checksumming the
no-CoWed data - the checksums would be CoWed again, if btrfs can detect
it there, it should be fine.




> > 
> > Anyway, since metadata would still be CoWed, I think I may have
> > gotten
> > once again out of the tight spot - at least until you explain me,
> > why
> > my naive understanding, as laid out just above, doesn't work out
> > O:-)
> Hmm, I had forgotten about the metadata being COW, that does avoid
> the 
> situation above under the specified circumstances, but does not avoid
> it 
> happening due to disk errors (although that's extremely unlikely, as
> it 
> would require direct correlation of the errors in a way that is 
> statistically impossible).
Ah... here we go :-)

What exactly do you mean with disk errors here? IOW, what scenario do
you think of, in which checksumming no-CoWed data could lead to any
more corruptions than it can do without checksumming as well, or where
any inconsistencies could get into the filesystem's meta-data, that
couldn't already come in for checksummed+CoWed data and/or non-
checksummed+CoWed data?

> > Well, for PostgreSQL it's still fairly new (9.3, as I've said above,
> > https://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_Checksums),
> > but it's not done per default
> > (http://www.postgresql.org/docs/current/static/app-initdb.html), and
> > they warn about a noticeable performance penalty (though I have of
> > course no data whether this would be better/similar/worse to what is
> > implied by btrfs checksumming).
> > 
> > I've tried to find something for MySQL/MariaDB, but the only thing
> > I
> > could find there was: CHECKSUM TABLE
> > But that seems to be a SQL command, i.e. not on-read checksumming
> > as
> > we're talking about, but rather something the application/admin
> > would
> > need to do manually.
> I actually had been referring to this, with the assumption that the 
> application would use it to verify its own data.  I hadn't realized 
> PostgreSQL had in-line support for it.
Well but the (fairly new) in-line support, is the only thing that we
can really count here.

What mysql does is that it requires the app to do it.
a) It's likely that there are many apps which don't use this (maybe
simply because they don't know it) and it's unlikely they'll all
change.
While what we can do at the btrfs level (or what postgresql would do)
works out of the box for everything.

b) I may simply not very well understand the CHECKSUM TABLE, but to me
it doesn't seem useful to provide data integrity in the sense we're
talking here about (i.e. silent block errors, bus errors, etc.)
Why?
First, it seems to checksum the whole data of the whole table, and
AFAICS, it uses only CRC32... given that such tables may be easily GiB
in size, CRC32 is IMHO simply not good enough.
Postgresql/btrfs in turn do the checksums on much smaller amounts of
data.
Second, verification seems to only take place when that command is
called. I'm not sure whether it implies locking the table in memory
then (didn't dig too deep), but I can't believe it would, which system
could keep a 100 GiB table in mem?
So it seems to be basically a one-shot verification, not covering any
corruptions that happen in between.
In fact, the documentation of the function even tells that this is for
backups/rollbacks/etc. only... so it's absolutely not that kind of data
integrity protection we're talking about (and even for that purpose,
CRC32 seems to be a poor choice).
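
The difference between one checksum over a whole table and checksums over small blocks can be shown in a few lines. This is a generic illustration, not a model of what MySQL's CHECKSUM TABLE or btrfs do internally; the 4096-byte block size and both function names are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE):
    """One CRC-32 per fixed-size block of data."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


def find_bad_blocks(data: bytes, expected, block_size: int = BLOCK_SIZE):
    """Indices of blocks whose current checksum differs from the expected one."""
    actual = block_checksums(data, block_size)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]
```

A single whole-object checksum can only say that something somewhere changed, and only when somebody recomputes it over the entire object. Per-block checksums can be verified on each read, point at the damaged block, and keep the amount of data covered by each 32-bit value small.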


> VDI is still widely used, because it's the default for Virtual Box
> when 
> creating a VM.
Guess I just disbelieved that VirtualBox is still widely used O;-)


>   VHD is way more widely used than it should be, solely 
> because there are insane people out there using Windows as a 
> virtualization host.  You also forgot VMDK, which is what VMWare uses
> almost exclusively, but I don't think it has built-in checksumming.
> 
> As for Xen, the BCP are to avoid using image files like the plague,
> and 
> use disks directly instead (or more commonly, use either LVM, or ZFS 
> with zvols).
Anyway... what it comes down to: None of the VM image formats seem to
support checksumming.


> > So given all that, the picture looks a bit different again, I
> > think.
> > None of the major FLOSS DBs does any checksumming per default,
> > MySQL
> > doesn't seem to support it, AFAICT. No VM image format seems to
> > even
> > support it.
> Again, most of my intent in referring to those was that the
> application 
> or the Guest OS would do the verification itself.
I've answered that above already, IIRC (our mails get too lengthy O:-)
).
The guest OS doesn't verify more than what our typical host OS (Linux)
does.
And that (except when btrfs with CoWed data is used ;-) does filesystem
integrity verification - which is however not data integrity
verification.


btw: That makes me think about something interesting:
If btrfs will ever support checksumming on no-CoWed data... then the
documentation should describe, that depending on the actual scenario,
it may make sense that btrfs filesystems inside the guest run generally
with nodatasum.
The idea being: why verify twice?

The constraints being, AFAICS, the following:
- If the VM image is ever to be moved off the host's btrfs filesystem
  (which would have checksumming enabled) to a fs without checksumming,
  or if it would ever be copied remotely, then not having the "internal"
  checksumming (i.e. from the btrfs filesystems inside the guest) would
  make one lose the integrity protection.
- It further would only work if the hypervisor itself properly passes
  on any IO errors it gets when reading from the image files in the
  host's btrfs, as block IO errors to the guest. If it didn't, then
  the guest (with disabled "internal" checksumming) probably wouldn't
  notice any data integrity errors, which it would if the "internal"
  checksumming wasn't turned off.


> If the application doesn't have that type of thing built in, then
> that's 
> not something the filesystem should be worrying about, that's the job
> of 
> the application developers to deal with.
No.

If you see it like that, you could as well drop data checksumming in
btrfs completely.
You'd argue anyway that it would be the application's duty to do that.

The same way you could argue that your MP4, JPG, ISO image or whatever
you downloaded via bittorrent needs to contain checksum data, and that
the actual application (which is not bittorrent, but e.g. mplayer,
imagemagick or wodim) needs to verify it.
However, this is not the case.
In fact, it's quite usual and proper to have the lower layer handle
stuff which the higher layers don't have any real direct interest in
(except for the case that the lower layer doesn't do it).


> The point of a filesystem is 
> to store data within the integrity guarantees provided by the
> hardware, 
> possibly with some additional protection
If you're convinced by that, you should probably propose that btrfs
removes data checksumming altogether.
I guess you won't make many friends with that idea ;)



> In the case of stuff like torrents and such, all the good software
> for 
> working with them has an option to verify the file after downloading.
Not sure what you mean by "and such"; if it's again VMs and DBs (IMHO
the actually more important use case than file sharing), I showed you
in the last mail that none of these do any verification by default, and
half of the important ones don't even support it, with nothing on the
horizon suggesting that this will change (probably because they argue
just the other way round than you do: the fs should handle data
integrity, and ZFS and btrfs prove them partially right ;-) ).
And I'm not sure about torrents, but I'd have suspected that once the
file is downloaded completely, any checksumming data is no longer kept.
If my guess is correct, even the torrent software doesn't really do
overall data integrity protection, but only until the download is
finished; at least this used to be the case with other P2P network
software.



Thanks for the discussion so far :)
It actually made me just more confident that no-CoWed data checksumming
should work and is actually needed ;)


Cheers,
Chris.



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-17  2:09       ` Christoph Anton Mitterer
@ 2015-12-21 13:36         ` Austin S. Hemmelgarn
  2015-12-22  9:12           ` Duncan
  0 siblings, 1 reply; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-21 13:36 UTC (permalink / raw)
  To: Christoph Anton Mitterer, linux-btrfs

On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>>> Well sure, I think we'de done most of this and have dedicated
>>> controllers, at least of a quality that funding allows us ;-)
>>> But regardless how much one tunes, and how good the hardware is. If
>>> you'd then loose always a fraction of your overall IO, and be it
>>> just
>>> 5%, to defragging these types of files, one may actually want to
>>> avoid
>>> this at all, for which nodatacow seems *the* solution.
>> nodatacow only works for that if the file is pre-allocated, if it
>> isn't,
>> then it still ends up fragmented.
> Hmm is that "it may end up fragmented" or a "it will definitely?
> Cause I'd have hoped, that if nothing else had been written in the
> meantime, btrfs would perhaps try to write next to the already
> allocated blocks.
If there are multiple files being written, then there is a relatively 
high probability that they will end up fragmented if they are more than 
about 64k and aren't pre-allocated.
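
To make the pre-allocation point concrete, here is a minimal sketch of reserving a file's full size up front from userspace. It assumes Linux and Python's os.posix_fallocate; the helper name, file name, and size are made up for illustration, and it says nothing about how btrfs then lays the extents out.

```python
import os
import tempfile

def preallocate(path: str, size: int) -> int:
    """Create `path`, reserve `size` bytes up front, and return the file size."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        # Ask the filesystem to allocate the whole range now, instead of
        # growing the file piecemeal as later writes arrive.
        os.posix_fallocate(fd, 0, size)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # Hypothetical image file of 1 MiB.
        print(preallocate(os.path.join(d, "vm.img"), 1 << 20))
```

Whether the reserved range ends up contiguous is still up to the allocator; the sketch only shows the "allocate before writing" step that the paragraph above refers to.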
>
>
>>>> The problem is not entirely the lack of COW semantics, it's also
>>>> the
>>>> fact that it's impossible to implement an atomic write on a hard
>>>> disk.
>>> Sure... but that's just the same for the nodatacow writes of data.
>>> (And the same, AFAIU, for CoW itself, just that we'd notice any
>>> corruption in case of a crash due to the CoWed nature of the fs and
>>> could go back to the last generation).
>> Yes, but it's also the reason that using either COW or a log-
>> structured
>> filesystem (like NILFS2, LogFS, or I think F2FS) is important for
>> consistency.
> So then it's no reason why it shouldn't work.
> The meta-data is CoWed, any incomplete writes of checksumdata in that
> (be it for CoWed data or no-CoWed data, should the later be
> implemented), would be protected at that level.
>
> Currently, the no-CoWed data is, AFAIU completely at risk of being
> corrupted (no checksums, no journal).
>
> Checksums on no-CoWed data would just improve that.
Except that without COW semantics on the data blocks, you can't be sure 
whether the checksum is for the data that is there, the data that was 
going to be written there, or data that had been there previously.  This 
will significantly increase the chances of having false positives, which 
really isn't a viable tradeoff.
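
The false-positive case can be illustrated with a toy model. This is only a sketch under simplifying assumptions: one data block, one checksum slot, and zlib.crc32 standing in for CRC32c; the class and method names are invented for the example and do not correspond to anything in the btrfs code.

```python
import zlib

def csum(block: bytes) -> int:
    # zlib.crc32 is plain CRC32, used here only as a stand-in for CRC32c.
    return zlib.crc32(block)

class ToyDisk:
    """One data block plus one checksum slot, updated as two separate writes."""

    def __init__(self, data: bytes):
        self.data = data
        self.stored_csum = csum(data)

    def overwrite_in_place(self, new: bytes, crash_after_data: bool = False) -> None:
        self.data = new                  # step 1: data is overwritten in place
        if crash_after_data:
            return                       # crash: the checksum update never lands
        self.stored_csum = csum(new)     # step 2: checksum is committed later

    def verify(self) -> bool:
        return csum(self.data) == self.stored_csum

disk = ToyDisk(b"old contents")
disk.overwrite_in_place(b"new contents", crash_after_data=True)
print(disk.verify())  # False: the new data is intact, yet verification fails
```

With COW data blocks the old block and its old checksum would both still be in place after the crash, so the same interruption would leave a consistent (if older) pair.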
>
>
>>> What about VMs? At least a quick google search didn't give me any
>>> results on whether there would be e.g. checksumming support for
>>> qcow2.
>>> For raw images there surely is not.
>> I don't mean that the VMM does checksumming, I mean that the guest OS
>> should be the one to handle the corruption.  No sane OS doesn't run
>> at
>> least some form of consistency checks when mounting a filesystem.
> Well but we're not talking about having a filesystem that "looks clear"
> here. For this alone we wouldn't need any checksumming at all.
>
> We talk about data integrity protection, i.e. all files and their
> contents. Nothing which a fsck inside a guest VM would ever notice (I
> mean by a fsck), if there are just some bit flips or things like that.
That really depends on what is being done inside the VM.  If you're 
using BTRFS or even dm-verity, you should have no issues detecting the 
corruption.
>
>
>>>
>>> And even if DBs do some checksumming now, it may be just a
>>> consequence
>>> of that missing in the filesystems.
>>> As I've written somewhere else in the previous mail: it's IMHO much
>>> better if one system takes care on this, where the code is well
>>> tested,
>>> than each application doing it's own thing.
>> That's really a subjective opinion.  The application knows better
>> than
>> we do what type of data integrity it needs, and can almost certainly
>> do
>> a better job of providing it than we can.
> Hmm I don't see that.
> When we, at the filesystem level, provide data integrity, than all data
> is guaranteed to be valid.
> What more should an application be able to provide? At best they can do
> the same thing faster, but even for that I see no immediate reason to
> believe it.
Any number of things.  As of right now, there are no local filesystems 
on Linux that provide:
1. Cryptographic verification of the file data (Technically possible 
with IMA and EVM, or with DM-Verity (if the data is supposed to be 
read-only), but those require extra setup, and aren't part of the FS).
2. Erasure coding other than what is provided by RAID5/6 (At least one 
distributed cluster filesystem provides this (Ceph), but running such a 
FS on a single node is impractical).
3. Efficient transactional logging (for example, the type that is needed 
by most RDBMS software).
4. Easy selective protections (Some applications need only part of their 
data protected).

Item 1 can't really be provided by BTRFS under its current design, it
would require at least implementing support for cryptographically secure
hashes in place of CRC32c (and each attempt to do that has been pretty
much shot down).  Item 2 is possible, and is something I would love to
see support for, but would require a significant amount of coding, and
almost certainly wouldn't be anywhere near as flexible as letting the
application do it itself.  Item 3 can't be done without making the
filesystem application specific, because you need to know enough about
the data being logged to do it efficiently (see the original Oracle
Cluster Filesystem for an example (not OCFS2), it was designed solely
for Oracle's database software).  Item 4 is technically possible, but
not all that practical, as the amount of metadata required to track
different levels of protection within a file would be prohibitive.
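
As a rough illustration of what application-level verification (item 1) can look like, here is a minimal sketch that stores a SHA-256 digest in front of each record and checks it on read. The seal/unseal names and the record layout are assumptions made up for this example, not any particular application's format.

```python
import hashlib
import hmac

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest."""
    return hashlib.sha256(payload).digest() + payload

def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if it no longer matches its digest."""
    digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
    if not hmac.compare_digest(digest, hashlib.sha256(payload).digest()):
        raise ValueError("record failed verification")
    return payload

record = seal(b"account=42;balance=1000")
assert unseal(record) == b"account=42;balance=1000"
```

An unkeyed digest like this only catches accidental corruption; protecting against someone who can rewrite both payload and digest would need a keyed MAC or a signature, which is beyond this sketch.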
>
> And in practise it seems far more likely that if countless applications
> should such task on their own, that it's more error prone (that's why
> we have libraries for all kinds of code, trying to reuse code,
> minimising the possibility of errors in countless home-brew solutions),
> or not done at all.
Yes, and _all_ the libraries are in userspace, which is all the more
reason for the protection being done there.
>
>
>>>>>      - the data was written out correctly, but before the csum
>>>>> was
>>>>>        written the system crashed, so the csum would now tell us
>>>>> that
>>>>> the
>>>>>        block is bad, while in reality it isn't.
>>>> There is another case to consider, the data got written out, but
>>>> the
>>>> crash happened while writing the checksum (so the checksum was
>>>> partially
>>>> written, and is corrupt).  This means we get a false positive on
>>>> a
>>>> disk
>>>> error that isn't there, even when the data is correct, and that
>>>> should
>>>> be avoided if at all possible.
>>> I've had that, and I've left it quoted above.
>>> But as I've said before: That's one case out of many? How likely is
>>> it
>>> that the crash happens exactly after a large data block has been
>>> written followed by a relatively tiny amount of checksum data.
>>> I'd assume it's far more likely that the crash happens during
>>> writing
>>> the data.
>> Except that the whole metadata block pointing to that data block gets
>> rewritten, not just the checksum.
> But that's the case anyway, isn't it? With or without checksums.
Yes, and it's also one of the less well documented failure modes for 
nodatacow. If the data is COW, then BTRFS doesn't even look at the new 
data, because the only metadata block that points to it is invalid, so 
you see old data, but you are also guaranteed to see verified data.
>
>
>
>>> And regarding "reporting data to be in error, which is actually
>>> correct"... isn't that what all journaling systems may do?
>> No, most of them don't actually do that.  The general design of a
>> journaling filesystem is that the journal is used as what's called a
>> Write-Intent-Log (WIL), the purpose of which is to say 'Hey, I'm
>> going
>> to write this data here in a little while.' so that when your system
>> dies while writing that data, you can then finish writing it
>> correctly
>> when the system gets booted up again.  And in particular, the only
>> journaling filesystem that I know of that even allows the option of
>> journaling the file contents instead of just metadata is ext4.
> Well but that's just what I say... the system crashes,... the journal
> tells about anything that's not for sure cleanly on disk, even though
> it may have actually made it it.
Except, like I said, it doesn't track data, only metadata, so only stuff 
for which allocations changed would be covered by the journal.
>
> Nothing more than what would happen in our case.
>
>
>>> And, AFAIU, isn't that also what can happen in btrfs? The data was
>>> already CoWed, but the metadata wasn't written out... so it would
>>> fall
>>> back somehow - here's where the unicorn[0] does it's job - to an
>>> older
>>> generation?
>> Kind of, there are some really rare cases where it's possible if you
>> get
>> _really_ unlucky on a multi-device filesystem that things get
>> corrupted
>> such that the filesystem thinks that data that is perfectly correct
>> is
>> invalid, and thinks that the other copy which is corrupted is valid.
>> (I've actually had this happen before, it was not fun trying to
>> recover
>> from it).
> Doesn't really speak against nodatacow checksumming, AFAICS.
You're right, it was more meant to point out that even with COW, stuff 
can get confused if you're really unlucky.
>
>
>>> Well it was clear to me, that data+csum isn't sequentially on disk
>>> are
>>> there any numbers from real studies how often it would happen that
>>> data
>>> is written correctly but not the metadata?
>>> And even if such study would show that - crash isn't the only
>>> problem
>>> we want to protect here (silent block errors, bus errors, etc).
>>> I don't want to say crashes never happen, but in my practical
>>> experience they don't happen that often either,...
>>>
>>> Losing a few blocks of valid data in the rare case of crashes,
>>> seems to
>>> be a penalty worth, when one gains confidence in data integrity in
>>> all
>>> others.
>> That _really_ depends on what the data is.  If you made that argument
>> to
>> the IT department at a financial institution, they would probably
>> fall
>> over laughing at you.
> Well but your point is completely moot, because for someone who cares
> so much in data, they wouldn't use nodatacow when btrfs has no journal
> and the data could end up in any state in case of crash.
>
> And I'm quite certain that each financial institution rather clearly
> gets an error message (i.e. because the checksums don't very), after
> which they can get a backup, than having corrupt data taking for valid,
> and the debts of their customers being zeroed.
That all assumes that the administrators in question are smart.  This is 
_never_ a safe assumption unless you have personally verified it, and 
even then it's still not a particularly safe assumption.
>
> It's kinda strange how you argue against better integrity protection
> ;-)
The point was that your argument that 'losing a few blocks of valid data
on a crash is worth it for better integrity' was pretty far-fetched.
For almost all applications out there, losing known good data or getting
false errors is never something that should happen.
>
>
>>> But that's nothing the fs could or should decide for the user.
>> OK, good point about this being policy.  And in some cases
>> (executables,
>> configuration for administrative software, similar things), it is
>> better
>> to just return an error, but in many cases, that's not what most
>> desktop
>> users would want.  Think document files, where a single byte error
>> could
>> easily be corrected by the user, or configuration files for sanely
>> written apps (It's a lot nicer (and less confusing for someone
>> without a
>> lot of low-level computer background) to say 'Hey, your configuration
>> file is messed up, here's how to fix it', than it is to say 'Hey, I
>> couldn't read your configuration file').  And because BTRFS is
>> supposed
>> to be a general purpose filesystem, it has to account for the case of
>> desktop users, and because server admins are supposed to be smart,
>> the
>> default should be for desktop usage.
> Well but that's just the point I've made. The fs cannot decide what's
> better or not.
> Your document could be an important config file that allows/disallows
> remote users access to resources. The single byte error could make a 0
> to a 1, allowing world wide access.
That's not something that falls within actual 'Desktop' usage, that's
server usage.
> It could be your thesis' data, or part of the document file, changing
> some numbers, which you won't easily notice but which makes everything
> bogus when examined.
And if you're writing a thesis, or some other research paper, you'd darn 
well better be verifying your data multiple times before you publish it.
> I had brought the example with the video file, where it may not matter.
It really doesn't in the case of a video file, or most audio files, or 
even some image files.  If you take almost any arbitrary video file, and 
change any one bit outside of the header, then unless it's very poor 
quality to begin with, it's almost certain that nobody will notice (and 
in the case of the good formats, it'll just result in a dropped frame, 
because they have built-in verification).
>
> But in any case it's nothing what the fs can decide. The best it can do
> is give an error on read, and the tools to give clearance to such files
> (when they could not be auto-recovered by e.g. other copies).
>
> All this is however only possible with checksumming.
Or properly educating users so they don't use nodatacow on everything.
It's just like journal=writeback on ext4, it improves performance for 
some things, but can result in really weird inconsistencies when the 
system crashes.
>
>
>>> a) Are checksums really stored per device (and not just once in the
>>> metadata? At least from my naive understanding this would either
>>> mean
>>> that there's a waste of storage, or that the csums are made on data
>>> that could vary from device to device (e.g. the same data split up
>>> in
>>> different extents, or compression on one device but not on the
>>> other).
>>> but..
>> AFAIUI, checksums are stored per-instance for every block.  This is
>> important in a multi-device filesystem in case you lose a device, so
>> that you still have a checksum for the block.  There should be no
>> difference between extent layout and compression between devices
>> however.
> hmm but if that's the case, especially the later, that the extents are
> the same on all devices,... then there's IMHO no need for data being
> stored per-instance (I guess you mean per device instance?) for every
> block.
> The meta-data would have e.g. DUP anyway, so even if one device fails
> metadata would hopefully be still there.
> And if metadata is coompletely lost, the fs is lost anyway, and csums
> don't matter anymore.
OK, as Duncan pointed out in one of his replies, I was only correct by
coincidence.  Checksums are stored based on metadata redundancy, so if
metadata is raid1 or dup, you have two copies of each checksum.
>
>
>>> b) that problem (different data each with valid corresponding
>>> csums)
>>> should in principle exist for CoWed data as well, right? And there,
>>> I
>>> guess, it's solved by CoWing the metadata... (which would still be
>>> the
>>> case for no-dataCoWed files).
>> Yes.
>>> Don't know what btrfs does in the CoWed case when such incident
>>> happens... how does it decide which of two such corresponding
>>> blocks
>>> would be the newer one? The generations?
>> Usually, but like I mentioned above there are edge cases that can
>> occur
>> as a result of data corruption on disk or other really rare
>> circumstances.  In the particular case of multiple copies of a block
>> with different data but valid checksums, I'm about 95% certain that
>> it
>> will non-deterministically return one block or the other on an
>> arbitrary
>> read when the read doesn't hit the VFS cache.
> Hmm would be quite worrysome if that could happen, especially also in
> the CoW case.
The thing is, this can't be fully protected against, except by verifying
the blocks against each other when you read them, which will absolutely
kill performance.  The chance of this happening (without actively
malicious intent) with COW on everything is extremely small (it requires
a very large number of highly correlated errors), but having nodatacow
enabled makes it slightly higher.  In both cases it's statistically
impossible, but that just means it's something that almost certainly
won't happen, and thus we shouldn't worry about dealing with it until we
have everything else covered.
>
>>    This is a potential issue
>> for COW as well, but much less likely because it can more easily
>> detect
>> the corruption and fix it.
> But then again, there should be no difference to checksumming the
> no-CoWed data - the checksums would be CoWed again, if btrfs can detect
> it there, it should be fine.
>
>
>
>
>>>
>>> Anyway, since metadata would still be CoWed, I think I may have
>>> gotten
>>> once again out of the tight spot - at least until you explain me,
>>> why
>>> my naive understanding, as laid out just above, doesn't work out
>>> O:-)
>> Hmm, I had forgotten about the metadata being COW, that does avoid
>> the
>> situation above under the specified circumstances, but does not avoid
>> it
>> happening due to disk errors (although that's extremely unlikely,a s
>> it
>> would require direct correlation of the errors in a way that is
>> statistically impossible).
> Ah... here we go :-)
>
> What exactly do you mean with disk errors here? IOW, what scenario do
> you think of, in which checksumming no-CoWed data could lead to any
> more corruptions than it can to without checksumming as well, or where
> any inconsistencies could get into the filesystem's meta-data, that
> couldn't already come in for checksummed+CoWed data and/or non-
> checksummed+CoWed data?
It can't (AFAICS) lead to any more _actual_ corruption, but it very much 
can lead to more false positives in the error detection, which is by 
definition a regression.
>
>>> Well, for PostgreSQL it's still fairly new (9.3, as I've said
>>> above, ht
>>> tps://wiki.postgresql.org/wiki/What%27s_new_in_PostgreSQL_9.3#Data_
>>> Chec
>>> ksums), but it's not done per default (http://www.postgresql.org/do
>>> cs/c
>>> urrent/static/app-initdb.html), and they warn about a noticable
>>> performance benefit (though I have of course no data whether this
>>> would
>>> be better/similar/worse to what is implied by btrfs checksumming).
>>>
>>> I've tried to find something for MySQL/MariaDB, but the only thing
>>> I
>>> could find there was: CHECKSUM TABLE
>>> But that seems to be a SQL command, i.e. not on-read checksumming
>>> as
>>> we're talking about, but rather something the application/admin
>>> would
>>> need to do manually.
>> I actually had been referring to this, with the assumption that the
>> application would use it to verify it's own data.  I hadn't realized
>> PostgreSQL had in-line support for it.
> Well but the (fairly new) in-line support, is the only thing that we
> can really count here.
>
> What mysql does is that it requires the app to do it.
> a) It's likely that there are many apps which don't use this (maybe
> simply because they don't know it) and it's unlikely they'll all
> change.
> While what we can do at the btrfs level (or what postgresql would do)
> works out of the box for everything.
It works out of the box for everything, but it's also sub-optimal 
protection for almost everything that actually requires data integrity.
>
> b) I may simply not very well understand the CHECKSUM TABLE, but to me
> it doesn't seem useful to provide data integrity in the sense we're
> talking here about (i.e. silent block errors, bus errors, etc.)
> Why?
> First, it seems to checksum the whole data of the whole table, and
> AFAICS, it uses only CRC32... given that such tables may be easily GiB
> in size, CRC32 is IMHO simply not good enough.
> Postgresql/btrfs in turn do the checksums on much smaller amounts of
> data.
> Second, verification seems to only take place when that command is
> called. I'm not sure whether it implies locking the table in memory
> then (didn't dig too deep), but I can't believe it would, which system
> could keep a 100 GiB table in mem?
> So it seems to be basically a one-shot verification, not covering any
> corruptions that happen in between.
> In fact, the documentation of the function even tells that this is for
> backups/rollbacks/etc. only... so it's absolutely not that kind of data
>   integrity protection we're talking about (and even for that purpose,
> CRC32 seems to be a poor choice).
>
>
>> VDI is still widely used, because it's the default for Virtual Box
>> when
>> creating a VM.
> Guess I just disbelieved that VirtualBox is still widely used O;-)
On a commercial level, it really isn't (I don't even think that Oracle
uses it internally any more).  On a personal level, it very much is,
because too many people are too stupid because of Windows to learn to
use stuff like QEMU or Xen.
>
>
>>    VHD is way more widely used than it should be, solely
>> because there are insane people out there using Windows as a
>> virtualization host.  You also forgot VMDK, which is what VMWare uses
>> almost exclusively, but I don't think it has built-in checksumming.
>>
>> As for Xen, the BCP are to avoid using image files like the plague,
>> and
>> use disks directly instead (or more commonly, use either LVM, or ZFS
>> with zvols).
> Anyway... what it comes down to: None of the VM image formats seem to
> support checksumming.
>
>
>>> So given all that, the picture looks a bit different again, I
>>> think.
>>> None of major FLOSS DBs doesn't do any checksumming per default,
>>> MySQL
>>> doesn't seem to support it, AFAICT. No VM image format seems to
>>> even
>>> support it.
>> Again, most of my intent in referring to those was that the
>> application
>> or the Guest OS would do the verification itself.
> I've answered that above already, IIRC (our mails get too lengthy O:-)
> ).
> The guest OS doesn't verify more than what our typical host OS (Linux)
> does.
> And that (except when btrfs with CoWed data is used ;-) does filesystem
> integrity verification - which is however not data integrity
> verification.
>
>
> btw: That makes me think about something interesting:
> If btrfs will ever support checksumming on no-CoWed data... then the
> documentation should describe, that depending on the actual scenario,
> it may make sense that btrfs filesystems inside the guest run generally
> with nodatasum.
> The idea begin: Why verifying twice?
That gets to be a particularly dangerous recommendation, because lots of 
people (who arguably shouldn't be messing around with such stuff to 
begin with) will likely think it means that they can turn it off 
unconditionally in the guest system, which really isn't safe for 
anything that might be moved to some other FS.
>
> The constraints being, AFAICS, the following:
> - If the VM image is ever to be moved off the host's btrfs image
> (which
>    would have checksumming enabled) to a fs without checksumming or if
>    it would be ever copied remotely, than not having the "internal"
>    checksumming (i.e. from the btrfs filesystems inside the gues),
> would
>    make one loose the integrity protection
> - It further would only work, if the hypervisor itself, would properly
>    pass on any IO errors when it reads from the image files in the
>    host's btrfs, to block IO errors for the guest. If it wouldn't, then
>    the guest (with disabled "internal" checksumming) wouldn't probably
>    notice any data integrity errors, which he would, if the "internal"
>    checksumming wasn't turned off.
>
>
>> If the application doesn't have that type of thing built in, then
>> that's
>> not something the filesystem should be worrying about, that's the job
>> of
>> the application developers to deal with.
> No.
>
> If you see it like that, you could as well drop data checksumming in
> btrfs completely.
> You'd anyway argue that it would be the applications duty to do that.
No, BTRFS's job is to verify that the data it returns matches what it 
was given in the first place.  That is not reliably possible without 
having COW semantics on data blocks.
>
> The same way you could argue, that your MP4, JPG, ISO image or whatever
> you downloaded via bittorrent needs to contain checksum data, and that
> the actual application (which is not bittorrent, but e.g. mplayer,
> imagemagick or wodim) need to verify these.
> However this is not the case.
ISO 9660 includes built-in ECC, it would be impractical for usage on 
removable optical media otherwise.  JPEG and MP4 are irrelevant because 
in both cases, the average person can't detect corruption caused by 
single bit errors.  Bittorrent itself properly verifies the downloads 
like it should.
> In fact, it's quite usual and proper to have the lower layer handle
> stuff which the higher layer don't have any real direct interest in
> (except for the case, that the lower layer doesn't do it).
Except that data integrity is obviously something the higher layers _do_ 
have interest in.
>
>
>> The point of a filesystem is
>> to store data within the integrity guarantees provided by the
>> hardware,
>> possibly with some additional protection
> If you're convinced by that, you should probably propose that btrfs
> removes data checksumming altogether.
> I guess you won't make much friends with that idea ;)
I think you missed the bit about 'possibly with some additional 
protection'.  I really could have worded that better, but that's 
somewhat irrelevant.
>
>
>
>> In the case of stuff like torrents and such, all the good software
>> for
>> working with them has an option to verify the file after downloading.
> Not sure what you mean with "and such", if it's again VMs, and DBs (the
> IMHO actually more important use case than file sharing), I showed you
> in the last mail, that non of these do any verification by default, and
>   half of the important ones don't even support it, with nothing on the
> horizon that this would change (probably because they argue just the
> other way round than you do: the fs should handle data integrity, and
> ZFS and btrfs give them partially right ;-) ).
> And I'm not sure with torrents, but I'd have suspected once the file's
> downloaded completely, any checksumming data is no longer kept.
If you keep the torrent file (even if it's kept loaded in the software), 
you have the checksum, as that's part of the identifier that is used to 
fetch the file.
> If my guess is correct, even the torrent software doesn't really do
> overall data integrity protection, but just until the download is
> finished; at least this used to be the case with the other P2P network
> softwares.
Any good torrent software will let you verify the download after the 
fact, assuming you still have the torrent running (because if the 
verification fails, it will then go and re-download the failed blocks).
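
For reference, after-the-fact verification of this kind boils down to re-hashing each fixed-size piece and comparing it with the hash recorded in the torrent metadata. The following is a minimal sketch assuming SHA-1 piece hashes as in the original BitTorrent metainfo format; parsing the .torrent file is left out, and the function names are hypothetical.

```python
import hashlib
from typing import List

def piece_hashes(data: bytes, piece_length: int) -> List[bytes]:
    """SHA-1 digest of each consecutive piece of `data`."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]

def bad_pieces(data: bytes, piece_length: int, expected: List[bytes]) -> List[int]:
    """Indices of pieces whose hash no longer matches the recorded one."""
    actual = piece_hashes(data, piece_length)
    # zip() stops at the shorter list; a length mismatch is not handled here.
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

original = bytes(range(256)) * 4          # four 256-byte pieces
recorded = piece_hashes(original, 256)    # what the metadata would carry
damaged = bytearray(original)
damaged[300] ^= 0x01                      # flip one bit inside piece 1
print(bad_pieces(bytes(damaged), 256, recorded))  # [1]
```

The indices returned are exactly the pieces a client would have to fetch again, which is why this only helps while the torrent metadata is still around.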
>
>
>
> Thanks for the discussion so far :)
> It actually made me just more confident that no-CoWed data checksumming
> should work and is actually needed ;)
If you're so convinced it's necessary, C is not hard to learn, and patches
would go a long way towards getting this in the kernel.  Whether I agree
with it or not, if patches get posted, I'll provide the same degree of
review as I would for any other feature (and even give my Tested-by
assuming it passes xfstests, and the various edge-cases not in xfstests
that I throw at anything I test).


* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-21 13:36         ` Austin S. Hemmelgarn
@ 2015-12-22  9:12           ` Duncan
  2015-12-22 12:16             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2015-12-22  9:12 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as
excerpted:

> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:

>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:

>>> nodatacow only [avoids fragmentation] if the file is
>>> pre-allocated, if it isn't, then it still ends up fragmented.

>> Hmm is that "it may end up fragmented" or a "it will definitely? Cause
>> I'd have hoped, that if nothing else had been written in the meantime,
>> btrfs would perhaps try to write next to the already allocated blocks.

> If there are multiple files being written, then there is a relatively
> high probability that they will end up fragmented if they are more than
> about 64k and aren't pre-allocated.

Does the 30-second-by-default commit window (and the similarly
30-second-default dirty-flush time at the VFS level) modify this at all?
It has been my assumption that same-file writes accumulated during this
time should merge, increasing efficiency and decreasing fragmentation
(both with and without nocow), though of course further writes outside
this 30-second window will likely trigger fragmentation if other files
have been written in parallel or in the meantime.
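
For reference, the pre-allocation mentioned in the quoted exchange can be
sketched as follows.  This is only an illustration, assuming a Linux
system where Python's os.posix_fallocate() is available; the helper
name and the file name are hypothetical, and setting the nodatacow
attribute on the file is not shown:

```python
import os
import tempfile


def create_preallocated(path: str, size: int) -> int:
    """Create `path`, reserve `size` bytes up front, return the file size."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Ask the filesystem to allocate the whole range before any data
        # is written, so later writes land in already-reserved space.
        os.posix_fallocate(fd, 0, size)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


# Hypothetical usage: reserve 1 MiB for a file in a scratch directory.
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "image.raw")
    print(create_preallocated(target, 1 << 20))
```

Whether the reserved range ends up contiguous is still up to the
allocator; the sketch only shows the reservation step itself.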

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: dear developers, can we have notdatacow + checksumming, plz?
  2015-12-22  9:12           ` Duncan
@ 2015-12-22 12:16             ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 12+ messages in thread
From: Austin S. Hemmelgarn @ 2015-12-22 12:16 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 2015-12-22 04:12, Duncan wrote:
> Austin S. Hemmelgarn posted on Mon, 21 Dec 2015 08:36:02 -0500 as
> excerpted:
>
>> On 2015-12-16 21:09, Christoph Anton Mitterer wrote:
>
>>> On Tue, 2015-12-15 at 11:00 -0500, Austin S. Hemmelgarn wrote:
>
>>>> nodatacow only [avoids fragmentation] if the file is
>>>> pre-allocated, if it isn't, then it still ends up fragmented.
>
>>> Hmm is that "it may end up fragmented" or a "it will definitely"? Cause
>>> I'd have hoped, that if nothing else had been written in the meantime,
>>> btrfs would perhaps try to write next to the already allocated blocks.
>
>> If there are multiple files being written, then there is a relatively
>> high probability that they will end up fragmented if they are more than
>> about 64k and aren't pre-allocated.
>
> Does the 30-second-by-default commit window (and the similarly
> 30-second-default dirty-flush time at the VFS level) modify this at all?
> It has been my assumption that same-file writes accumulated during this
> time should merge, increasing efficiency and decreasing fragmentation
> (both with and without nocow), though of course further writes outside
> this 30-second window will likely trigger fragmentation if other files
> have been written in parallel or in the meantime.
>
I think it does, but not much, and it depends on the workload.  I do 
notice less fragmentation on the filesystems where I increase the commit 
window, and more on the ones where I decrease it, but the difference is 
pretty small as long as you use something reasonable (I've never tested 
anything higher than 300 seconds, and I rarely go above 60).  My guess, 
based on what the commit window is for (namely, it's the amount of time 
the log tree gets updated before a transaction commit is forced), is 
that it has less effect if stuff is regularly calling fsync().
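
To see which commit window a given filesystem is using, something like
the following sketch could be run against the contents of /proc/mounts.
It assumes the window is set with the commit=<seconds> mount option and
that the default of 30 applies when the option is absent; the function
names and the sample mount table are made up for illustration:

```python
def commit_interval(mount_options: str, default: int = 30) -> int:
    """Return the commit= value from a comma-separated option string."""
    for option in mount_options.split(","):
        key, sep, value = option.partition("=")
        if key == "commit" and sep:
            return int(value)
    return default


def btrfs_commit_intervals(mounts_text: str) -> dict:
    """Map each btrfs mount point in /proc/mounts-style text to its window."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 4 and fields[2] == "btrfs":
            result[fields[1]] = commit_interval(fields[3])
    return result


# Hypothetical mount table; on a real system read /proc/mounts instead.
sample = (
    "/dev/sda1 / btrfs rw,noatime,commit=60,space_cache 0 0\n"
    "/dev/sdb1 /data btrfs rw,relatime 0 0\n"
    "proc /proc proc rw 0 0\n"
)
print(btrfs_commit_intervals(sample))  # -> {'/': 60, '/data': 30}
```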


Thread overview: 12+ messages
2015-12-14  4:59 dear developers, can we have notdatacow + checksumming, plz? Christoph Anton Mitterer
2015-12-14  6:42 ` Russell Coker
2015-12-15  1:02   ` Christoph Anton Mitterer
2015-12-14 14:16 ` Austin S. Hemmelgarn
2015-12-15  3:15   ` Christoph Anton Mitterer
2015-12-15 16:00     ` Austin S. Hemmelgarn
2015-12-16  9:15       ` Duncan
2015-12-16  9:55       ` Duncan
2015-12-17  2:09       ` Christoph Anton Mitterer
2015-12-21 13:36         ` Austin S. Hemmelgarn
2015-12-22  9:12           ` Duncan
2015-12-22 12:16             ` Austin S. Hemmelgarn
