On 2015-10-19 02:19, Erkki Seppala wrote:
> Hugo Mills <hugo@carfax.org.uk> writes:
>>     It has to be disabled because if you enable it, there's a race
>> condition: since you're overwriting existing data (rather than CoWing
>> it), you can't update the checksums atomically. So, in the interests
>> of consistency, checksums are disabled.
>
> I suppose this has been suggested before, but couldn't it store both the
> new and the old checksums and be satisfied if either of them match?
Actually, I don't think that's been suggested before. Read on, however,
for an explanation of why we don't do that.
>
> The user is probably not happy that a partial write is going to be
> difficult to read from the device due to a checksum error, but there is
> no promise of recently-overwritten data state with traditional
> filesystems either in case of sudden powerdown, assuming there is no
> data journaling..
And that is exactly how things are now: when something is marked NOCOW,
it has essentially zero guarantee of data consistency after a crash.
As things are now, though, there is a guarantee that you can still read
the file. Using checksums like you suggest would result in it being
unreadable most of the time after a crash, because it's statistically
unlikely that we wrote the _whole_ block (IOW, without COW we can't
guarantee that the data was completely written). The reasons are:
a. While some disks do write single sectors atomically, most don't, and
if the power dies while the disk is writing a single sector, there is
no certainty about what that sector will read back as.
b. Assuming that item a is not an issue, one block in BTRFS is usually
multiple sectors on disk, and a majority of disks have volatile write
caches, so it is not unlikely that the power will die partway through
writing the block.
c. In the event that neither item a nor item b is an issue (for example,
you have a storage controller with a non-volatile write cache, have
write caching turned off on the disks, and it's a smart enough storage
controller that it only removes writes from the cache after they
return), there is still the small but distinct possibility that the
crash will cause either corruption in the write cache or some other
hardware-related issue.
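
To make item b concrete, here is a rough sketch (a toy model, not BTRFS
code) of the "accept either checksum" rule applied to an in-place
overwrite that gets interrupted partway through. The 512-byte sector,
the 4096-byte block, the use of zlib's plain CRC-32 as the checksum, and
the helper names torn_write() and readable() are all assumptions made
for illustration. It also assumes sectors land on disk in order, which
real hardware does not promise.

```python
import zlib

SECTOR = 512   # assumed sector size in bytes
BLOCK = 4096   # assumed filesystem block size in bytes


def csum(data: bytes) -> int:
    # Stand-in checksum: plain CRC-32 from zlib, chosen only for illustration.
    return zlib.crc32(data)


def torn_write(old: bytes, new: bytes, sectors_done: int) -> bytes:
    # Model an in-place overwrite interrupted after `sectors_done` sectors:
    # the first part of the block holds new data, the rest still holds old data.
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]


def readable(on_disk: bytes, old_csum: int, new_csum: int) -> bool:
    # The proposed rule: accept the block if either stored checksum matches.
    return csum(on_disk) in (old_csum, new_csum)


old = b"\x00" * BLOCK
new = b"\xff" * BLOCK
old_csum, new_csum = csum(old), csum(new)

sectors = BLOCK // SECTOR
results = {
    n: readable(torn_write(old, new, n), old_csum, new_csum)
    for n in range(sectors + 1)
}

for n, ok in results.items():
    state = "readable" if ok else "checksum error"
    print(f"{n}/{sectors} sectors written: {state}")
```

In this model only the two endpoints (nothing written, everything
written) pass the check. Every intermediate state matches neither
checksum, so the block would be reported as unreadable even though,
without checksums, the same mixed contents could still be read back.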