Re: btrfs autodefrag? - Austin S Hemmelgarn

From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Erkki Seppala <flux-btrfs@inside.org>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs autodefrag?
Date: Mon, 19 Oct 2015 15:48:05 -0400
Message-ID: <562548F5.4050301@gmail.com>
In-Reply-To: <m49si56bzv3.fsf@coffee.modeemi.fi>

On 2015-10-19 12:13, Erkki Seppala wrote:
> Austin S Hemmelgarn <ahferroin7@gmail.com> writes:
>
>> And that is exactly the case with how things are now, when something
>> is marked NOCOW, it has essentially zero guarantee of data consistency
>> after a crash.
>
> Yes. In addition to the zero guarantee of the data validity for the data
> being written into, btrfs also doesn't give any guarantees for the rest
> of the data, even if it was perfectly quiescent, but was just marked COW
> at the time it was written :).
Assuming you do actually mean COW and not NOCOW, there is a guarantee
that the data will either:
1. Match the original data prior to the write.
2. Match the data that was written.
or, if you are using only single copies of the metadata blocks and the 
system crashes exactly during a write to a metadata block:
3. Everything under that metadata block will become inaccessible, and 
require usage of btrfs-progs to recover.

In the case of NOCOW, however, there is absolutely no such guarantee
(just as ext4, for example, cannot provide one), and any of the above
could be the case, or any arbitrary portion of the new data could have
been written.
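
To make the difference concrete, here is a toy simulation of a crash in
the middle of a one-block write.  It is only a sketch under stated
assumptions: the function names and the 4k BLOCK constant are invented
for illustration, the NOCOW case is simplified to "some prefix of the
new data landed" (in reality any arbitrary portion could), and outcome
3 above (lost metadata) is not modelled at all.

```python
import random

BLOCK = 4096  # assumed block size for this illustration


def crash_during_cow_write(old: bytes, new: bytes, rng: random.Random) -> bytes:
    """COW: the new data goes to a fresh location, and the reference to it
    is only switched over once the whole write has completed."""
    completed = rng.random() < 0.5  # did the write finish before the crash?
    return new if completed else old


def crash_during_nocow_write(old: bytes, new: bytes, rng: random.Random) -> bytes:
    """NOCOW: the data is overwritten in place, so a crash can leave part of
    the new data followed by whatever was left of the old data."""
    written = rng.randrange(len(new) + 1)  # bytes that reached the disk
    return new[:written] + old[written:]


if __name__ == "__main__":
    rng = random.Random()
    old, new = b"A" * BLOCK, b"B" * BLOCK
    cow = crash_during_cow_write(old, new, rng)
    nocow = crash_during_nocow_write(old, new, rng)
    print("COW result is old or new:", cow in (old, new))
    print("NOCOW bytes of new data present:", nocow.count(b"B"))
```

The COW function can only ever return one of the two complete versions,
while the NOCOW one will usually return a mix of both.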
>>   As things are now though, there is a guarantee that
>> you can still read the file, but using checksums like you suggest
>> would result in it being unreadable most of the time, because it's
>> statistically unlikely that we wrote the _whole_ block (IOW, we can't
>> guarantee without COW that the data was completely written) because:
>
> Well, the amount of data being written at any given time is very small
> compared to the whole device. So it's not all the data that is at risk
> of having the wrong checksum. Given how small blocks are (4k) I really
> doubt that the likelihood of large amounts of data remaining unreadable
> would be great.
That very much depends on how you are using things.  For many of the
workloads that NOCOW should be used for, direct I/O and AIO are also
very commonly used, and those can write chunks much bigger than BTRFS's
block size in one go.
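
As a rough illustration of the scale involved, the following sketch
counts how many filesystem blocks a single write covers.  The helper
name is hypothetical, and the 1 MiB write size is just an example
figure, not something measured from a real workload.

```python
def blocks_touched(offset: int, length: int, block_size: int = 4096) -> int:
    """Number of filesystem blocks covered by a write of `length` bytes
    starting at byte `offset`."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


if __name__ == "__main__":
    # A single, block-aligned 1 MiB direct write against 4k blocks.
    print(blocks_touched(0, 1024 * 1024))
    # A two-byte write that straddles a block boundary.
    print(blocks_touched(4095, 2))
```

Every one of those blocks would have a checksum that could end up
mismatched if the write is interrupted part way through.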
>
> However, here's a compromise: when detecting an error on a COW file,
> instead of refusing to read it, produce a warning to the kernel log. In
> addition, when scrubbing it, the last resort after trying other copies
> the checksum could simply be repaired, paired with an appropriate log
> message. Such a log message would not indicate that the data is wrong,
> but that the system administrator might be interested in checking it,
> for example against backups, or by perhaps running a scrub within the
> virtual machine.
In this case I'm assuming you mean NOCOW instead of COW, as the 
corruption can't be detected in a NOCOW file by BTRFS.

In a significant majority of cases, it is actually better to return no
data than to return known-corrupted data (think medical or military
applications; in those kinds of cases it's quite often worse to act on
incorrect data than it is to not act at all).  Disk images for virtual
machines are one of the very few cases where this is not true, simply
because they can usually correct the corruption themselves.
>
> If the scrub would say everything is OK, then certainly everything would
> be OK.
That's a _very_ optimistic point of view to take, and doesn't take into 
account software bugs, or potential hardware problems.
>
>> a. While some disks do atomically write single sectors, most don't,
>> and if the power dies during the disk writing a single sector, there
>> is no certainty exactly what that sector will read back as.
>
> So it seems that the majority vote is to not to provide a feature to the
> minority.. :)
For something that provides a false sense of data safety and is 
potentially easy to shoot yourself in the foot with?  Yes, we will almost 
certainly not provide it.  If, however, you wish to write a patch to 
provide such a feature (or pay someone to do so for you), there is 
nothing stopping you from doing so, and if it's something that people 
actually want, then it will likely end up included.
>> b. Assuming that item a is not an issue, one block in BTRFS is usually
>> multiple sectors on disk, and a majority of disks have volatile write
>> caches, thus it is not unlikely that the power will die during the
>> process of writing the block.
>
> I'm not at all familiar with the on-disk structure of Btrfs, but it
> seems that indeed the block size is 16 kilobytes by default, so the risk
> of one of the four device-blocks (on modern 4kB-sector HDDs) being
> corrupted or only a set of them having being written is real. But,
> there's only so much data in-flight at any given time.
While the default is usually 16k, there are situations where it may be 
different, for example if the system has a page size greater than 16k 
(some ARM64, PPC, and MIPS systems use 64k pages), or if it's a small 
filesystem (in which case the blocks will be 4k).

It is also worth noting that while most 'modern' HDDs use 4k sectors:
1. They are still vastly outnumbered by older HDDs that use 512 byte 
sectors.
2. A significant percentage of them use 512 byte virtual sectors (that 
is, they expose a 512 byte sector based interface to the OS, but use 4k 
sectors internally, which has potentially dangerous implications if 
their firmware is not well written).
3. SSDs internally use much bigger block sizes (the smallest erase
block size that I've personally seen in an SSD is 1M; usually it's 2M or
4M).  The implications of this are pretty scary for cheap SSDs (and OCZ
SSDs, which are not by any means cheap) that don't include
super-capacitors to ensure that power loss in the middle of a write
won't interrupt the write.
4. I've heard rumors of some exotic ones out there that use 64k sectors 
on disk.
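
To put numbers on that, this small sketch computes how many device
sectors have to land on disk for one filesystem block to be complete,
using the block and sector sizes mentioned in this thread.  The function
name is made up for the example, and it deliberately ignores erase
blocks and anything the drive firmware does behind the interface.

```python
def sectors_per_block(block_size: int, sector_size: int) -> int:
    """Device sectors that must all be written for one block to be
    complete (rounded up, so a 64k sector holding a 16k block counts as 1)."""
    return -(-block_size // sector_size)  # integer ceiling division


if __name__ == "__main__":
    for block in (4096, 16384, 65536):
        for sector in (512, 4096, 65536):
            n = sectors_per_block(block, sector)
            print(f"{block:>6} byte block / {sector:>5} byte sector: {n}")
```

The more sectors a block spans, the more points there are at which a
power loss can leave that block only partially written.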
>
> I did read that there are two checksums (on Wikipedia,
> Btrfs#Checksum_tree..): one per block, and one per a contiguous run of
> allocated blocks. The latter checksum seems more likely to be broken,
> but I don't see why in that case the per-block checksums (or one of the
> two checksums I proposed) couldn't be referred to. This is of course
> because I don't understand much of the Btrfs on-disk format, technical
> feasibility be damned :).
>
> I understand that the metadata is always COW, so that level of
> corruption cannot occur.
Oh, it can occur in reality; it's just so improbable as to be a
_statistical_ impossibility.
>> c. In the event that both items a and b are not an issue (for example,
>> you have a storage controller with a non-volatile write cache, have
>> write caching turned off on the disks, and it's a smart enough storage
>> controller that it only removes writes from the cache after they
>> return), then there is still the small but distinct possibility that
>> the crash will cause either corruption in the write cache, or some
>> other hardware related issue.
>
> However, should this not be the case, for example when my computer is
> never brought down abruptly, it could still be valuable information to
> see that the data has not changed behind my back.
Well yes, but if that is the case, then you shouldn't be worrying about 
anything, as un-mounting the filesystem requires that there be no open 
files on it, and it explicitly flushes all the buffered writes in RAM 
out to disk.

On the other hand, if you're worried about your disk or other hardware
having issues, then you should be seriously considering verifying that
it works correctly, and replacing it if it doesn't.  Just using BTRFS
on it is not a safe or even remotely reliable way to detect hardware
failures.
>
> I understand it is the prime motivation behind btrfs scrubbing in any
> case; otherwise there could be a faster 'queue a verify after a write'
> that would never scrub the same data twice.
Actually, having the ability to tell it to verify a block after writing 
it would potentially be a very useful feature for unreliable hardware, 
assuming you're willing to take the performance penalty for the 
additional read on every write.
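
For what it's worth, the general idea can be sketched from userspace
like this.  This is not how BTRFS does (or would do) anything; it is a
hypothetical helper that assumes a POSIX system, uses zlib's CRC32 as a
stand-in checksum, and will most likely be satisfied from the page cache
rather than the disk, so a real implementation would have to bypass the
cache or live in the kernel.

```python
import os
import zlib


def write_and_verify(path: str, offset: int, data: bytes) -> bool:
    """Write `data` at `offset`, flush it, read it back, and compare
    checksums.  Returns True only if the read-back data matches."""
    expected = zlib.crc32(data)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        written = os.pwrite(fd, data, offset)
        if written != len(data):
            return False  # short write, treat as a failure
        os.fsync(fd)  # ask the kernel to push the data to the device
        # The extra read below is the performance penalty mentioned above.
        readback = os.pread(fd, len(data), offset)
    finally:
        os.close(fd)
    return zlib.crc32(readback) == expected
```

Calling it as write_and_verify("/path/to/file", 0, b"x" * 4096) doubles
the I/O for that block, which is the trade-off in question.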


