From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Erkki Seppala <flux-btrfs@inside.org>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs autodefrag?
Date: Mon, 19 Oct 2015 07:56:51 -0400
Message-ID: <5624DA83.40200@gmail.com>
In-Reply-To: <m49zizfbcsb.fsf@coffee.modeemi.fi>
On 2015-10-19 02:19, Erkki Seppala wrote:
> Hugo Mills <hugo@carfax.org.uk> writes:
>> It has to be disabled because if you enable it, there's a race
>> condition: since you're overwriting existing data (rather than CoWing
>> it), you can't update the checksums atomically. So, in the interests
>> of consistency, checksums are disabled.
>
> I suppose this has been suggested before, but couldn't it store both the
> new and the old checksums and be satisfied if either of them match?
Actually, I don't think that's been suggested before. Read on, however,
for an explanation of why we don't do that.
>
> The user is probably not happy that a partial write is going to be
> difficult to read from the device due to a checksum error, but there is
> no promise of recently-overwritten data state with traditional
> filesystems either in case of sudden powerdown, assuming there is no
> data journaling..
And that is exactly the case with how things are now: when something is
marked NOCOW, it has essentially zero guarantee of data consistency
after a crash. As things are now, though, there is at least a guarantee
that you can still read the file. Using checksums the way you suggest
would instead leave a block that was being overwritten at the time of
the crash unreadable in most cases, because a partially written block
matches neither the old nor the new checksum. And without COW, it's
statistically unlikely that we wrote the _whole_ block (IOW, we can't
guarantee that the data was completely written), because:
a. While some disks do write single sectors atomically, most don't, and
if the power dies while the disk is writing a sector, there is no
certainty about what that sector will read back as.
b. Even if item a is not an issue, one block in BTRFS usually spans
multiple sectors on disk, and a majority of disks have volatile write
caches, so it is not unlikely that the power will die partway through
writing the block.
c. Even if neither item a nor item b is an issue (for example, you
have a storage controller with a non-volatile write cache, have write
caching turned off on the disks, and the controller is smart enough to
only remove writes from its cache after they complete), there is still
the small but distinct possibility that the crash will cause corruption
in the write cache or some other hardware-related issue.
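To make the point about partial writes concrete, here is a minimal
Python sketch of the scheme suggested above (store both the old and the
new checksum, accept the block if either matches). It is illustrative
only and is not how BTRFS is implemented. It assumes a 4 KiB block made
of eight 512-byte sectors, models only item b (each sector either lands
completely or not at all), and uses zlib.crc32 as a stand-in for the
crc32c checksum BTRFS uses. The names torn_write and dual_checksum_ok
are hypothetical.

```python
import zlib

BLOCK_SIZE = 4096   # assumed BTRFS block size, for illustration only
SECTOR_SIZE = 512   # assumed disk sector size, for illustration only


def checksum(data: bytes) -> int:
    # zlib.crc32 stands in for BTRFS's crc32c, which is not in the stdlib
    return zlib.crc32(data)


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Hypothetical model of item b: power fails after only the first
    `sectors_written` sectors of the new block reached the disk."""
    cut = sectors_written * SECTOR_SIZE
    return new[:cut] + old[cut:]


def dual_checksum_ok(on_disk: bytes, old_csum: int, new_csum: int) -> bool:
    """The proposed scheme: accept the block if either stored checksum matches."""
    c = checksum(on_disk)
    return c == old_csum or c == new_csum


old = bytes([0xAA]) * BLOCK_SIZE   # block contents before the overwrite
new = bytes([0x55]) * BLOCK_SIZE   # block contents the overwrite intended
old_csum, new_csum = checksum(old), checksum(new)

sectors_per_block = BLOCK_SIZE // SECTOR_SIZE
for n in range(sectors_per_block + 1):
    block = torn_write(old, new, n)
    # Report whether the block would still be readable after the crash
    print(n, dual_checksum_ok(block, old_csum, new_csum))
```

Running it, only the fully old (0 sectors written) and fully new (8
sectors written) states pass; every torn state in between matches
neither checksum, so the block would be unreadable. Under item a, a torn
sector can read back as arbitrary data, which fails the check the same
way.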
Thread overview: 10+ messages
2015-10-17 16:36 btrfs autodefrag? Xavier Gnata
2015-10-18 5:46 ` Duncan
2015-10-18 12:44 ` Xavier Gnata
2015-10-19 6:04 ` Paul Harvey
2015-10-18 14:24 ` Rich Freeman
2015-10-18 14:40 ` Hugo Mills
2015-10-19 6:19 ` Erkki Seppala
2015-10-19 11:56 ` Austin S Hemmelgarn [this message]
2015-10-19 16:13 ` Erkki Seppala
2015-10-19 19:48 ` Austin S Hemmelgarn