From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: What is the vision for btrfs fs repair?
Date: Sun, 12 Oct 2014 23:59:52 +0000 (UTC)
Message-ID: <pan$9d644$428672a3$85285a3a$ab3729f1@cox.net>
In-Reply-To: 2313804.P0rE2GFdbV@merkaba
Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:
> I always thought with a controller and device and driver combination
> that honors fsync with BTRFS it would either be the new state or the
> last known good state *anyway*. So where does the need to rollback arise
> from?
My understanding here is...
With btrfs a full-tree commit is atomic. You should get either the old
tree or the new tree. However, due to the cascading nature of updates on
cow-based structures, these full-tree commits are done by default
(there's a mount-option to adjust it) every 30 seconds. Between these
atomic commits partial updates may have occurred. The btrfs log (the one
that btrfs-zero-log kills) is limited to between-commit updates, and thus
to at most 30 seconds' worth (by default) of changes since the last
full-tree atomic commit.
In addition to that, there's a history of tree-root commits kept (with
the superblocks pointing to the last one). Btrfs-find-root can be
used to list this history. The recovery mount option simply allows btrfs
to fall back to this history, should the current root be corrupted.
Btrfs restore can be used to list tree roots as well, and can be pointed
at an appropriate one if necessary.
Fsync forces the file and its corresponding metadata update to the log
and barring hardware or software bugs should not return until it's safely
in the log, but I'm not sure whether it forces a full-tree commit.
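For reference, here's a rough sketch of what "calling fsync" looks like from the application side. It's Python purely for illustration, durable_write is a made-up helper name, and it says nothing about what btrfs does internally (log write vs. full-tree commit); it only shows the point at which the application is entitled to consider the data safe. The directory fsync is the usual POSIX-side precaution so the new name is durable too, and it assumes a Linux-like system where a directory can be opened read-only and fsynced.

```python
import os


def durable_write(path, data):
    """Write data to path and do not return until fsync has completed.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        # Ask the kernel to push file data and metadata to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Also fsync the containing directory so the directory entry itself
    # is durable (assumes a Linux-like system).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Only after durable_write returns should the application tell anyone else (the other side of a transaction, say) that the data is committed.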
Either way the guarantees should be the same. If the log can be replayed
or a full-tree commit has occurred since the fsync, the new copy should
appear. If it can't, the rollback to the last atomic tree commit should
return an intact copy of the file from that point. If the recovery mount
option is used and a further rollback to an earlier full-tree commit is
forced, provided it existed at the point of that full-tree commit, the
intact file at that point should appear.
So if the current tree root is a good one, the log will replay the last
up to 30 seconds of activity on top of that last atomic tree root. If the
current tree root itself is corrupt, the recovery mount option will let
an earlier one be used. Obviously in that case the log will be discarded
since it applies to a later tree root that itself has been discarded.
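To make that decision logic concrete, here's a toy model of it as I understand it. This is NOT btrfs code; Commit, pick_state and the generation numbers are all made up for illustration, and the real kernel logic has far more cases.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Commit:
    """A full-tree commit in the toy model (hypothetical type)."""
    generation: int
    intact: bool


def pick_state(roots: List[Commit],
               log_replayable: bool,
               allow_recovery: bool) -> Optional[Tuple[int, bool]]:
    """Return (generation mounted, whether the log was replayed).

    roots is ordered newest first. None means the mount fails and the
    admin has to step in.
    """
    if not roots:
        return None
    newest = roots[0]
    if newest.intact:
        # Good current root: replay the log on top of it if possible.
        return (newest.generation, log_replayable)
    if not allow_recovery:
        # Current root is bad and nobody asked for a fallback.
        return None
    for older in roots[1:]:
        if older.intact:
            # The log belonged to the discarded newer root, so it goes too.
            return (older.generation, False)
    return None
```

With roots of generation 102 (corrupt), 101 and 100 (both intact), pick_state(roots, True, False) fails the mount, while pick_state(roots, True, True) falls back to generation 101 without the log.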
The debate is whether recovery should be automated so the admin doesn't
have to care about it, or whether having to manually add that option
serves as a necessary notifier to the admin that something /did/ go
wrong, and that an earlier root is being used instead, so more than a few
seconds' worth of data may have disappeared.
As someone else has already suggested, I'd argue that as long as btrfs
continues to be under the sort of development it's in now, keeping
recovery as a non-default option is desired. Once it's optimized and
considered stable, arguably recovery should be made the default, perhaps
with a no-recovery option for those who prefer that in-the-face
notification in the form of a mount error, if btrfs would otherwise fall
back to an earlier tree root commit.
What worries me, however, is that IMO the recent warning stripping was
premature. Btrfs is certainly NOT fully stable or optimized for normal
use at this point. We're still using the even/odd PID balancing scheme
for raid1 reads, for instance, and multi-device writes are still
serialized when they could be parallelized to a much larger degree (tho
keeping some serialization is arguably good for data safety). Arguably
optimizing that now would be premature optimization since the code itself
is still subject to change, so I'm not complaining, but by that very same
token, it *IS* still subject to change, which by definition means it's
*NOT* stable, so why are we removing all the warnings and giving the
impression that it IS stable?
The decision wasn't mine to make and I don't know, but while it's a nice
suggestion, making recovery-by-default a measure of when btrfs goes
stable simply won't work, because surely the same folks behind the
warning stripping would then ensure that this indicator, too, said btrfs
was stable, while the state of the code itself continues to say otherwise.
Meanwhile, if your distributed transactions scenario doesn't account for
crash and loss of data on one side with real-time backup/redundancy, such
that loss of a few seconds' worth of transactions on a single local
filesystem is going to kill the entire scenario, I don't think too much
of that scenario in the first place, and regardless, btrfs, certainly in
its current state, is definitely NOT an appropriate base for it. Use
appropriate tools for the task. Btrfs at least at this point is simply
not an appropriate tool for that task.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman