From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: how to best segment a big block device in resizeable btrfs filesystems?
Date: Sun, 8 Jul 2018 08:05:27 +0000 (UTC)
Message-ID: <pan$81eb5$e9127e5f$5b2e03ba$1abc9191@cox.net>
In-Reply-To: 293ab6d6-f609-0e9b-3d33-053336e43744@gmail.com
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:
> 03.07.2018 10:15, Duncan wrote:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>>
>>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>>
>> No.
>>
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.
>> But other than simply by odds not using them again immediately, btrfs
>> has no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>>
>>
> How is it relevant to "while writes are happening"? Will trimming old
> trees immediately after writes have stopped be any different? Why?
Define "while writes are happening" vs. "immediately after writes have
stopped". How soon is "immediately", and does the "writes stopped"
condition account for data that has reached the device's hardware write
buffer (so is no longer being transmitted to the device across the bus)
but has not actually been written to media?
On a reasonably quiescent system, multiple empty write cycles are likely
to have occurred since the last write barrier, and anything in-process is
likely to have made it to media even if software is missing a write
barrier it needs (software bug) or the hardware lies about honoring the
write barrier (hardware bug, allegedly sometimes deliberate on hardware
willing to gamble with your data, betting that a crash at a critical
moment is rare enough to be worth the improvement in normal-operation
performance metrics).
On an IO-maxed system, data and write barriers are coming down as fast as
the system can handle them, and write barriers become critical. If the
system crashes after something was supposed to have reached media but
didn't, either because of a missing write barrier or because the
hardware/firmware lied and reported the data covered by the barrier as
on-media when it wasn't, the btrfs atomic-cow commit guarantees of
consistent state at each commit go out the window.
At this point it becomes useful to have a number of previous "guaranteed
consistent state" roots to fall back on, with the /hope/ being that at
least /one/ of them is usably consistent. If all but the last one are
wiped due to trim...
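As a rough illustration of that point, here is a toy Python model. It is
not btrfs code: the function name, the four-slot history (chosen to match
the number of backup root slots in the btrfs superblock, as far as I
know), and the worst-case assumption that the device really does zero
trimmed blocks immediately are all mine.

```python
NUM_BACKUP_ROOTS = 4  # assumed size of the root history kept for fallback


def newest_usable_root(generations, trim_old_roots, newest_is_damaged):
    """Return the newest root generation a fallback mount could still read.

    generations       -- number of commits made so far (must be >= 1)
    trim_old_roots    -- True if trim/discard wiped every root but the current
    newest_is_damaged -- True if the current root never fully reached media
                         (for example a missed write barrier plus a crash)
    """
    # Only the most recent few generations are remembered, oldest first.
    history = list(range(1, generations + 1))[-NUM_BACKUP_ROOTS:]
    newest = history[-1]

    # Worst case: trimmed blocks are zeroed at once, so only the current
    # root survives a trim.
    readable = {newest} if trim_old_roots else set(history)

    if newest_is_damaged:
        readable.discard(newest)

    # Try roots newest-first, as a fallback to older roots would.
    for gen in reversed(history):
        if gen in readable:
            return gen
    return None
```

With ten commits, a damaged newest root and no trim, the model falls back
to generation 9. With the same damage after a trim, it returns None: there
is nothing left for usebackuproot to try.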
When the system isn't write-maxed the write will have almost certainly
made it regardless of whether the barrier is there or not, because
there's enough idle time to finish the current write before another one
comes down the pipe, so the last-written root is almost certain to be
fine regardless of barriers, and the history of past roots doesn't matter
even if there's a crash.
If "immediately after writes have stopped" is strictly defined as a
condition when all writes including the btrfs commit updating the current
root and the superblock pointers to the current root have completed, with
no new writes coming down the pipe in the meantime that might have
delayed a critical update if a barrier was missed, then trimming old
roots in this state should be entirely safe, and the distinction between
that state and the "while writes are happening" is clear.
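One way to approximate that strict state before a manual trim is to force
a commit and a flush first. The sketch below only builds and prints the
commands rather than running them; the helper names and the mountpoint are
hypothetical. Note that neither sync step stops new writers, and neither
can prove that a device which lies about barriers has drained its write
cache, so this narrows the window rather than closing it.

```python
import shlex


def quiesce_then_trim_commands(mountpoint):
    """Build the command sequence: commit, flush, then trim.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        # Ask btrfs to commit the filesystem at this path.
        ["btrfs", "filesystem", "sync", mountpoint],
        # Flush remaining dirty data system-wide.
        ["sync"],
        # Only now discard unused blocks, which includes old roots.
        ["fstrim", "-v", mountpoint],
    ]


def render(commands):
    """Render the commands as shell-quoted lines for review."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    # "/mnt/data" is a placeholder mountpoint, not taken from the thread.
    print(render(quiesce_then_trim_commands("/mnt/data")))
```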
But if "immediately after writes have stopped" is less strictly defined,
then the distinction between that state and "while writes are happening"
remains blurry at best, and having old roots around to fall back on in
case a write-barrier was missed (for whatever reason, hardware or
software) becomes a very good thing.
Of course the fact that trim/discard itself is an instruction written to
the device in the combined command/data stream complicates the picture
substantially. If those write barriers get missed, who knows what state
the new root is in, and if the old ones got erased... But again, on a
mostly idle system, it'll probably all "just work", because the writes
will likely all make it to media, regardless, because there's not a bunch
of other writes competing for limited write bandwidth and making ordering
critical.
>> In the context of the discard mount option, that can mean there's never
>> any old roots available ever, as they've already been cleaned up by the
>> hardware due to the discard option telling the hardware to do it.
>>
>> But even not using that mount option, and simply doing the trims
>> periodically, as done weekly by for instance the systemd fstrim timer
>> and service units, or done manually if you prefer, obviously
>> potentially wipes the old roots at that point. If the system's
>> effectively idle at the time, not much risk as the current commit is
>> likely to represent a filesystem in full stasis, but if there's lots of
>> writes going on at that moment *AND* the system happens to crash at
>> just the wrong time, before additional commits have recreated at least
>> a bit of root history, again, you'll potentially be left without any
>> old roots for the usebackuproot mount option to try to fall back to,
>> should it actually be necessary.
>>
>>
> Sorry? You are just saying that "previous state can be discarded before
> new state is committed", just more verbosely.
No, it's more that the new state gets committed before the old is trimmed,
but should the new state turn out to be unusable (due to missing write
barriers, etc., which is more of an issue on a write-bottlenecked system),
having a history of old roots/states to fall back to can be very useful.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman