linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Chris Murphy <lists@colorremedies.com>
To: Adam Borowski <kilobyte@angband.pl>
Cc: Chris Murphy <lists@colorremedies.com>,
	"Austin S. Hemmelgarn" <ahferroin7@gmail.com>,
	Marat Khalili <mkh@rqc.ru>, Dave <davestechshop@gmail.com>,
	Linux fs Btrfs <linux-btrfs@vger.kernel.org>,
	Fred Van Andel <vanandel@gmail.com>
Subject: Re: Problem with file system
Date: Sat, 4 Nov 2017 11:14:33 -0600	[thread overview]
Message-ID: <CAJCQCtReVi8k+OoitBBUZm_T+mj9dkV6GCVs7JMSG1myR78AGw@mail.gmail.com> (raw)
In-Reply-To: <20171104044634.thg7mnchm4hvzdic@angband.pl>

On Fri, Nov 3, 2017 at 10:46 PM, Adam Borowski <kilobyte@angband.pl> wrote:
> On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
>> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>> > If you're running on an SSD (or thinly provisioned storage, or something
>> > else which supports discards) and have the 'discard' mount option enabled,
>> > then there is no backup metadata tree (this issue was mentioned on the list
>> > a while ago, but nobody ever replied),
>>
>>
>> This is a really good point. I've been running discard mount option
>> for some time now without problems, in a laptop with Samsung
>> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
>>
>> However, just trying btrfs-debug-tree -b on a specific block address
>> for any of the backup root trees listed in the super, only the current
>> one returns a valid result.  All others fail with checksum errors. And
>> even the good one fails with checksum errors within seconds as a new
>> tree is created, the super updated, and Btrfs considers the old root
>> tree disposable and subject to discard.
>>
>> So absolutely if I were to have a problem, probably no rollback for
>> me. This seems to totally obviate a fundamental part of Btrfs design.
>
> How is this an issue?  Discard is issued only once we're positive there's no
> reference to the freed blocks anywhere.  At that point, they're also open
> for reuse, thus they can be arbitrarily scribbled upon.

If it's not an issue, then no one should ever need those backup slots
in the super and we should just remove them.

But in fact, we know people end up situations where they're needed for
either automatic recovery at mount time or explicitly calling
--usebackuproot. And in some cases we're seeing users using discard
who have a borked root tree, and none of the backup roots are present
so they're fucked. Their file system is fucked.

Now again, maybe this means the hardware is misbehaving, and honored
the discard out of order, and did that and wrote the new supers before
it had completely committed all the metadata? I have no idea, but the
evidence is present in the list that some people run into this and
when they do the file system is beyond repair even t hough it can
usually be scraped with btrfs restore.


> Unless your hardware is seriously broken (such as lying about barriers,
> which is nearly-guaranteed data loss on btrfs anyway), there's no way the
> filesystem will ever reference such blocks.  The corpses of old trees that
> are left lying around with no discard can at most be used for manual
> forensics, but whether a given block will have been overwritten or not is
> a matter of pure luck.

File systems that overwrite, are hinting the intent in the journal
what's about to happen. So if there's a partial overwrite of metadata,
it's fine. The journal can help recover. But Btrfs without a journal,
has a major piece of information required to bootstrap the file system
at mount time, that's damaged, and then every backup has been
discarded. So it actually makes Btrfs more fragile than other file
systems in the same situation.



>
> For rollbacks, there are snapshots.  Once a transaction has been fully
> committed, the old version is considered gone.

Yeah well snapshots do not cause root trees to stick around.


>
>>  because it's already been discarded.
>> > This is ideally something which should be addressed (we need some sort of
>> > discard queue for handling in-line discards), but it's not easy to address.
>>
>> Discard data extents, don't discard metadata extents? Or put them on a
>> substantial delay.
>
> Why would you special-case metadata?  Metadata that points to overwritten or
> discarded blocks is of no use either.

I would rather lose 30 seconds, 1 minute, or even 2 minutes of writes,
than lose an entire file system. That's why.

Anyway right now I consider discard mount option fundamentally broken
on Btrfs for SSDs. I haven't tested this on LVM thinp, maybe it's
broken there too.

Even fstrim leaves a tiny window open for a few minutes every time it
gets called, where if the root tree is corrupted for any reason,
you're fucked because all the backup roots are already gone.

-- 
Chris Murphy

  parent reply	other threads:[~2017-11-04 17:14 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-24 15:27 Problem with file system Fred Van Andel
2017-04-24 17:02 ` Chris Murphy
2017-04-25  4:05   ` Duncan
2017-04-25  0:26 ` Qu Wenruo
2017-04-25  5:33   ` Marat Khalili
2017-04-25  6:13     ` Qu Wenruo
2017-04-26 16:43       ` Fred Van Andel
2017-10-30  3:31         ` Dave
2017-10-30 21:37           ` Chris Murphy
2017-10-31  5:57             ` Marat Khalili
2017-10-31 11:28               ` Austin S. Hemmelgarn
2017-11-03  7:42                 ` Kai Krakow
2017-11-03 11:33                   ` Austin S. Hemmelgarn
2017-11-03 22:03                 ` Chris Murphy
2017-11-04  4:46                   ` Adam Borowski
2017-11-04 12:00                     ` Marat Khalili
2017-11-04 17:14                     ` Chris Murphy [this message]
2017-11-06 13:29                       ` Austin S. Hemmelgarn
2017-11-06 18:45                         ` Chris Murphy
2017-11-06 19:12                           ` Austin S. Hemmelgarn
2017-11-04  7:26             ` Dave
2017-11-04 17:25               ` Chris Murphy
2017-11-07  7:01                 ` Dave
2017-11-07 13:02                   ` Austin S. Hemmelgarn
2017-11-08  4:50                     ` Chris Murphy
2017-11-08 12:13                       ` Austin S. Hemmelgarn
2017-11-08 17:17                         ` Chris Murphy
2017-11-08 17:22                           ` Hugo Mills
2017-11-08 17:54                             ` Chris Murphy
2017-11-08 18:10                               ` Austin S. Hemmelgarn
2017-11-08 18:31                                 ` Chris Murphy
2017-11-08 19:29                                   ` Austin S. Hemmelgarn
2017-10-31  1:58           ` Duncan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJCQCtReVi8k+OoitBBUZm_T+mj9dkV6GCVs7JMSG1myR78AGw@mail.gmail.com \
    --to=lists@colorremedies.com \
    --cc=ahferroin7@gmail.com \
    --cc=davestechshop@gmail.com \
    --cc=kilobyte@angband.pl \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=mkh@rqc.ru \
    --cc=vanandel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).