From: Qu Wenruo <quwenruo@cn.fujitsu.com>
To: Hugo Mills <hugo@carfax.org.uk>,
	Eric Sandeen <sandeen@redhat.com>,
	Austin S Hemmelgarn <ahferroin7@gmail.com>,
	Christoph Anton Mitterer <calestyo@scientia.net>,
	Duncan <1i5t5.duncan@cox.net>, <linux-btrfs@vger.kernel.org>
Subject: Re: shall distros run btrfsck on boot?(Off topic, btrfs per-inode tree idea)
Date: Wed, 25 Nov 2015 09:59:59 +0800
Message-ID: <5655161F.5070309@cn.fujitsu.com>
In-Reply-To: <20151124223349.GV24333@carfax.org.uk>



Hugo Mills wrote on 2015/11/24 22:33 +0000:
> On Tue, Nov 24, 2015 at 04:26:47PM -0600, Eric Sandeen wrote:
>> On 11/24/15 2:38 PM, Austin S Hemmelgarn wrote:
>>
>>> if the system was
>>> shut down cleanly, you're fine barring software bugs, but if it
>>> crashed, you should be running a check on the FS.
>>
>> Um, no...
>>
>> The *entire point* of having a journaling filesystem is that after a
>> crash or power loss, a journal replay on next mount will bring the
>> metadata into a consistent state.
>
>     Not an actual argument within the discussion, but an interesting
> observation on a fine distinction:
>
>     It's interesting to note that there's a difference here between
> journalling and CoW filesystems. A journalling FS needs a journal
> replay to become consistent. A CoW FS is _always_ consistent, by
> design. Now, btrfs has a log that should be replayed after an unclean
> shutdown, but that's all about the data that got written within the
> current transaction that wasn't committed,

In fact, the log tree in btrfs is only used to speed up fsync. There is
also a "notreelog" mount option to disable the log tree; if one uses it,
fsync performance will just drop to the level of a full sync.
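
A rough way to check this on a test machine is to time fsync() and sync()
for a small append, once on a default mount and once on a mount with
"notreelog". The sketch below is only a probe under those assumptions; the
helper names (time_fsync, time_sync, compare) are made up for illustration,
and it says nothing about what numbers to expect.

```python
import os
import tempfile
import time


def _append(path, payload):
    """Append payload to path and return the still-open file descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    os.write(fd, payload)
    return fd


def time_fsync(path, payload=b"x" * 4096):
    """Append payload, then return the seconds spent in fsync() on that file."""
    fd = _append(path, payload)
    try:
        start = time.perf_counter()
        os.fsync(fd)  # persist only this file's data and metadata
        return time.perf_counter() - start
    finally:
        os.close(fd)


def time_sync(path, payload=b"x" * 4096):
    """Append payload, then return the seconds spent in a system-wide sync()."""
    fd = _append(path, payload)
    os.close(fd)
    start = time.perf_counter()
    os.sync()  # flush everything dirty, on every mounted filesystem
    return time.perf_counter() - start


def compare(directory):
    """Return both timings for a scratch file created inside directory."""
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        path = os.path.join(scratch, "probe")
        return {"fsync": time_fsync(path), "sync": time_sync(path)}
```

Calling compare() on a directory inside the btrfs mount under test (for
example a placeholder like /mnt/test) gives one pair of timings; repeating
it after mounting with "-o notreelog" gives the pair to compare against. A
single 4 KiB append is a very small sample, so several runs would be needed
before reading anything into the difference.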

So it's just an optimization. Although this is already quite far from the
original topic, I think the best way for btrfs to improve fsync
performance is to introduce something like what ext* has:

Per-file extent map tree.


The reason btrfs is slow on fsync is that file extents and inode info are
all stored in the same tree (the fs tree or subvolume tree).

To fsync a single inode, it's impossible to write back only that inode's
file extents; the whole tree has to be synced, which may be just as slow
as a full sync.

That's why the log tree was introduced: fsync only writes back the file
extents of the inode and records its metadata changes in the log tree.
Performance test results also support this.


But other filesystems, at least ext*, use a better solution: each inode
(whether a regular file or a dir) has its own tree to record its file
extents or dir entries.
This makes fsync quite easy and straightforward.
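
To make the difference between the two layouts concrete, here is a toy
model in Python. It is only a sketch under simplifying assumptions: the
class names are invented, "flushing" is reduced to counting dirty items,
and neither class reflects how btrfs or ext* is actually implemented (in
particular, the shared-tree model leaves out the log tree).

```python
from collections import defaultdict


class SharedTreeFS:
    """Toy layout: every inode keeps its items in one shared tree."""

    def __init__(self):
        self.tree = {}      # (ino, offset) -> extent length
        self.dirty = set()  # dirty keys, across all inodes

    def write(self, ino, offset, length):
        self.tree[(ino, offset)] = length
        self.dirty.add((ino, offset))

    def fsync(self, ino):
        # With no separate log, persisting one inode means writing back
        # every dirty item in the shared tree, whoever owns it.
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


class PerInodeTreeFS:
    """Toy layout: each inode owns a private extent map."""

    def __init__(self):
        self.trees = defaultdict(dict)  # ino -> {offset: extent length}
        self.dirty = defaultdict(set)   # ino -> dirty offsets

    def write(self, ino, offset, length):
        self.trees[ino][offset] = length
        self.dirty[ino].add(offset)

    def fsync(self, ino):
        # Only this inode's own tree has to be written back.
        flushed = len(self.dirty[ino])
        self.dirty[ino].clear()
        return flushed


def demo():
    """Dirty one extent in each of 100 files, then fsync only inode 1."""
    shared, per_inode = SharedTreeFS(), PerInodeTreeFS()
    for fs in (shared, per_inode):
        for ino in range(1, 101):
            fs.write(ino, 0, 4096)
    return shared.fsync(1), per_inode.fsync(1)
```

In this model demo() returns (100, 1): the shared tree flushes the dirty
items of all 100 files to persist one of them, while the per-inode layout
flushes only the single item that belongs to inode 1.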

If btrfs followed the same design, at least random RW performance might
get a boost, and the fsync code would be simpler.

Thanks,
Qu


> rather than about FS
> metadata consistency. This means that a read-only mount of btrfs can
> _actually_ be read-only, not modifying any of the data on the disk,
> whereas a read-only mount of a journalling FS _must_ modify the disk
> data after an unclean shutdown, in order to be useful (because the FS
> isn't consistent without the journal replay).
>
>     Hugo.
>



