From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: "Ellis H. Wilson III" <ellisw@panasas.com>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs-cleaner / snapshot performance analysis
Date: Tue, 13 Feb 2018 17:14:01 -0800 [thread overview]
Message-ID: <20180214011401.GA5201@magnolia> (raw)
In-Reply-To: <659ed727-f5dc-e81b-c01c-b6a063f3ed34@gmx.com>
On Sun, Feb 11, 2018 at 02:40:16PM +0800, Qu Wenruo wrote:
>
>
> On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> > Hi all,
> >
> > I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> > impacts foreground performance, specifically during snapshot deletion.
> > My experience so far has been that it can be dramatically disruptive to
> > foreground I/O.
> >
> > Looking through the wiki at kernel.org I have not yet stumbled onto any
> > analysis that would shed light on this specific problem. I have found
> > numerous complaints about btrfs-cleaner online, especially relating to
> > quotas being enabled. This has proven thus far less than helpful, as
> > the response tends to be "use less snapshots," or "disable quotas," both
> > of which strike me as intellectually unsatisfying answers, especially
> > the former in a filesystem where snapshots are supposed to be
> > "first-class citizens."
>
> Yes, btrfs snapshots really are "first-class citizens".
> Many design decisions are biased toward snapshots.
>
> But one should be clear about one thing:
> snapshot creation and backref walking (used by qgroup, relocation and
> extent deletion) are in fact two conflicting workloads.
>
> Btrfs gives snapshot creation very high priority, at the cost of
> greatly degraded backref walk performance (used by snapshot deletion,
> relocation, and qgroup's exclusive/shared extent accounting).
>
> Let me explain this problem in detail.
>
> Just as explained by Peter Grandi, any snapshot system (or any
> system that supports reflink) must have a reverse mapping tree to
> tell which extent is used by whom.
>
> It is critical for determining whether an extent is shared, which in
> turn decides whether we need to CoW it.
>
> There are several different ways to implement it, and the choice
> hugely affects snapshot performance.
>
> 1) Direct mapping record
>    Records exactly which extent is used by whom, directly.
>    So when we need to check the owner, we search the tree ONCE and
>    we have the answer.
>
>    This is simple, and it seems that LVM thin-provisioning and the
>    traditional LVM targets both use this approach.
>    (Maybe XFS also works this way?)
Yes, it does.
>    Pros:
>    *FAST* backref walk, which means quick extent deletion and quick
>    CoW condition checks.
>
>    Cons:
>    *SLOW* snapshot creation.
>    Each snapshot creation must insert new owner relationships into
>    the tree, and the amount of modification grows with the size of
>    the snapshot source.
...of course xfs also doesn't support snapshots. :)
--D
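The single-lookup property of the direct scheme can be sketched with a toy model. (The node names follow the examples later in this mail; the dictionary is purely illustrative, not btrfs or XFS internals.)

```python
# Toy direct mapping record: every extent's owners are stored
# explicitly, so "who uses this extent?" is answered by ONE lookup.

owners = {
    "D": {"X"},          # leaf D is owned by tree X only
    "B": {"X", "Y"},     # node B is shared by trees X and Y
}

def is_shared(extent):
    """A single reverse-mapping lookup answers the CoW condition check."""
    return len(owners[extent]) > 1

print(is_shared("D"), is_shared("B"))  # False True
```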
> 2) Indirect mapping record
>    Records only the upper-level referencers.
>
>    To get all the owners of an extent, multiple lookups in the
>    reverse mapping tree are needed.
>
>    And obviously, btrfs uses this method.
>
>    Pros:
>    *FAST* owner inheritance, which means fast snapshot creation.
>    (Well, the only advantage I can think of.)
>
>    Cons:
>    *VERY SLOW* backref walk, used by extent deletion, relocation,
>    qgroup and CoW condition checks.
>    (That may also be why btrfs defaults to CoW for data, so it can
>    skip the costly backref walk.)
>
> A more detailed example of the difference between them:
>
> [Basic tree layout]
>            Tree X
>            node A
>           /      \
>      node B      node C
>      /    \      /    \
>  leaf D leaf E leaf F leaf G
>
> Use the above tree X as the snapshot source.
>
> [Snapshot creation: Direct mapping]
> With direct mapping records, if we create snapshot Y, we get:
>
>    Tree X      Tree Y
>    node A     <node H>
>      |  \     /  |
>      |   \   /   |
>      |    \ /    |
>      |     X     |
>      |    / \    |
>      |   /   \   |
>    node B     node C
>    /    \     /    \
> leaf D leaf E leaf F leaf G
>
> We need to create the new node H, and update the owner records of
> nodes B/C and leaves D/E/F/G.
>
> That is to say, we create 1 new node and update 6 references of
> existing nodes/leaves.
> This cost grows quickly for large trees, but the increase is still
> linear.
>
>
> [Snapshot creation: Indirect mapping]
> With an indirect mapping tree, first of all, the reverse mapping tree
> doesn't record the exact owner of each leaf/node, but only its
> parent(s).
>
> So even when tree X exists alone, without snapshot Y, if we need to
> know the owner of leaf D, we only learn that its sole parent is node
> B. We then repeat the query on node B, and so on, until we reach
> node A and find that leaf D is owned by tree X.
>
>    Tree X        ^
>    node A        |   Look upward until
>     /            |   we reach the tree root
>    node B        |   to find the owner
>     /            |   of a leaf/node
>    leaf D        |
>
> So even in the best case, looking up the owner of leaf D takes 3
> lookups: one for leaf D, one for node B, and one for node A (where
> the walk ends).
> Such lookups get more and more complex as extra branches appear in
> the lookup chain.
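The upward walk for leaf D can be sketched as a toy loop over parent pointers. (A deliberately simplified model of the single-parent case, not the real btrfs backref code.)

```python
# Toy upward owner walk, mirroring the leaf D -> node B -> node A chain:
# each level costs one reverse-mapping lookup.

parents = {"D": "B", "B": "A", "A": None}  # None marks a tree root

def find_owner(extent):
    """Walk parent pointers upward, counting lookups until the root."""
    lookups = 0
    while extent is not None:
        lookups += 1              # one lookup per level
        extent = parents[extent]
    return lookups

print(find_owner("D"))  # 3 lookups: leaf D, node B, node A
```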
>
> But this complicated design makes one thing easier: snapshot
> creation.
>    Tree X      Tree Y
>    node A     <node H>
>      |  \     /  |
>      |   \   /   |
>      |    \ /    |
>      |     X     |
>      |    / \    |
>      |   /   \   |
>    node B     node C
>    /    \     /    \
> leaf D leaf E leaf F leaf G
>
> It's still the same tree Y, snapshotted from tree X.
>
> Apart from creating the new node H, we only need to update the
> references of nodes B and C.
>
> So far so good: with indirect mapping we reduced the modifications to
> the reverse mapping tree from 6 to 2.
> And the reduction becomes even more pronounced as the tree grows.
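The 6-vs-2 bookkeeping contrast can be sketched over the example tree. (The data structures are illustrative stand-ins, not btrfs's on-disk format.)

```python
# Contrast the snapshot-creation cost of the two schemes for the
# example tree (root A, nodes B/C, leaves D..G; new snapshot root H).

# Direct mapping: each extent lists all owning trees.
direct = {x: {"X"} for x in "ABCDEFG"}

# Indirect mapping: each extent lists only its parent node(s).
indirect = {"A": [], "B": ["A"], "C": ["A"],
            "D": ["B"], "E": ["B"], "F": ["C"], "G": ["C"]}

def snapshot_direct():
    """Every shared node/leaf needs a new owner record: 6 updates."""
    updates = 0
    for x in "BCDEFG":          # everything below the root is shared
        direct[x].add("Y")
        updates += 1
    direct["H"] = {"Y"}         # plus 1 new node for the new root
    return updates

def snapshot_indirect():
    """Only the root's direct children gain a new parent: 2 updates."""
    indirect["H"] = []          # plus 1 new node for the new root
    updates = 0
    for x in ("B", "C"):
        indirect[x].append("H")
        updates += 1
    return updates

print(snapshot_direct(), snapshot_indirect())  # 6 2
```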
>
> But the price is paid at snapshot deletion:
>
> [Snapshot deletion: Direct mapping]
>
> To delete snapshot Y:
>
>    Tree X      Tree Y
>    node A     <node H>
>      |  \     /  |
>      |   \   /   |
>      |    \ /    |
>      |     X     |
>      |    / \    |
>      |   /   \   |
>    node B     node C
>    /    \     /    \
> leaf D leaf E leaf F leaf G
>
> Quite straightforward: just check the owner of each node/leaf to see
> whether it can be deleted.
>
> For direct mapping, we do the owner lookup in the reverse mapping
> tree 7 times, and find that only node H can be deleted.
>
> That's all: the same amount of work for snapshot creation and
> deletion. Not bad.
>
> [Snapshot deletion: Indirect mapping]
> Here we still need to do the lookups, 7 of them.
>
> But the difference is that each lookup can trigger extra lookups.
>
> For node H, a single lookup suffices, as it is a root.
> But leaf G needs 4 lookups:
>
>    Tree X      Tree Y
>    node A     <node H>
>        \       /
>         \     /
>          \   /
>          node C
>            |
>          leaf G
>
> One for leaf G itself, one for node C, one for node A (a parent of
> node C), and one for node H (the other parent of node C).
>
> Summing up, indirect mapping needs:
>   1 lookup for node H
>   3 lookups each for nodes B and C
>   4 lookups each for leaves D~G
>
> 23 lookup operations in total.
>
> And it only gets worse with more snapshots, and the increase is not
> linear.
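The 23-lookup tally can be reproduced with a toy recursive walk over the example's parent lists. (Illustrative only; the real btrfs backref walk works on extent tree items, not a Python dict.)

```python
# Toy full backref walk: resolve ALL owners of every node/leaf of
# snapshot Y, counting reverse-mapping lookups with no early exit.

# parents[extent] = the nodes referencing it; roots have no parents.
parents = {
    "H": [],                    # root of tree Y
    "A": [],                    # root of tree X
    "B": ["A", "H"], "C": ["A", "H"],
    "D": ["B"], "E": ["B"], "F": ["C"], "G": ["C"],
}

def owner_walk(extent):
    """Count the lookups needed to resolve every owner of an extent."""
    lookups = 1                 # one lookup for the extent itself
    for p in parents[extent]:
        lookups += owner_walk(p)
    return lookups

# Snapshot Y's 7 nodes/leaves: H, B, C, D, E, F, G.
total = sum(owner_walk(x) for x in "HBCDEFG")
print(total)  # 23: 1 (H) + 3+3 (B, C) + 4*4 (D..G)
```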
>
>
> We can apply some optimization though: for the extent deletion above
> we don't really care about all the owners of a node/leaf, only about
> whether the extent is shared.
>
> In that case, once we find that node C is also shared by tree X, we
> don't need to check node H.
> With this optimization, the lookup count drops to 17.
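The shared-extent shortcut can be sketched by stopping the walk as soon as it reaches a root of another tree. (Illustrative only; with the parent order below, tree X's root is reached before tree Y's, which reproduces the 17 quoted above.)

```python
# Toy shared-or-not check: the walk exits early once a foreign root is
# reached, since that already proves the extent is shared.
from collections import deque

parents = {
    "H": [], "A": [],
    "B": ["A", "H"], "C": ["A", "H"],
    "D": ["B"], "E": ["B"], "F": ["C"], "G": ["C"],
}

def shared_check(extent, deleting_root="H"):
    """Count lookups until the extent is proven shared (or fully walked)."""
    lookups = 0
    queue = deque([extent])
    while queue:
        x = queue.popleft()
        lookups += 1
        if not parents[x] and x != deleting_root:
            break               # a root of another tree: shared, stop
        queue.extend(parents[x])
    return lookups

total = sum(shared_check(x) for x in "HBCDEFG")
print(total)  # 17: 1 (H) + 2+2 (B, C) + 3*4 (D..G)
```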
>
>
> But then come qgroup and balance, which can't use such optimization,
> as they need to find all the owners to handle the ownership change
> (the relocation tree for balance, and the qgroup number changes for
> quota).
>
> That's why quota has such an obvious performance impact.
>
>
> So, in short conclusions:
> 1) A snapshot is not a single uniform workload.
>    Creation and deletion are different workloads, at least for btrfs.
>
> 2) Snapshot deletion and qgroup are the biggest costs, by btrfs
>    design.
>    Either reduce the number of snapshots to reduce branching, or
>    disable quota so backref lookups can take the shared-extent
>    shortcut.
>
> Thanks,
> Qu
>
>
> >
> > The 2007 and 2013 Rodeh papers don't do the thorough practical snapshot
> > performance analysis I would expect to see given the assertions in the
> > latter that "BTRFS...supports efficient snapshots..." The former is
> > sufficiently pre-BTRFS that while it does performance analysis of btree
> > clones, it's unclear (to me at least) if the results can be
> > forward-propagated in some way to real-world performance expectations
> > for BTRFS snapshot creation/deletion/modification.
> >
> > Has this analysis been performed somewhere else and I'm just missing it?
> > Also, I'll be glad to comment on my specific setup, kernel version,
> > etc, and discuss pragmatic work-arounds, but I'd like to better
> > understand the high-level performance implications first.
> >
> > Thanks in advance to anyone who can comment on this. I am very inclined
> > to read anything thrown at me, so if there is documentation I failed to
> > read, please just send the link.
> >
> > Best,
> >
> > ellis
>