From: Alex Lyakas <alex.btrfs@zadarastorage.com>
To: Josef Bacik <jbacik@fb.com>
Cc: linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Snapshot aware defrag and qgroups thoughts
Date: Thu, 19 Jun 2014 12:46:46 +0300 [thread overview]
Message-ID: <CAOcd+r3BQJQ61P3FfwyA8_QT5Ov_F3iktf86A_a5gra-6JeQUw@mail.gmail.com> (raw)
In-Reply-To: <53553172.9030309@fb.com>
Hi Josef,
thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to
remember all the call chains. Maybe somebody will find them useful to
accompany your notes:
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing
Thanks,
Alex.
On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik <jbacik@fb.com> wrote:
> We have a big problem, but it involves a lot of moving parts, so I'm
> going to explain all of the parts, and then the problem, and then what
> I am doing to fix the problem. I want you guys to check my work to
> make sure I'm not missing something, so when I come back from
> paternity leave in a few weeks I can just sit down and finish this
> work out.
>
> ===== Extent refs =======
>
> This is basically how extent refs work. You have either
>
> key.objectid = bytenr;
> key.type = BTRFS_EXTENT_ITEM_KEY;
> key.offset = length;
>
> or you have
>
> key.objectid = bytenr;
> key.type = BTRFS_METADATA_ITEM_KEY;
> key.offset = level of the metadata block;
>
> in the case of skinny metadata. Then you have the extent item, which
> describes the number of refs and such, followed by 1 or more inline
> refs. All you need to know for this problem is how I'm going to
> describe them. What I call a "normal ref" or a "full ref" is a
> reference that has the actual root information in the ref. What I
> call a "shared ref" is one where we only know the tree block that
> owns the particular ref. So how does this work in practice?
>
> 1) Normal allocation - metadata: We allocate a tree block as we add
> new items to a tree. We know that this root owns this tree block, so
> we create a normal ref with the root objectid in the extent ref. We
> also set the owner of the block itself to our objectid. This is
> important to keep in mind.
>
> 2) Normal allocation - data: We allocate some data for a given fs
> tree and we add an extent ref with the root objectid of the tree we
> are in, the inode number, and the logical offset of this extent
> within the inode.
>
> 3) Splitting a data extent: We write to the middle of an existing
> extent. We will split this extent into two BTRFS_EXTENT_DATA_KEY
> items and then increase the ref count of the original extent by 1.
> This means we look up the extent ref for root->objectid, inode
> number, and the _original_ inode offset. We don't create another
> extent ref; this is important to keep in mind.
>
> ===== btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =====
>
> But Josef, didn't you say there were shared refs? Why yes I did, but
> I need to explain it in context of the people who actually do the
> dirty work. We'll start with the easy case.
>
> 1) btrfs_copy_root - where snapshots start: When we make a snapshot
> we call this function, which allocates a completely new block with a
> new root objectid and then memcpy's the original root we are
> snapshotting. Then we call btrfs_inc_ref on our new buffer, which
> will walk all items in that buffer and add a new normal ref for our
> new root to each of the blocks it points to. This happens only for
> the level directly below the new root, nothing further down.
>
> 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These
> guys are responsible for dealing with the particular action we want
> to make on our given buffer. So if we are free'ing our buffer, we
> need to drop any refs it has to the blocks it points to. For
> level > 0 this means modifying refs for all of the tree blocks it
> points to. For level == 0 this means modifying refs for any data
> extents the leaf may point to.
>
> 3) update_ref_for_cow - this is where the magic happens: This has a
> few different modes of operation, but every operation starts by
> checking whether the block is shared, that is, whether we have been
> snapshotted and, if so, whether this block has changed since we
> snapshotted. If it is shared then we look up the extent refs and the
> flags. If not then we carry on. From here we have a few options.
>
> 3a) Not shared: Don't do anything, we can do our normal cow
> operations and carry on.
>
> 3b) Shared and cowing from the owning root: This is where
> btrfs_header_owner() is important. If we own this block and it is
> shared then we know that none of the upper levels will have a normal
> ref to anything underneath this block, so we need to add a shared ref
> for anything this block points to. So the first thing we do is
> btrfs_inc_ref(), but we set the full backref flag. This means that
> when we add refs for everything this block points to we don't use a
> root objectid, we use the bytenr of this block. Then we set
> BTRFS_BLOCK_FLAG_FULL_BACKREF in the extent flags for this block.
>
> 3c) Shared and cowing from a root that isn't the owner: If we are
> cowing down from the snapshot we need to make sure that any block we
> own completely ourselves has normal refs for any blocks it points to.
> So we cow down and hit a shared block that we aren't the owner of, so
> we do btrfs_inc_ref() for our block without setting full_backref, and
> we get a normal ref, with our actual root objectid, on everything
> this block points to.
>
> 3d) Last ref but has FULL_BACKREF set: The last time we cow a block
> that was converted to a full backref in 3b. This is where we call
> btrfs_inc_ref() without full_backref set so we get our normal root
> objectid ref added, and then we call btrfs_dec_ref() with
> full_backref set so that we remove the shared ref for this block
> since it is about to be deleted.
>
> ===== Delayed refs ======
>
> In order to make sure we don't recursively update the extent tree for
> every modification as we cow down a tree we have delayed refs. Every
> time we modify the ref count of an extent (allocate, inc ref, free)
> we allocate a delayed ref with the information about that ref. So
> every call to btrfs_inc_ref, btrfs_dec_ref, btrfs_alloc_tree_block,
> btrfs_alloc_reserved_file_extent, and btrfs_free_extent ends up
> allocating a delayed ref. The nice thing about this is it can keep us
> from actually modifying the extent tree. If we cow a block to free
> items and then subsequently free that block we can avoid touching the
> extent tree altogether.
>
> PROBLEM: For qgroups we needed a way to have a coherent view of the
> extent tree at a given time, which is where our sequence numbers come
> in. Each delayed ref now gets a sequence number, and a different
> sequence number means we can't do this in-memory merging right away
> like we used to. Now instead we have btrfs_merge_delayed_refs, which
> works pretty well, but is O(n) in the number of delayed refs we have
> for a given extent. Usually this is a small number, but there is one
> key area where this isn't true, which is snapshot aware defrag.
>
> ====== Snapshot aware defrag =====
>
> This has a couple of problems with it, but the problem that relates
> to this email is that we can have many many updates for one extent.
> So remember how we will add an extent ref every time we split an
> extent? Imagine writing a 128mb extent and then overwriting every
> other block. That is going to end up with 32768 file extents all
> pointing to the same extent. Now snapshot this fs 5 thousand times,
> and you have 163,840,000 extents you are going to be updating when
> you defrag this range. How this then ends up is you have that many
> delayed refs outstanding at the end of your defrag operation, that is
> if you are lucky enough to not OOM. This is because of the sequence
> numbers; we can't merge them until we actually go to run them. So now
> you've made it to actually calling btrfs_merge_delayed_refs, and then
> you wonder why your box just livelocked for 10 seconds for no
> apparent reason. Without the sequence numbers we would have merged
> each of these operations as we added them, so when we finished the
> defrag operation we'd have 5000 delayed refs, each with a -32768
> count, for the freeing of the fragmented extent.
>
> Now obviously 1 part of this fix is to do one root at a time so we
> aren't queueing up N snapshots' worth of delayed refs, but that still
> leaves the actual number of extents we have, which is why we need to
> kill/mitigate the pain of the sequence numbers.
>
> ======= Qgroups =======
>
> I'm starting over with the qgroup accounting because it simply needs
> to operate without as much dependence on the sequence numbers. So
> here is how it is going to work:
>
> 1) Create a delayed ref sort of scheme for quota operations. This lets us
> process things in bulk without having to worry about weird locking and such.
>
> 2) Only create these quota operations when we modify the actual refs.
> Part of the problem with the current work is that it tries to figure
> out what quota stuff is going to be used based on delayed refs. These
> updates could end up being chucked, which will screw up our counts.
> So instead only add a qgroup action when we add the first ref for
> that root objectid or remove the last ref for that root objectid.
> This means hooking into the actual ref-modifying code in
> extent-tree.c. I have this part pretty well nailed down in my
> qgroup-work branch in btrfs-next.
>
> 3) Create qgroup operations in update_ref_for_cow. One of the tricky
> parts of killing sequence numbers is that we can completely lose
> extent ref operations that we really do care about. So say we
> snapshot, which means we convert all of the referenced space in the
> original root from exclusive to shared and add it to the referenced
> count of the snapshot. If we rm -rf that snapshot we will cow down
> only to free the leaf, which means that the delayed ref will get
> dropped. This will cause us to lose the qgroup action, because we
> only account for real ref modifications to the tree, and so we won't
> convert this space back to exclusive space for the original tree. So
> what we need to do here is add a qgroup operation saying that we are
> converting shared space to exclusive space with the cow, and then set
> a flag on the extent buffer so we know that we have an outstanding
> convert operation. When this convert operation runs we simply look
> the extent buffer up and clear the flag. If we try to free the tree
> block before then, we make sure to add a real qgroup operation.
>
> 4) The tricky part is data, because we have implied refs, that is,
> refs to data with no real way to know about them other than looking
> up all of their parents. We will still need sequence numbers for this
> case, but used very differently than they are now. We only care about
> knowing a given point in time previous to one special case. The
> special case is when we remove what looks like the last ref for this
> extent for this root but there are still outstanding refs on the
> extent. In this case we need to look up the backrefs to make sure
> there isn't an implied ref from our root to this data block. This is
> the point at which we increment the sequence number. So what happens
> is you have a whole bunch of delayed refs with the same sequence
> number, then we hit this case and inc the sequence number, and you
> have a whole bunch of delayed refs with the new sequence number. You
> still have to post-process merge to make sure, but you are far more
> likely to merge everything in real-time since you are only changing
> the sequence number every once in a while instead of for every new
> delayed ref. Then you look up the backrefs for everything that
> happened up until, but not after, your sequence number to see if you
> had an implied ref for the data.
>
> ==== Conclusion =====
>
> I realize this is a lot of information, but I'm not the only one
> working in this area at the moment, so I thought it was important to
> get it all on paper so we're all on the same page. Please let me know
> if you have questions or suggestions. Like I said, I'm on paternity
> leave so there's going to be a lag time for my responses. Thanks,
>
> Josef