From: Alex Lyakas
To: Josef Bacik
Cc: linux-btrfs
Subject: Re: Snapshot aware defrag and qgroups thoughts
Date: Thu, 19 Jun 2014 12:46:46 +0300

Hi Josef,
thanks for the detailed description of how the extent tree works! When I
was digging through that in the past, I made some slides to remember all
the call chains. Maybe somebody finds them useful to accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing

Thanks,
Alex.

On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik wrote:
> We have a big problem, but it involves a lot of moving parts, so I'm
> going to explain all of the parts, then the problem, and then what I am
> doing to fix the problem. I want you guys to check my work to make sure
> I'm not missing something, so when I come back from paternity leave in a
> few weeks I can just sit down and finish this work out.
>
> ===== Extent refs =====
>
> This is basically how extent refs work. You have either
>
>   key.objectid = bytenr;
>   key.type = BTRFS_EXTENT_ITEM_KEY;
>   key.offset = length;
>
> or, in the case of skinny metadata,
>
>   key.objectid = bytenr;
>   key.type = BTRFS_METADATA_ITEM_KEY;
>   key.offset = level of the metadata block;
>
> Then you have the extent item, which describes the number of refs and
> such, followed by 1 or more inline refs. All you need to know for this
> problem is how I'm going to describe them.
> What I call a "normal ref" or a "full ref" is a reference that has the
> actual root information in the ref. What I call a "shared ref" is one
> where we only know the tree block that owns the particular ref. So how
> does this work in practice?
>
> 1) Normal allocation - metadata: We allocate a tree block as we add new
> items to a tree. We know that this root owns this tree block, so we
> create a normal ref with the root objectid in the extent ref. We also
> set the owner of the block itself to our objectid. This is important to
> keep in mind.
>
> 2) Normal allocation - data: We allocate some data for a given fs tree,
> and we add an extent ref with the root objectid of the tree we are in,
> the inode number, and the logical offset into the inode for this inode.
>
> 3) Splitting a data extent: We write to the middle of an existing
> extent. We will split this extent into two BTRFS_EXTENT_DATA_KEY items
> and then increase the ref count of the original extent by 1. This means
> we look up the extent ref for root->objectid, inode number, and the
> _original_ inode offset. We don't create another extent ref; this is
> important to keep in mind.
>
> ===== btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =====
>
> But Josef, didn't you say there were shared refs? Why yes I did, but I
> need to explain it in the context of the people who actually do the
> dirty work. We'll start with the easy case.
>
> 1) btrfs_copy_root - where snapshots start: When we make a snapshot we
> call this function, which allocates a completely new block with a new
> root objectid and then memcpy's the original root we are snapshotting.
> Then we call btrfs_inc_ref on our new buffer, which will walk all items
> in that buffer and add a new normal ref for our new root to each of the
> blocks it points to. This is only at the level below the new root,
> nothing below that point.
> 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys
> are responsible for dealing with the particular action we want to take
> on a given buffer. So if we are freeing our buffer, we need to drop any
> refs it has to the blocks it points to. For level > 0 this means
> modifying refs for all of the tree blocks it points to. For level == 0
> this means modifying refs for any data extents the leaf may point to.
>
> 3) update_ref_for_cow - this is where the magic happens: This has a few
> different modes of operation, but every operation starts by checking
> whether the block is shared, that is, whether we have been snapshotted
> and, if so, whether this block has changed since we snapshotted. If it
> is shared then we look up the extent refs and the flags. If not then we
> carry on. From here we have a few options.
>
> 3a) Not shared: Don't do anything; we can do our normal cow operations
> and carry on.
>
> 3b) Shared and cowing from the owning root: This is where
> btrfs_header_owner() is important. If we own this block and it is
> shared, then we know that none of the upper levels will have a normal
> ref to anything underneath this block, so we need to add a shared ref
> for anything this block points to. So the first thing we do is
> btrfs_inc_ref(), but with the full backref flag set. This means that
> when we add refs for everything this block points to, we don't use a
> root objectid, we use the bytenr of this block. Then we set
> BTRFS_BLOCK_FLAG_FULL_BACKREF in the extent flags for this given block.
>
> 3c) Shared and cowing from not the owning root: If we are cowing down
> from the snapshot we need to make sure that any block we own completely
> ourselves has normal refs for any blocks it points to.
> So we cow down and hit a shared block that we aren't the owner of, so we
> do btrfs_inc_ref() for our block without setting full_backref, so we get
> a normal ref, with our actual root objectid, on everything this block
> points to.
>
> 3d) Last ref but has FULL_BACKREF set: The last time we cow a block that
> was converted to a full backref in 3b. This is where we call
> btrfs_inc_ref() without full_backref set, so we get our normal root
> objectid ref added, and then we call btrfs_dec_ref() with full_backref
> set, so that we remove the shared ref for this block since it is about
> to be deleted.
>
> ===== Delayed refs =====
>
> In order to make sure we don't recursively update the extent tree for
> every modification as we cow down a tree, we have delayed refs. Every
> time we modify the ref count of an extent (allocate, inc ref, free) we
> allocate a delayed ref with the information about that ref. So every
> call to btrfs_inc_ref, btrfs_dec_ref, btrfs_alloc_tree_block,
> btrfs_alloc_reserved_file_extent, and btrfs_free_extent ends up
> allocating a delayed ref. The nice thing about this is that it can keep
> us from actually modifying the extent tree. If we cow a block to free
> items and then subsequently free that block, we can avoid touching the
> extent tree altogether.
>
> PROBLEM: For qgroups we needed a way to have a coherent view of the
> extent tree at a given time, which is where our sequence numbers come
> in. Each delayed ref now gets a sequence number, and a different
> sequence number means we can't do this in-memory merging right away like
> we used to. Now instead we have btrfs_merge_delayed_refs, which works
> pretty well, but is O(n) in the number of delayed refs we have for a
> given extent. Usually this is a small number, but there is one key area
> where this isn't true, which is snapshot aware defrag.
> ===== Snapshot aware defrag =====
>
> This has a couple of problems with it, but the problem that relates to
> this email is that we can have many, many updates for one extent. So
> remember how we update the extent ref every time we split an extent?
> Imagine writing a 128MB extent and then overwriting every other block.
> That is going to end up with 32768 file extents all pointing to the same
> extent. Now snapshot this fs 5 thousand times, and you have 163,840,000
> extent updates when you defrag this range. How this ends up is that you
> have that many delayed refs outstanding at the end of your defrag
> operation, that is, if you are lucky enough to not OOM. This is because
> of the sequence numbers; we can't merge them until we actually go to run
> them. So now you've made it to actually calling
> btrfs_merge_delayed_refs, and then you wonder why your box just
> livelocked for 10 seconds for no apparent reason. Without the sequence
> numbers we would have merged each of these operations as we added them,
> so when we finished the defrag operation we'd have 5000 delayed refs,
> each with a -32768 count, for the freeing of the fragmented extent.
>
> Now obviously 1 part of this fix is to do one root at a time, so we
> aren't queueing up N snapshots' worth of delayed refs, but that still
> leaves the actual number of extents we have, which is why we need to
> kill/mitigate the pain of the sequence numbers.
>
> ===== Qgroups =====
>
> I'm starting over with the qgroup accounting because it simply needs to
> operate without as much dependence on the sequence numbers. So here is
> how it is going to work.
>
> 1) Create a delayed ref sort of scheme for quota operations. This lets
> us process things in bulk without having to worry about weird locking
> and such.
>
> 2) Only create these quota operations when we modify the actual refs.
> Part of the problem with the current work is that it tries to figure out
> what quota stuff is going to be used based on delayed refs. These
> updates could end up being chucked, which will screw up our counts. So
> instead, only add a qgroup action when we add the first ref for that
> root objectid or remove the last ref for that root objectid. This will
> mean adding stuff to the actual ref modifying code in extent-tree.c. I
> have this part pretty well nailed down in my qgroup-work branch in
> btrfs-next.
>
> 3) Create qgroup operations in update_ref_for_cow. One of the tricky
> parts of killing sequence numbers is that we can completely lose extent
> ref operations that we really do care about. So say we snapshot, which
> means we convert all of the referenced space in the original root to not
> exclusive and add it to the referenced count of the snapshot. If we
> rm -rf that snapshot we will cow down only to free the leaf, which means
> that the delayed ref will get dropped. This will cause us to lose the
> qgroup action, because we only account for real ref modifications to the
> tree, and so we won't convert this space back to exclusive space for the
> original tree. So what we need to do here is add a qgroup operation
> saying that we are converting shared space to exclusive space with the
> cow, and then set a flag on the extent buffer so we know that we have an
> outstanding convert operation. When this convert operation runs we
> simply look the extent buffer up and clear the flag. If we try to free
> the tree block before then, we make sure to add a real qgroup operation.
>
> 4) The tricky part is data, because we have implied refs, that is, refs
> to data with no real way to know about them other than to look up all of
> the extent's parents. We will still need sequence numbers in a way for
> this case, but used very differently from what we have now. We only
> care about knowing about a given point in time previous to this special
> case.
> The special case is when we remove what looks like the last ref for this
> extent for this root, but there are still outstanding refs on the
> extent. In this case we need to look up the backrefs to make sure there
> isn't an implied ref from our root to this data block. This is the
> point at which we increment the sequence number. So what happens is you
> have a whole bunch of delayed refs with the same sequence number, then
> we hit this case and inc the sequence number, and you have a whole bunch
> of delayed refs with the new sequence number. You still have to
> post-process merge to make sure, but you are far more likely to merge
> everything in real time, since you are only changing the sequence number
> every once in a while instead of for every new delayed ref. Then you
> look up the backrefs for everything that happened up until, but not
> after, your sequence number to see if you had an implied ref for the
> data.
>
> ===== Conclusion =====
>
> I realize this is a lot of information, but I'm not the only one working
> in this area at the moment, so I thought it was important to get it all
> on paper so we're all on the same page. Please let me know if you have
> questions or suggestions. Like I said, I'm on paternity leave, so
> there's going to be a lag time for my responses. Thanks,
>
> Josef