* Snapshot aware defrag and qgroups thoughts
From: Josef Bacik @ 2014-04-21 14:55 UTC
To: linux-btrfs

We have a big problem, but it involves a lot of moving parts, so I'm going to explain all of the parts, then the problem, and then what I am doing to fix it.  I want you guys to check my work to make sure I'm not missing something, so when I come back from paternity leave in a few weeks I can just sit down and finish this work out.
===== Extent refs =====

This is basically how extent refs work.  You have either

key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
key.offset = length;

or, in the case of skinny metadata, you have

key.objectid = bytenr;
key.type = BTRFS_METADATA_ITEM_KEY;
key.offset = level of the metadata block;

Then you have the extent item, which describes the number of refs and such, followed by 1 or more inline refs.  All you need to know for this problem is how I'm going to describe them.  What I call a "normal ref" or a "full ref" is a reference that has the actual root information in the ref.  What I call a "shared ref" is one where we only know the tree block that owns the particular ref.  (A sketch of the two forms follows the list below.)  So how does this work in practice?
1) Normal allocation - metadata: We allocate a tree block as we add new items to a tree.  We know that this root owns this tree block, so we create a normal ref with the root objectid in the extent ref.  We also set the owner of the block itself to our objectid.  This is important to keep in mind.

2) Normal allocation - data: We allocate some data for a given fs tree and we add an extent ref with the root objectid of the tree we are in, the inode number, and the logical offset into the inode for this data.
3) Splitting a data extent: We write to the middle of an existing extent.  We will split this extent into two BTRFS_EXTENT_DATA_KEY items and increase the ref count of the original extent by 1.  This means we look up the extent ref for root->objectid, the inode number, and the _original_ inode offset.  We don't create another extent ref; this is important to keep in mind.
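
To make the two flavors concrete, here is a minimal sketch of the data backref payloads, loosely modeled on btrfs_extent_data_ref and btrfs_shared_data_ref in fs/btrfs/ctree.h.  The struct names and exact fields are simplified for illustration; this is not the literal on-disk format:

	/*
	 * Illustrative sketch of the two inline data backref flavors
	 * described above.  Simplified; not the on-disk layout.
	 */
	#include <stdint.h>

	typedef uint64_t u64;
	typedef uint32_t u32;

	/* "Normal"/"full" ref: the ref itself records which root, inode,
	 * and offset point at the extent. */
	struct normal_data_ref_sketch {
		u64 root;	/* objectid of the owning root */
		u64 objectid;	/* inode number */
		u64 offset;	/* the _original_ logical offset in the file */
		u32 count;	/* bumped on splits; no new ref is created */
	};

	/* "Shared" ref: we only know the tree block (by bytenr) that owns
	 * the ref; the root has to be discovered by walking backrefs. */
	struct shared_data_ref_sketch {
		u64 parent;	/* bytenr of the owning tree block */
		u32 count;
	};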
===== btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =====

But Josef, didn't you say there were shared refs?  Why yes I did, but I need to explain it in context of the people who actually do the dirty work.  We'll start with the easy case.
1) btrfs_copy_root - where snapshots start: When we make a snapshot we call this function, which allocates a completely new block with a new root objectid and then memcpy's the original root we are snapshotting.  Then we call btrfs_inc_ref on our new buffer, which will walk all items in that buffer and add a new normal ref, for our new root, to each of the blocks it points to.  This happens only at the level directly below the new root, nothing further down.
2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys are responsible for carrying out the particular action we want to take on a given buffer.  So if we are freeing our buffer, we need to drop any refs it has to the blocks it points to.  For level > 0 this means modifying refs for all of the tree blocks it points to.  For level == 0 it means modifying refs for any data extents the leaf may point to.
3) update_ref_for_cow - this is where the magic happens: This has a few different modes of operation, but every mode starts by checking whether the block is shared, which means checking whether we have been snapshotted and, if we have, whether this block has changed since we snapshotted.  If it is shared then we look up the extent refs and the flags; if not, we carry on.  From here we have a few options (a condensed sketch of this logic follows case 3d below).

3a) Not shared: Don't do anything; we can do our normal cow operations and carry on.
3b) Shared and cowing from the owning root: This is where btrfs_header_owner() is important.  If we own this block and it is shared, then we know that none of the upper levels will have a normal ref to anything underneath this block, so we need to add a shared ref for anything this block points to.  So the first thing we do is btrfs_inc_ref(), but we set the full backref flag.  This means that when we add refs for everything this block points to we don't use a root objectid, we use the bytenr of this block.  Then we set BTRFS_BLOCK_FLAG_FULL_BACKREF in the extent flags for this given block.
3c) Shared and cowing from a root that isn't the owner: If we are cowing down from the snapshot, we need to make sure that any block we own completely ourselves has normal refs for any blocks it points to.  So we cow down and hit a shared block that we aren't the owner of, so we do btrfs_inc_ref() for our block and don't set full_backref, so we get a normal ref, with our actual root objectid, on everything this block points to.
3d) Last ref but FULL_BACKREF is set: This is the last time we cow a block that was converted to a full backref in 3b.  Here we call btrfs_inc_ref() without full_backref set, so our normal root objectid ref gets added, and then we call btrfs_dec_ref() with full_backref set, so that we remove the shared ref for this block since it is about to be deleted.
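
Condensing cases 3a-3d, the decision logic looks roughly like the sketch below.  The stub helpers and the buf_info struct are placeholders I've invented for illustration; only the branch structure reflects the description above, not the real update_ref_for_cow() in ctree.c:

	/*
	 * Condensed sketch of the update_ref_for_cow() decision tree
	 * (cases 3a-3d).  The stub_ helpers stand in for the real
	 * kernel machinery; only the branching is meant literally.
	 */
	#include <stdint.h>
	typedef uint64_t u64;

	#define BTRFS_BLOCK_FLAG_FULL_BACKREF	(1ULL << 8)

	struct buf_info {
		u64 owner;	/* btrfs_header_owner() of the block */
		u64 refs;	/* ref count from the extent item */
		u64 flags;	/* extent flags */
		int shared;	/* snapshotted and unchanged since? */
	};

	static void stub_inc_ref(struct buf_info *b, int full) { (void)b; (void)full; }
	static void stub_dec_ref(struct buf_info *b, int full) { (void)b; (void)full; }

	static void update_ref_for_cow_sketch(struct buf_info *buf, u64 root_objectid)
	{
		if (!buf->shared)
			return;				/* 3a: plain cow */

		if (buf->refs > 1) {
			if (buf->owner == root_objectid &&
			    !(buf->flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
				/* 3b: owner cowing a shared block: children
				 * get shared (bytenr-based) refs, and the
				 * block gets flagged. */
				stub_inc_ref(buf, 1);
				buf->flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
			} else if (buf->owner != root_objectid) {
				/* 3c: non-owner cowing down from a snapshot:
				 * take normal refs with our own objectid. */
				stub_inc_ref(buf, 0);
			}
		} else if (buf->flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
			/* 3d: last ref on a converted block: add our normal
			 * ref, drop the shared one before the block dies. */
			stub_inc_ref(buf, 0);
			stub_dec_ref(buf, 1);
		}
	}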
===== Delayed refs =====

In order to make sure we don't recursively update the extent tree for every modification as we cow down a tree, we have delayed refs.  Every time we modify the ref count of an extent (allocate, inc ref, free) we allocate a delayed ref with the information about that ref.  So every call to btrfs_inc_ref, btrfs_dec_ref, btrfs_alloc_tree_block, btrfs_alloc_reserved_file_extent, and btrfs_free_extent ends up allocating a delayed ref.  The nice thing about this is that it can keep us from actually modifying the extent tree.  If we cow a block to free items and then subsequently free that block, we can avoid touching the extent tree altogether.
PROBLEM: For qgroups we needed a way to have a coherent view of the extent tree at a given time, which is where our sequence numbers come in.  Each delayed ref now gets a sequence number, and a difference in sequence numbers means we can't do the in-memory merging right away like we used to.  Instead we now have btrfs_merge_delayed_refs, which works pretty well, but is O(n) in the number of delayed refs we have for a given extent.  Usually this is a small number, but there is one key area where this isn't true, which is snapshot aware defrag.
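
To see why the sequence numbers defeat the cheap merge, here is a toy version of the merge rule.  The field names loosely echo struct btrfs_delayed_ref_node, but none of this is the actual kernel implementation:

	/*
	 * Toy illustration of delayed ref merging; not kernel code.
	 */
	#include <stdint.h>
	typedef uint64_t u64;

	struct toy_delayed_ref {
		u64 bytenr;	/* extent being modified */
		u64 seq;	/* qgroup sequence number */
		int ref_mod;	/* +N for added refs, -N for dropped refs */
	};

	/* Without sequence numbers, a +1 and a -1 on the same extent can
	 * be folded at insert time and never touch the extent tree.  With
	 * them, refs with different seq values must be kept apart until
	 * run time, when btrfs_merge_delayed_refs does an O(n) pass over
	 * everything queued for the extent. */
	static int can_merge(const struct toy_delayed_ref *a,
			     const struct toy_delayed_ref *b)
	{
		return a->bytenr == b->bytenr && a->seq == b->seq;
	}

	static void merge(struct toy_delayed_ref *into,
			  const struct toy_delayed_ref *victim)
	{
		into->ref_mod += victim->ref_mod;	/* +1 and -1 cancel */
	}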
===== Snapshot aware defrag =====

This has a couple of problems, but the one that relates to this email is that we can have many, many updates for one extent.  Remember how we bump the extent's ref count every time we split an extent?  Imagine writing a 128mb extent and then overwriting every other block.  That is going to end up with 32768 file extents all pointing to the same extent.  Now snapshot this fs 5 thousand times, and you have 163,840,000 extents you are going to be updating when you defrag this range.  How this ends up is that you have that many delayed refs outstanding at the end of your defrag operation, that is, if you are lucky enough to not OOM.  This is because of the sequence numbers: we can't merge the refs until we actually go to run them.  So now you've made it to actually calling btrfs_merge_delayed_refs, and then you wonder why your box just livelocked for 10 seconds for no apparent reason.  Without the sequence numbers we would have merged each of these operations as we added them, so when we finished the defrag operation we'd have 5000 delayed refs, each with a -32768 count, for the freeing of the fragmented extent.
Now obviously one part of this fix is to do one root at a time so we aren't queueing up N snapshots' worth of delayed refs, but that still leaves the actual number of extents we have, which is why we need to kill or mitigate the pain of the sequence numbers.
===== Qgroups =====

I'm starting over with the qgroup accounting because it simply needs to operate without as much dependence on the sequence numbers.  So here is how it is going to work (a sketch of the operation scheme follows point 4 below).
1) Create a delayed-ref-like scheme for quota operations.  This lets us process things in bulk without having to worry about weird locking and such.

2) Only create these quota operations when we modify the actual refs.  Part of the problem with the current code is that it tries to figure out what quota accounting will be needed based on delayed refs.  Those updates could end up being chucked, which will screw up our counts.  So instead, only add a qgroup action when we add the first ref for a given root objectid or remove the last ref for that root objectid.  This will mean adding hooks to the actual ref-modifying code in extent-tree.c.  I have this part pretty well nailed down in my qgroup-work branch in btrfs-next.
3) Create qgroup operations in update_ref_for_cow.  One of the tricky parts of killing sequence numbers is that we can completely lose extent ref operations that we really do care about.  Say we snapshot, which means we convert all of the referenced space in the original root to not-exclusive and add it to the referenced count of the snapshot.  If we rm -rf that snapshot, we will cow down only to free the leaf, which means that the delayed ref will get dropped.  This will cause us to lose the qgroup action, because we only account for real ref modifications to the tree, and so we won't convert this space back to exclusive space for the original tree.  So what we need to do here is add a qgroup operation saying that we are converting shared space to exclusive space with the cow, and then set a flag on the extent buffer so we know that we have an outstanding convert operation.  When this convert operation runs we simply look the extent buffer up and clear the flag.  If we try to free the tree block before then, we make sure to add a real qgroup operation.
4) The tricky part is data, because we have implied refs, that is, refs to data with no real way to know about them other than to look up all of the extent's parents.  We will still need sequence numbers in a way for this case, but used very differently than what we have now.  We only care about knowing a given point in time previous to this special case.  The special case is when we remove what looks like the last ref for this extent for this root but there are still outstanding refs on the extent.  In this case we need to look up the backrefs to make sure there isn't an implied ref from our root to this data block.  This is the point at which we increment the sequence number.  So what happens is you have a whole bunch of delayed refs with the same sequence number, then we hit this case and inc the sequence number, and then you have a whole bunch of delayed refs with the new sequence number.  You still have to post-process merge to make sure, but you are far more likely to merge everything in real-time since you are only changing the sequence number every once and a while instead of for every new sequence number.  Then you look up the backrefs for everything that happened up until but not after your sequence number to see if you had an implied ref for the data.
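
For reference, here is a minimal sketch of what the quota-operation scheme from points 1-3 might look like.  Everything here (names, fields, the queue shape) is an assumption for illustration; the real thing lives in the qgroup-work branch mentioned above:

	/*
	 * Illustrative sketch of a delayed-ref-like quota operation
	 * queue (points 1-3).  All names and fields are assumptions
	 * for illustration, not the actual code.
	 */
	#include <stdint.h>
	#include <stddef.h>
	typedef uint64_t u64;

	enum toy_qgroup_op_type {
		TOY_QGROUP_ADD_REF,	/* first ref added for a root objectid */
		TOY_QGROUP_DEL_REF,	/* last ref removed for a root objectid */
		TOY_QGROUP_CONVERT,	/* shared -> exclusive via cow (point 3) */
	};

	struct toy_qgroup_op {
		u64 ref_root;		/* root the accounting applies to */
		u64 bytenr;		/* extent being accounted */
		u64 num_bytes;
		enum toy_qgroup_op_type type;
		struct toy_qgroup_op *next;
	};

	/* Queued only from the real ref-modifying paths (first ref /
	 * last ref) and from update_ref_for_cow for the shared-to-
	 * exclusive conversion, then processed in bulk, instead of
	 * being guessed from delayed refs that might get chucked. */
	static void toy_queue_op(struct toy_qgroup_op **queue,
				 struct toy_qgroup_op *op)
	{
		op->next = *queue;
		*queue = op;
	}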
===== Conclusion =====

I realize this is a lot of information, but I'm not the only one working in this area at the moment, so I thought it was important to get it all on paper so we're all on the same page.  Please let me know if you have questions or suggestions.  Like I said, I'm on paternity leave, so there's going to be a lag time for my responses.  Thanks,

Josef

* Re: Snapshot aware defrag and qgroups thoughts

From: Duncan @ 2014-04-21 22:13 UTC
To: linux-btrfs

Josef Bacik posted on Mon, 21 Apr 2014 07:55:46 -0700 as excerpted:
[Near the bottom, point #4 immediately before conclusion.]
> You still have to post-process merge to make sure, but you are far more
> likely to merge everything in real-time since you are only changing the
> sequence number every once and a while instead of for every new sequence
> number.
Much of this is above my head, which is OK as I'm not a btrfs dev, but...
...only changing the sequence number... for every new sequence number?
Something's wrong with that. No WONDER the system live-locked! =:^)
My above-my-head guess is that in place of "for every new sequence number"
there, you meant "for every delayed ref".
Meanwhile, thanks for putting this all down. I already said I don't
understand much of it, but I certainly have a better appreciation for the
complexities of snapshot-aware-defrag. I hadn't considered what qgroups
could do to it at all!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

* Re: Snapshot aware defrag and qgroups thoughts

From: Alex Lyakas @ 2014-06-19 9:46 UTC
To: Josef Bacik; +Cc: linux-btrfs

Hi Josef,
thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to
remember all the call chains. Maybe somebody finds that useful to
accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing
Thanks,
Alex.
On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik <jbacik@fb.com> wrote:
> [snip]