* Snapshot aware defrag and qgroups thoughts
From: Josef Bacik @ 2014-04-21 14:55 UTC
  To: linux-btrfs

We have a big problem, but it involves a lot of moving parts, so I'm going to explain all of the parts, then the problem, and then what I am doing to fix the problem.  I want you guys to check my work to make sure I'm not missing something, so when I come back from paternity leave in a few weeks I can just sit down and finish this work out.

===== Extent refs =====

This is basically how extent refs work.  You have either

key.objectid = bytenr;
key.type = BTRFS_EXTENT_ITEM_KEY;
key.offset = length;

or you have

key.objectid = bytenr;
key.type = BTRFS_METADATA_ITEM_KEY;
key.offset = level of the metadata block;

in the case of skinny metadata.  Then you have the extent item, which describes the number of refs and such, followed by 1 or more inline refs.  All you need to know for this problem is how I'm going to describe them.  What I call a "normal ref" or a "full ref" is a reference that has the actual root information in the ref.  What I call a "shared ref" is one where we only know the tree block that owns the particular ref.  So how does this work in practice?
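
To make the two key shapes concrete, here is a minimal user-space sketch.  The struct mirrors the kernel's struct btrfs_key and the key type values are the on-disk constants; the helper function and its arguments are purely illustrative:

#include <stdint.h>

/* Mirrors the kernel's struct btrfs_key; illustrative, not the
 * packed on-disk layout. */
struct btrfs_key {
	uint64_t objectid;
	uint8_t  type;
	uint64_t offset;
};

#define BTRFS_EXTENT_ITEM_KEY   168
#define BTRFS_METADATA_ITEM_KEY 169

static void example_extent_keys(uint64_t bytenr, uint64_t num_bytes,
				uint64_t level)
{
	/* Regular extent item: offset is the extent length in bytes. */
	struct btrfs_key extent_key = {
		.objectid = bytenr,
		.type     = BTRFS_EXTENT_ITEM_KEY,
		.offset   = num_bytes,
	};

	/* Skinny metadata item: offset is the tree level, not a length. */
	struct btrfs_key skinny_key = {
		.objectid = bytenr,
		.type     = BTRFS_METADATA_ITEM_KEY,
		.offset   = level,
	};

	(void)extent_key;
	(void)skinny_key;
}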

1) Normal allocation - metadata:  We allocate a tree block as we add new items to a tree.  We know that this root owns this tree block, so we create a normal ref with the root objectid in the extent ref.  We also set the owner of the block itself to our objectid.  This is important to keep in mind.

2) Normal allocation - data:  We allocate some data for a given fs tree and we add an extent ref with the root objectid of the tree we are in, the inode number, and the logical offset into the inode.

3) Splitting a data extent:  We write to the middle of an existing extent.  We will split this extent into two BTRFS_EXTENT_DATA_KEY items and then increase the ref count of the original extent by 1.  This means we look up the extent ref for root->objectid, inode number, and the _original_ inode offset.  We don't create another extent ref; this is important to keep in mind.
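
A hedged sketch of that split case.  btrfs_inc_extent_ref() is a real kernel helper, but its exact signature has varied across versions, so treat the arguments here as illustrative; the point is that the new reference is taken against the original (root, inode, offset) triple rather than creating a second extent ref item:

static int split_data_extent_ref(struct btrfs_trans_handle *trans,
				 struct btrfs_root *root,
				 u64 bytenr, u64 num_bytes,
				 u64 ino, u64 orig_offset)
{
	/*
	 * Bump the refcount on the existing ref keyed by the root
	 * objectid, inode number, and ORIGINAL file offset; parent == 0
	 * means this is a normal ref, not a shared one.
	 */
	return btrfs_inc_extent_ref(trans, root, bytenr, num_bytes,
				    0 /* parent */,
				    root->root_key.objectid, ino,
				    orig_offset);
}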

===== btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =====

But Josef, didn't you say there were shared refs?  Why yes I did, but I need to explain it in the context of the people who actually do the dirty work.  We'll start with the easy case.

1) btrfs_copy_root - where snapshots start:  When we make a snapshot we call this function, which allocates a completely new block with a new root objectid and then memcpy's the original root we are snapshotting.  Then we call btrfs_inc_ref on our new buffer, which will walk all items in that buffer and add a new normal ref, for our new root, to each of the blocks it points to.  This only happens at the level directly below the new root; nothing deeper is touched.

2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots:  These guys are responsible for dealing with the particular action we want to take on our given buffer.  So if we are freeing our buffer, we need to drop any refs it has to the blocks it points to.  For level > 0 this means modifying refs for all of the tree blocks it points to.  For level == 0 this means modifying refs for any data extents the leaf may point to.

3) update_ref_for_cow - this is where the magic happens:  This has a few different modes of operation, but every operation starts by checking whether the block is shared, which is to say we check whether we have been snapshotted and, if we have, whether this block has changed since the snapshot.  If it is shared then we look up the extent refs and the flags.  If not then we carry on.  From here we have a few options.

3a) Not shared:  Don't do anything; we can do our normal cow operations and carry on.

3b) Shared and cowing from the owning root:  This is where btrfs_header_owner() is important.  If we own this block and it is shared, then we know that none of the upper levels will have a normal ref to anything underneath this block, so we need to add a shared ref for anything this block points to.  So the first thing we do is btrfs_inc_ref(), but we set the full backref flag.  This means that when we add refs for everything this block points to we don't use a root objectid, we use the bytenr of this block.  Then we set BTRFS_BLOCK_FLAG_FULL_BACKREF in the extent flags for this given block.

3c) Shared and cowing from not the owning root:  If we are cowing down from the snapshot, we need to make sure that any block we own completely ourselves has normal refs for any blocks it points to.  So when we cow down and hit a shared block that we aren't the owner of, we do btrfs_inc_ref() for our block without setting full_backref, so we get a normal ref, with our actual root objectid, on everything this block points to.

3d) Last ref but has FULL_BACKREF set:  This is the last time we cow a block that was converted to a full backref in 3b.  Here we call btrfs_inc_ref() without full_backref set so we get our normal root objectid ref added, and then we call btrfs_dec_ref() with full_backref set so that we remove the shared ref for this block, since it is about to be deleted.
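
Pulling 3a-3d together, a pseudo-C sketch of the decision logic.  btrfs_inc_ref(), btrfs_dec_ref(), btrfs_header_owner(), and BTRFS_BLOCK_FLAG_FULL_BACKREF are real kernel names; block_is_shared(), lookup_extent_flags(), and set_extent_flags() are hypothetical stand-ins, and the real function handles more cases than shown here:

static void update_ref_for_cow_sketch(struct btrfs_trans_handle *trans,
				      struct btrfs_root *root,
				      struct extent_buffer *buf)
{
	u64 refs, flags;

	/* 3a: not shared, plain cow, nothing extra to do. */
	if (!block_is_shared(root, buf))
		return;

	lookup_extent_flags(trans, buf, &refs, &flags);

	if (btrfs_header_owner(buf) == root->root_key.objectid &&
	    !(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF)) {
		/* 3b: we own the shared block; convert its children to
		 * shared (full backref) refs keyed on our bytenr. */
		btrfs_inc_ref(trans, root, buf, 1 /* full_backref */);
		set_extent_flags(trans, buf,
				 flags | BTRFS_BLOCK_FLAG_FULL_BACKREF);
	} else if (refs > 1) {
		/* 3c: cowing from a non-owner (the snapshot side); take
		 * normal refs with our actual root objectid. */
		btrfs_inc_ref(trans, root, buf, 0);
	} else if (flags & BTRFS_BLOCK_FLAG_FULL_BACKREF) {
		/* 3d: last ref on a FULL_BACKREF block; swap the shared
		 * ref for a normal one before the block is deleted. */
		btrfs_inc_ref(trans, root, buf, 0);
		btrfs_dec_ref(trans, root, buf, 1 /* full_backref */);
	}
}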

===== Delayed refs =====

In order to make sure we don't recursively update the extent tree for every modification as we cow down a tree, we have delayed refs.  Every time we modify the ref count of an extent (allocate, inc ref, free) we allocate a delayed ref with the information about that ref.  So every call to btrfs_inc_ref, btrfs_dec_ref, btrfs_allocate_tree_block, btrfs_alloc_reserved_file_extent, and btrfs_free_extent ends up allocating a delayed ref.  The nice thing about this is it can keep us from actually modifying the extent tree.  If we cow a block to free items and then subsequently free that block, we can avoid touching the extent tree altogether.
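
A toy model of why this saves extent tree churn; the structure is hypothetical and much simpler than the kernel's actual delayed ref nodes, but it shows how opposing updates cancel in memory before anything is written out:

#include <stdint.h>

struct toy_delayed_ref {
	uint64_t bytenr;	/* extent this ref modifies */
	int64_t  ref_mod;	/* +1 for an add, -1 for a drop */
};

/*
 * If two updates hit the same extent, fold them together; an add and a
 * drop sum to zero and the extent tree never needs to be touched.
 */
static int toy_merge_refs(struct toy_delayed_ref *a,
			  struct toy_delayed_ref *b)
{
	if (a->bytenr != b->bytenr)
		return 0;		/* different extents, can't merge */
	a->ref_mod += b->ref_mod;
	b->ref_mod = 0;
	return 1;
}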

PROBLEM:  For qgroups we needed a way to have a coherent view of the extent tree at a given time, which is where our sequence numbers come in.  Each delayed ref now gets a sequence number, and a different sequence number means we can't do this in-memory merging right away like we used to.  Instead we now have btrfs_merge_delayed_refs, which works pretty well, but is O(n) in the number of delayed refs we have for a given extent.  Usually this is a small number, but there is one key area where this isn't true, which is snapshot aware defrag.

===== Snapshot aware defrag =====

This has a couple of problems with it, but the problem that relates to this email is that we can have many, many updates for one extent.  Remember how we will add an extent ref every time we split an extent?  Imagine writing a 128MB extent and then overwriting every other block.  That is going to end up with 32768 file extents all pointing to the same extent.  Now snapshot this fs 5 thousand times, and you have 163,840,000 extent refs you are going to be updating when you defrag this range.  How this ends up is that you have that many delayed refs outstanding at the end of your defrag operation, that is if you are lucky enough not to OOM.  This is because of the sequence numbers; we can't merge them until we actually go to run them.  So now you've made it to actually calling btrfs_merge_delayed_refs, and then you wonder why your box just livelocked for 10 seconds for no apparent reason.  Without the sequence numbers we would have merged each of these operations as we added them, so when we finished the defrag operation we'd have 5000 delayed refs, each with a -32768 count, for the freeing of the fragmented extent.
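
A quick sanity check of those numbers, assuming 4K blocks (which the 32768 figure implies):

#include <stdio.h>

int main(void)
{
	long blocks = (128L * 1024 * 1024) / 4096;	/* 32768 blocks */
	long snaps  = 5000;

	/* one ref per file extent item per snapshot */
	printf("refs to update: %ld\n", blocks * snaps);	/* 163840000 */
	return 0;
}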

Now obviously one part of this fix is to do one root at a time so we aren't queueing up N snapshots' worth of delayed refs, but that still leaves the actual number of extents we have, which is why we need to kill/mitigate the pain of the sequence numbers.

===== Qgroups =====

I'm starting over with the qgroup accounting because it simply needs to operate without as much dependence on the sequence numbers.  So here is how it is going to work:

1) Create a delayed-ref-style scheme for quota operations.  This lets us process things in bulk without having to worry about weird locking and such.

2) Only create these quota operations when we modify the actual refs.  Part of the problem with the current work is that it tries to figure out what quota stuff is going to be used based on delayed refs.  These updates could end up being chucked, which will screw up our counts.  So instead, only add a qgroup action when we add the first ref for that root objectid or remove the last ref for that root objectid.  This will mean adding hooks to the actual ref-modifying code in extent-tree.c.  I have this part pretty well nailed down in my qgroup-work branch in btrfs-next.
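
A minimal sketch of that rule, assuming hypothetical helpers refs_for_root() and record_qgroup_op() (this is not the actual branch's code): a qgroup operation is only queued when a root's ref count for the extent crosses between 0 and 1.

#include <stdint.h>

enum qgroup_op { QGROUP_ADD_SHARED, QGROUP_DEL_SHARED };

/* Hypothetical: number of real refs this root holds on bytenr right now. */
extern uint64_t refs_for_root(uint64_t root_objectid, uint64_t bytenr);
/* Hypothetical: queue a qgroup accounting operation for later processing. */
extern void record_qgroup_op(uint64_t root_objectid, uint64_t bytenr,
			     enum qgroup_op op);

static void maybe_record_qgroup_op(uint64_t root_objectid, uint64_t bytenr,
				   int adding)
{
	uint64_t refs = refs_for_root(root_objectid, bytenr);

	if (adding && refs == 1)		/* first ref for this root */
		record_qgroup_op(root_objectid, bytenr, QGROUP_ADD_SHARED);
	else if (!adding && refs == 0)		/* last ref for this root */
		record_qgroup_op(root_objectid, bytenr, QGROUP_DEL_SHARED);
	/* intermediate ref count changes don't change qgroup accounting */
}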

3) Create qgroup operations in update_ref_for_cow.  One of the tricky parts of killing sequence numbers is that we can completely lose extent ref operations that we really do care about.  Say we snapshot, which means we convert all of the referenced space in the original root to non-exclusive and add it to the referenced count of the snapshot.  If we rm -rf that snapshot, we will cow down only to free the leaf, which means that the delayed ref will get dropped.  This will cause us to lose the qgroup action, because we only account for real ref modifications to the tree, and so we won't convert this space back to exclusive space for the original tree.  So what we need to do here is add a qgroup operation saying that we are converting shared space to exclusive space with the cow, and then set a flag on the extent buffer so we know that we have an outstanding convert operation.  When this convert operation runs we simply look the extent buffer up and clear the flag.  If we try to free the tree block before then, we make sure to add a real qgroup operation.
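
In sketch form, assuming a hypothetical buffer flag EXTENT_BUFFER_QGROUP_CONVERT and hypothetical record_* helpers; set_bit/clear_bit/test_and_clear_bit are the usual kernel bitops on the extent buffer's bflags:

static void queue_convert(struct extent_buffer *eb)
{
	/* Account the cow as a shared -> exclusive conversion ... */
	record_qgroup_convert(eb->start);
	/* ... and remember that the convert hasn't run yet. */
	set_bit(EXTENT_BUFFER_QGROUP_CONVERT, &eb->bflags);
}

static void run_convert(struct extent_buffer *eb)
{
	/* The queued convert ran; nothing left outstanding. */
	clear_bit(EXTENT_BUFFER_QGROUP_CONVERT, &eb->bflags);
}

static void free_tree_block(struct extent_buffer *eb)
{
	/* The convert never ran, so the free must be accounted for real. */
	if (test_and_clear_bit(EXTENT_BUFFER_QGROUP_CONVERT, &eb->bflags))
		record_qgroup_op_for_free(eb->start);
}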

4) The tricky part is data, because we have implied refs, that is, refs to data with no real way to know about them other than to look up all of their parents.  We will still need sequence numbers in a way for this case, but used very differently from what we have now.  We only care about knowing about a given point in time previous to one special case.  The special case is when we remove what looks like the last ref for this extent for this root, but there are still outstanding refs on the extent.  In this case we need to look up the backrefs to make sure there isn't an implied ref from our root to this data block.  This is the point at which we increment the sequence number.  So what happens is you have a whole bunch of delayed refs with the same sequence number, then we hit this case and inc the sequence number, and then you have a whole bunch of delayed refs with the new sequence number.  You still have to post-process merge to make sure, but you are far more likely to merge everything in real-time since you are only changing the sequence number every once in a while instead of for every new sequence number.  Then you look up the backrefs for everything that happened up until, but not after, your sequence number to see if you had an implied ref for the data.
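
A sketch of when the sequence number would move, with all helpers hypothetical; the point is that long runs of delayed refs share one sequence number and stay mergeable, and the counter only bumps at the ambiguous drop:

#include <stdint.h>

static uint64_t qgroup_seq;	/* shared by all delayed refs until bumped */

/* Hypothetical: refs this root holds on bytenr (as in the earlier sketch). */
extern uint64_t refs_for_root(uint64_t root_objectid, uint64_t bytenr);
/* Hypothetical: total refs on the extent across all roots. */
extern uint64_t total_refs(uint64_t bytenr);
/* Hypothetical: walk backrefs recorded before seq, looking for an
 * implied ref from this root. */
extern int check_backrefs_before(uint64_t seq, uint64_t root_objectid,
				 uint64_t bytenr);

static void drop_data_ref(uint64_t root_objectid, uint64_t bytenr)
{
	if (refs_for_root(root_objectid, bytenr) == 0 &&
	    total_refs(bytenr) > 0) {
		/*
		 * Looks like our last ref but others remain: there may be
		 * an implied ref, so fence off everything before now.
		 */
		uint64_t seq = ++qgroup_seq;

		check_backrefs_before(seq, root_objectid, bytenr);
	}
	/* otherwise delayed refs keep the current sequence and can merge */
}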

===== Conclusion =====

I realize this is a lot of information, but I'm not the only one working in this area at the moment, so I thought it was important to get it all on paper so we're all on the same page.  Please let me know if you have questions or suggestions.  Like I said, I'm on paternity leave, so there's going to be a lag time for my responses.  Thanks,

Josef

* Re: Snapshot aware defrag and qgroups thoughts
From: Duncan @ 2014-04-21 22:13 UTC
  To: linux-btrfs

Josef Bacik posted on Mon, 21 Apr 2014 07:55:46 -0700 as excerpted:

[Near the bottom, point #4, immediately before the conclusion.]

> You still have to post-process merge to make sure, but you are far more
> likely to merge everything in real-time since you are only changing the
> sequence number every once in a while instead of for every new sequence
> number.

Much of this is above my head, which is OK as I'm not a btrfs dev, but...

...only changing the sequence number... for every new sequence number?

Something's wrong with that. No WONDER the system live-locked! =:^)

My above-my-head guess is that in place of "for every new sequence number" 
there, you meant "for every new delayed ref".

Meanwhile, thanks for putting this all down.  I already said I don't 
understand much of it, but I certainly have a better appreciation for the 
complexities of snapshot-aware defrag.  I hadn't considered what qgroups 
could do to it at all!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: Snapshot aware defrag and qgroups thoughts
From: Alex Lyakas @ 2014-06-19  9:46 UTC
  To: Josef Bacik; +Cc: linux-btrfs

Hi Josef,
Thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to
remember all the call chains.  Maybe somebody will find them useful to
accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing

Thanks,
Alex.


On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik <jbacik@fb.com> wrote:
> [original message quoted in full; snipped]
