public inbox for linux-btrfs@vger.kernel.org
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Josef Bacik <josef@toxicpanda.com>,
	linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 2/3] btrfs: add a comment describing delalloc space reservation
Date: Tue, 4 Feb 2020 17:48:55 +0800	[thread overview]
Message-ID: <7000f8a2-4d78-d9a1-2e3f-143b88ace1eb@gmx.com> (raw)
In-Reply-To: <20200203204436.517473-3-josef@toxicpanda.com>



On 2020/2/4 4:44 AM, Josef Bacik wrote:
> delalloc space reservation is tricky because it encompasses both data
> and metadata.  Make it clear what each side does, the general flow of
> how space is moved throughout the lifetime of a write, and what goes
> into the calculations.

In fact, a description of the lifespan of a write would be super helpful
for newbies.

I still remember the pain when trying to understand the whole mechanism
years ago.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
>  fs/btrfs/delalloc-space.c | 90 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/fs/btrfs/delalloc-space.c b/fs/btrfs/delalloc-space.c
> index c13d8609cc99..09a9c01fc1b5 100644
> --- a/fs/btrfs/delalloc-space.c
> +++ b/fs/btrfs/delalloc-space.c
> @@ -9,6 +9,96 @@
>  #include "qgroup.h"
>  #include "block-group.h"
>  
> +/*
> + * HOW DOES THIS WORK
> + *
> + * There are two stages to delalloc space reservations: one for data, and one
> + * for metadata to handle the new extents and checksums generated by writing data.
> + *
> + *
> + * DATA RESERVATION
> + *   The data reservation stuff is relatively straightforward.  We want X bytes,
> + *   and thus need to make sure we have X bytes free in data space in order to
> + *   write that data.  If there are not X bytes free, allocate data chunks until
> + *   we can satisfy that reservation.  If we can no longer allocate data chunks,
> + *   attempt to flush space to see if we can now make the reservation.  See the
> + *   comment for data_flush_states to see how that flushing is accomplished.

What about a more concise version?
We want X bytes of data space.

If there is not enough, try the following methods in order:
- Allocate new data chunk
- Flush space
  See comment for data_flush_states

> + *
> + *   Once this space is reserved, it is added to space_info->bytes_may_use.  The
> + *   caller must keep track of this reservation and free it up if it is never
> + *   used.  With the buffered IO case this is handled via the EXTENT_DELALLOC
> + *   bits on the inode's io_tree.  For direct IO it's more straightforward: we
> + *   take the reservation at the start of the operation, and if we write less
> + *   than we reserved we free the excess.

This part involves the lifespan and state machine of data.
I guess more explanation on the state machine would help a lot.

Like:
Page clean
|
+- btrfs_buffered_write()
|  Reserve data space for data, metadata space for csum/file
|  extents/inodes.
|
Page dirty
|
+- run_delalloc_range()
|  Allocate data extents, submit ordered extents to do csum calculation
|  and bio submission
|
Page write back
|
+- finish_ordered_io()
|  Insert csum and file extents
|
Page clean

Although I'm not sure whether such a lifespan description belongs in
delalloc-space.c.

> + *
> + *   For the buffered case our reservation will take one of two paths
> + *
> + *   1) It is allocated.  In find_free_extent() we will call
> + *   btrfs_add_reserved_bytes() with the size of the extent we made, along with
> + *   the size that we are covering with this allocation.  For non-compressed
> + *   these will be the same thing, but for compressed they could be different.
> + *   In any case, we increase space_info->bytes_reserved by the extent size, and
> + *   reduce the space_info->bytes_may_use by the ram_bytes size.  From now on
> + *   the handling of this reserved space is the responsibility of the ordered
> + *   extent or the cow path.
> + *
> + *   2) There is an error, and we free it.  This is handled with the
> + *   EXTENT_CLEAR_DATA_RESV bit when clearing EXTENT_DELALLOC on the inode's
> + *   io_tree.
> + *
> + * METADATA RESERVATION
> + *   The general metadata reservation lifetimes are discussed elsewhere, this
> + *   will just focus on how it is used for delalloc space.
> + *
> + *   There are 3 things we are keeping reservations for.

It looks like the 3 things are a bit too detailed, I guess?
It's definitely educational, but I'm not sure it fits the introductory
nature of such a comment.

I guess we only need to mention:
- Objective
  What this meta rsv is used for (inode item, file extents, csum)

- Location of interest
  Important details. (outstanding extents and DELALLOC bits for metadata
  rsv calculation)

- Timing of such rsv
  When we reserve/update, use, and release it. (function entry points)

Then it should be enough for people to dig for their own interest.

Thanks,
Qu

> + *
> + *   1) Updating the inode item.  We hold a reservation for this inode as long
> + *   as there are dirty bytes outstanding for this inode.  This is because we
> + *   may update the inode multiple times throughout an operation, and there is
> + *   no telling when we may have to do a full cow back to that inode item.  Thus
> + *   we must always hold a reservation.
> + *
> + *   2) Adding an extent item.  This is trickier, so a few sub-points:
> + *
> + *     a) We keep track of how many extents an inode may need to create in
> + *     inode->outstanding_extents.  This is how many items we will have reserved
> + *     for the extents for this inode.
> + *
> + *     b) count_max_extents() is used to figure out how many extent items we
> + *     will need based on the contiguous area we have dirtied.  Thus if we are
> + *     writing 4k extents but they coalesce into a very large extent, we will
> + *     break this into smaller extents which means we'll need a reservation for
> + *     each of those extents.
> + *
> + *     c) When we set EXTENT_DELALLOC on the inode io_tree we will figure out
> + *     the number of extents needed for the contiguous area we just created,
> + *     and add that to inode->outstanding_extents.
> + *
> + *     d) We have no idea at reservation time how this new extent fits into
> + *     existing extents.  We unconditionally use count_max_extents() on the
> + *     reservation we are currently doing.  The reservation _must_ use
> + *     btrfs_delalloc_release_extents() once it has done its work to clear up
> + *     these outstanding extents.  This means that we will transiently have
> + *     more extent reservations for this inode than we need.  For example, say we
> + *     have a clean inode, and we do a buffered write of 4k.  The reservation
> + *     code will mod outstanding_extents to 1, and then set_delalloc will
> + *     increase it to 2.  Then once we are finished,
> + *     btrfs_delalloc_release_extents() will drop it back down to 1 again.
> + *
> + *     e) Ordered extents take on the responsibility of their extent.  We know
> + *     that the ordered extent represents a single inode item, so it will modify
> + *     ->outstanding_extents by 1, and will clear delalloc which will adjust the
> + *     ->outstanding_extents by whatever value it needs to be adjusted to.  Once
> + *     the ordered io is finished we drop the ->outstanding_extents by 1 and if
> + *     we are 0 we drop our inode item reservation as well.
> + *
> + *   3) Adding csums for the range.  This is more straightforward than the
> + *   extent items, as we just want to hold the number of bytes we'll need for
> + *   checksums until the ordered extent is removed.  If there is an error it is
> + *   cleared via the EXTENT_CLEAR_META_RESV bit when clearing EXTENT_DELALLOC
> + *   on the inode io_tree.
> + */
> +
>  int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes)
>  {
>  	struct btrfs_root *root = inode->root;
> 


