All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: fdmanana@kernel.org
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 21/21] btrfs: always reserve space for delayed refs when starting transaction
Date: Fri, 8 Sep 2023 11:32:05 -0400	[thread overview]
Message-ID: <20230908153205.GV1977092@perftesting> (raw)
In-Reply-To: <c243a7580fbaed93db3d3c3911b0a217199580ef.1694174371.git.fdmanana@suse.com>

On Fri, Sep 08, 2023 at 01:09:23PM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> When starting a transaction (or joining an existing one with
> btrfs_start_transaction()), we reserve space for the number of items we
> want to insert in a btree, but we don't do it for the delayed refs we
> will generate while using the transaction to modify (COW) extent buffers
> in a btree or allocate new extent buffers. Basically how it works:
> 
> 1) When we start a transaction we reserve space for the number of items
>    the caller wants to be inserted/modified/deleted in a btree. This space
>    goes to the transaction block reserve;
> 
> 2) If the delayed refs block reserve is not full, its size is greater
>    than the amount of its reserved space, and the flush method is
>    BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
>    it corresponding to the number of items the caller wants to
>    insert/modify/delete in a btree;
> 
> 3) The size of the delayed refs block reserve is increased when a task
>    creates delayed refs after COWing an extent buffer, allocating a new
>    one or deleting (freeing) an extent buffer. This happens after the
>    the task started or joined a transaction, whenever it calls
>    btrfs_update_delayed_refs_rsv();
> 
> 4) The delayed refs block reserve is then refilled by anyone calling
>    btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
>    operations or when someone else calls btrfs_start_transaction() with
>    a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;
> 
> 5) As a task COWs or allocates extent buffers, it consumes space from the
>    transaction block reserve. When the task releases its transaction
>    handle (btrfs_end_transaction()) or it attempts to commit the
>    transaction, it releases any remaining space in the transaction block
>    reserve that it did not use, as not all space may have been used (due
>    to pessimistic space calculation) by calling btrfs_block_rsv_release()
>    which will try to add that unused space to the delayed refs block
>    reserve (if its current size is greater than its reserved space).
>    That transferred space may not be enough to completely fulfill the
>    delayed refs block reserve.
> 
>    Plus we have some tasks that will attempt do modify as many leaves
>    as they can before getting -ENOSPC (and then reserving more space and
>    retrying), such as hole punching and extent cloning which call
>    btrfs_replace_file_extents(). Such tasks can generate therefore a
>    high number of delayed refs, for both metadata and data (we can't
>    know in advance how many file extent items we will find in a range
>    and therefore how many delayed refs for dropping references on data
>    extents we will generate);
> 
> 6) If a transaction starts its commit before the delayeds refs block
>    reserve is refilled, for example by the transaction kthread or by
>    someone who called btrfs_join_transaction() before starting the
>    commit, then when running delayed references if we don't have enough
>    reserved space in the delayed refs block reserve, we will consume
>    space from the global block reserve.
> 
> Now this doesn't make a lot of sense because:
> 
> 1) We should reserve space for delayed references when starting the
>    transaction, since we have no guarantees the delayed refs block
>    reserve will be refilled;
> 
> 2) If no refill happens then we will consume from the global block reserve
>    when running delayed refs during the transaction commit;
> 
> 3) If we have a bunch of tasks calling btrfs_start_transaction() with a
>    number of items greater than zero and at the time the delayed refs
>    reserve is full, then we don't reserve any space at
>    btrfs_start_transaction() for the delayed refs that will be generated
>    by a task, and we can therefore end up using a lot of space from the
>    global reserve when running the delayed refs during a transaction
>    commit;
> 
> 4) There are also other operations that result in bumping the size of the
>    delayed refs reserve, such as creating and deleting block groups, as
>    well as the need to update a block group item because we allocated or
>    freed an extent from the respective block group;
> 
> 5) If we have a significant gap betweent the delayed refs reserve's size
>    and its reserved space, two very bad things may happen:
> 
>    1) The reserved space of the global reserve may not be enough and we
>       fail the transaction commit with -ENOSPC when running delayed refs;
> 
>    2) If the available space in the global reserve is enough it may result
>       in nearly exhausting it. If the fs has no more unallocated device
>       space for allocating a new block group and all the available space
>       in existing metadata block groups is not far from the global
>       reserve's size before we started the transaction commit, we may end
>       up in a situation where after the transaction commit we have too
>       little available metadata space, and any future transaction commit
>       will fail with -ENOSPC, because although we were able to reserve
>       space to start the transaction, we were not able to commit it, as
>       running delayed refs generates some more delayed refs (to update the
>       extent tree for example) - this includes not even being able to
>       commit a transaction that was started with the goal of unlinking a
>       file, removing an empty data block group or doing reclaim/balance,
>       so there's no way to release metadata space.
> 
>       In the worst case the next time we mount the filesystem we may
>       also fail with -ENOSPC due to failure to commit a transaction to
>       cleanup orphan inodes. This later case was reported and hit by
>       someone running a SLE (SUSE Linux Enterprise) distribution for
>       example - where the fs had no more unallocated space that could be
>       used to allocate a new metadata block group, and the available
>       metadata space was about 1.5M, not enough to commit a transaction
>       to cleanup an orphan inode (or do relocation of data block groups
>       that were far from being full).
> 
> So improve on this situation by always reserving space for delayed refs
> when calling start_transaction(), and if the flush method is
> BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayeds refs block
> reserve if it's not full. The space reserved for the delayed refs is added
> to a local block reserve that is part of the transaction handle, and when
> a task updates the delayed refs block reserve size, after creating a
> delayed ref, the space is transferred from that local reserve to the
> global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
> local reserve does not have enough space, which may happen for tasks
> that generate a variable and potentially large number of delayed refs
> (such as the hole punching and extent cloning cases mentioned before),
> we transfer any available space and then rely on the current behaviour
> of hoping some other task refills the delayed refs reserve or fallback
> to the global block reserve.
> 
> Signed-off-by: Filipe Manana <fdmanana@suse.com>

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

Thanks,

Josef

  reply	other threads:[~2023-09-08 15:32 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-08 12:09 [PATCH 00/21] btrfs: updates to delayed refs accounting and space reservation fdmanana
2023-09-08 12:09 ` [PATCH 01/21] btrfs: fix race when refilling delayed refs block reserve fdmanana
2023-09-08 14:46   ` Josef Bacik
2023-09-08 17:21     ` Filipe Manana
2023-09-08 12:09 ` [PATCH 02/21] btrfs: prevent transaction block reserve underflow when starting transaction fdmanana
2023-09-08 14:57   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 03/21] btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes() fdmanana
2023-09-08 14:58   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 04/21] btrfs: remove unnecessary logic when running new delayed references fdmanana
2023-09-08 14:59   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 05/21] btrfs: remove the refcount warning/check at btrfs_put_delayed_ref() fdmanana
2023-09-08 15:00   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 06/21] btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 fdmanana
2023-09-08 15:01   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 07/21] btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() fdmanana
2023-09-08 15:02   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 08/21] btrfs: remove refs_to_add argument " fdmanana
2023-09-08 15:02   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 09/21] btrfs: remove refs_to_drop argument from __btrfs_free_extent() fdmanana
2023-09-08 15:03   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 10/21] btrfs: initialize key where it's used when running delayed data ref fdmanana
2023-09-08 15:04   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 11/21] btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref() fdmanana
2023-09-08 15:07   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 12/21] btrfs: log message if extent item not found when running delayed extent op fdmanana
2023-09-08 15:08   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 13/21] btrfs: use a single variable for return value at run_delayed_extent_op() fdmanana
2023-09-08 15:09   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 14/21] btrfs: use a single variable for return value at lookup_inline_extent_backref() fdmanana
2023-09-08 15:10   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 15/21] btrfs: return -EUCLEAN if extent item is missing when searching inline backref fdmanana
2023-09-08 15:10   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 16/21] btrfs: simplify check for extent item overrun at lookup_inline_extent_backref() fdmanana
2023-09-08 15:11   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 17/21] btrfs: allow to run delayed refs by bytes to be released instead of count fdmanana
2023-09-08 15:15   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 18/21] btrfs: reserve space for delayed refs on a per ref basis fdmanana
2023-09-08 15:23   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 19/21] btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release() fdmanana
2023-09-08 15:16   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 20/21] btrfs: stop doing excessive space reservation for csum deletion fdmanana
2023-09-08 15:26   ` Josef Bacik
2023-09-08 12:09 ` [PATCH 21/21] btrfs: always reserve space for delayed refs when starting transaction fdmanana
2023-09-08 15:32   ` Josef Bacik [this message]
2023-09-08 17:20 ` [PATCH v2 00/21] btrfs: updates to delayed refs accounting and space reservation fdmanana
2023-09-08 17:20   ` [PATCH v2 01/21] btrfs: fix race when refilling delayed refs block reserve fdmanana
2023-09-08 17:20   ` [PATCH v2 02/21] btrfs: prevent transaction block reserve underflow when starting transaction fdmanana
2023-09-08 17:20   ` [PATCH v2 03/21] btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes() fdmanana
2023-09-08 17:20   ` [PATCH v2 04/21] btrfs: remove unnecessary logic when running new delayed references fdmanana
2023-09-08 17:20   ` [PATCH v2 05/21] btrfs: remove the refcount warning/check at btrfs_put_delayed_ref() fdmanana
2023-09-08 17:20   ` [PATCH v2 06/21] btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1 fdmanana
2023-09-08 17:20   ` [PATCH v2 07/21] btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref() fdmanana
2023-09-08 17:20   ` [PATCH v2 08/21] btrfs: remove refs_to_add argument " fdmanana
2023-09-08 17:20   ` [PATCH v2 09/21] btrfs: remove refs_to_drop argument from __btrfs_free_extent() fdmanana
2023-09-08 17:20   ` [PATCH v2 10/21] btrfs: initialize key where it's used when running delayed data ref fdmanana
2023-09-08 17:20   ` [PATCH v2 11/21] btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref() fdmanana
2023-09-08 17:20   ` [PATCH v2 12/21] btrfs: log message if extent item not found when running delayed extent op fdmanana
2023-09-08 17:20   ` [PATCH v2 13/21] btrfs: use a single variable for return value at run_delayed_extent_op() fdmanana
2023-09-08 17:20   ` [PATCH v2 14/21] btrfs: use a single variable for return value at lookup_inline_extent_backref() fdmanana
2023-09-08 17:20   ` [PATCH v2 15/21] btrfs: return -EUCLEAN if extent item is missing when searching inline backref fdmanana
2023-09-08 17:20   ` [PATCH v2 16/21] btrfs: simplify check for extent item overrun at lookup_inline_extent_backref() fdmanana
2023-09-08 17:20   ` [PATCH v2 17/21] btrfs: allow to run delayed refs by bytes to be released instead of count fdmanana
2023-09-08 17:20   ` [PATCH v2 18/21] btrfs: reserve space for delayed refs on a per ref basis fdmanana
2023-09-08 17:20   ` [PATCH v2 19/21] btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release() fdmanana
2023-09-08 17:20   ` [PATCH v2 20/21] btrfs: stop doing excessive space reservation for csum deletion fdmanana
2023-09-08 17:20   ` [PATCH v2 21/21] btrfs: always reserve space for delayed refs when starting transaction fdmanana
2026-02-03 17:42     ` Alex Lyakas
2026-02-03 18:03       ` Filipe Manana
2023-09-11 17:20   ` [PATCH v2 00/21] btrfs: updates to delayed refs accounting and space reservation David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230908153205.GV1977092@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.