From: Nikolay Borisov <nborisov@suse.com>
To: Josef Bacik <josef@toxicpanda.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com,
linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Subject: Re: [PATCH v2 2/8] btrfs: handle shrink_delalloc pages calculation differently
Date: Wed, 7 Jul 2021 13:50:38 +0300
Message-ID: <adfeec00-538f-8497-e21e-0112d49256ae@suse.com>
In-Reply-To: <670bdb9693668dd0662a3c3db8a954df1aa966e4.1624974951.git.josef@toxicpanda.com>
On 29.06.21 16:59, Josef Bacik wrote:
> We have been hitting some early ENOSPC issues in production with more
> recent kernels, and I tracked it down to us simply not flushing delalloc
> as aggressively as we should be. With tracing I was seeing us fail
> all tickets with all of the block rsvs at or around 0, with very little
> pinned space, but still around 120MiB of outstanding bytes_may_use.
> Upon further investigation I saw that we were flushing around 14 pages
> per shrink call for delalloc, despite having around 2GiB of delalloc
> outstanding.
>
> Consider the example of an 8-way machine, all CPUs trying to create a
> file in parallel, which at the time of this commit requires 5 items to
> do. Assuming a 16k leaf size, we have 10MiB of total metadata reclaim
> size waiting on reservations. Now assume we have 128MiB of delalloc
> outstanding. With our current math we would set items to 20, and then
> set to_reclaim to 20 * 256k, or 5MiB.
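
(For context: the pre-patch math referred to here is the pair of lines
touched in the space-info.c hunk below, i.e.

	items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
	to_reclaim = items * EXTENT_SIZE_PER_ITEM;	/* SZ_256K */

so the number of bytes flushed was derived purely from the size of the
metadata reservation, no matter how much delalloc was actually
outstanding.)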
>
> Assuming that we went through this loop all 3 times, for both
> FLUSH_DELALLOC and FLUSH_DELALLOC_WAIT, and then did the full loop
> twice, we'd only flush 60MiB (5MiB x 2 states x 3 passes x 2 full
> loops) of the 128MiB delalloc space. This could leave a fair bit of
> delalloc reservations still hanging around by the time we go to ENOSPC
> out all the remaining tickets.
>
> Fix this in two ways. First, change the calculations to be a fraction
> of the total delalloc bytes on the system. Prior to this change we
> were calculating based on dirty inodes, so our math made more sense;
> now it's just completely unrelated to what we're actually doing.
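
(For scale: with the 128MiB of delalloc from the example above,
delalloc_bytes >> 3 works out to 16MiB, so each delalloc flush pass now
targets at least 16MiB instead of the 5MiB the old math produced.)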
>
> Second, add a FLUSH_DELALLOC_FULL state, which we hold off on until
> we've gone through the flush states at least once. This will empty the
> system of all delalloc, so we're sure to be truly out of space when we
> start failing tickets.
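
The flush_space() side of the new state isn't visible in the hunks
quoted below (presumably it's in the snipped part of the space-info.c
diff), but conceptually it reuses the existing delalloc flushing with a
"flush everything" target. A rough sketch under that assumption, not
the literal hunk:

	case FLUSH_DELALLOC:
	case FLUSH_DELALLOC_WAIT:
	case FLUSH_DELALLOC_FULL:
		if (state == FLUSH_DELALLOC_FULL)
			num_bytes = U64_MAX;
		shrink_delalloc(fs_info, space_info, num_bytes,
				state != FLUSH_DELALLOC);
		break;

which would line up with the to_reclaim == U64_MAX special case that is
visible in the shrink_delalloc() hunk.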
>
> I'm tagging stable 5.10 and forward, because this is where we started
> using the page stuff heavily again. This affects earlier kernel
> versions as well, but would be a pain to backport to them as the
> flushing mechanisms aren't the same.
>
> CC: stable@vger.kernel.org # 5.10+
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/ctree.h             |  9 +++++----
> fs/btrfs/space-info.c        | 35 ++++++++++++++++++++++++++---------
> include/trace/events/btrfs.h |  1 +
> 3 files changed, 32 insertions(+), 13 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index d7ef4d7d2c1a..232ff1a49ca6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2783,10 +2783,11 @@ enum btrfs_flush_state {
> 	FLUSH_DELAYED_REFS	=	4,
> 	FLUSH_DELALLOC		=	5,
> 	FLUSH_DELALLOC_WAIT	=	6,
> -	ALLOC_CHUNK		=	7,
> -	ALLOC_CHUNK_FORCE	=	8,
> -	RUN_DELAYED_IPUTS	=	9,
> -	COMMIT_TRANS		=	10,
> +	FLUSH_DELALLOC_FULL	=	7,
> +	ALLOC_CHUNK		=	8,
> +	ALLOC_CHUNK_FORCE	=	9,
> +	RUN_DELAYED_IPUTS	=	10,
> +	COMMIT_TRANS		=	11,
> };
>
> int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index af161eb808a2..0c539a94c6d9 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -494,6 +494,9 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
> 	long time_left;
> 	int loops;
>
> +	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
> +	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> +
> 	/* Calc the number of the pages we need flush for space reservation */
> 	if (to_reclaim == U64_MAX) {
> 		items = U64_MAX;
> @@ -501,19 +504,21 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info,
> 		/*
> 		 * to_reclaim is set to however much metadata we need to
> 		 * reclaim, but reclaiming that much data doesn't really track
> -		 * exactly, so increase the amount to reclaim by 2x in order to
> -		 * make sure we're flushing enough delalloc to hopefully reclaim
> -		 * some metadata reservations.
> +		 * exactly. What we really want to do is reclaim full inode's
> +		 * worth of reservations, however that's not available to us
> +		 * here. We will take a fraction of the delalloc bytes for our
> +		 * flushing loops and hope for the best. Delalloc will expand
> +		 * the amount we write to cover an entire dirty extent, which
> +		 * will reclaim the metadata reservation for that range. If
> +		 * it's not enough subsequent flush stages will be more
> +		 * aggressive.
> 		 */
> +		to_reclaim = max(to_reclaim, delalloc_bytes >> 3);
> 		items = calc_reclaim_items_nr(fs_info, to_reclaim) * 2;
> -		to_reclaim = items * EXTENT_SIZE_PER_ITEM;
> 	}
>
> 	trans = (struct btrfs_trans_handle *)current->journal_info;
>
> -	delalloc_bytes = percpu_counter_sum_positive(
> -					&fs_info->delalloc_bytes);
> -	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
> 	if (delalloc_bytes == 0 && ordered_bytes == 0)
> 		return;
nit: That check should also be moved up, alongside the
delalloc_bytes/ordered_bytes readout.
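
I.e. something like:

	delalloc_bytes = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
	ordered_bytes = percpu_counter_sum_positive(&fs_info->ordered_bytes);
	if (delalloc_bytes == 0 && ordered_bytes == 0)
		return;

so we bail out before doing any of the reclaim math at all.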
<snip>