public inbox for linux-xfs@vger.kernel.org
From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: set aside allocation btree blocks from block reservation
Date: Thu, 18 Feb 2021 08:25:20 -0500	[thread overview]
Message-ID: <20210218132520.GD685651@bfoster> (raw)
In-Reply-To: <20210218003451.GC4662@dread.disaster.area>

On Thu, Feb 18, 2021 at 11:34:51AM +1100, Dave Chinner wrote:
> On Wed, Feb 17, 2021 at 08:23:39AM -0500, Brian Foster wrote:
> > The blocks used for allocation btrees (bnobt and cntbt) are
> > technically considered free space. This is because as free space is
> > used, allocbt blocks are removed and naturally become available for
> > traditional allocation. However, this means that a significant
> > portion of free space may consist of in-use btree blocks if free
> > space is severely fragmented.
> > 
> > On large filesystems with large perag reservations, this can lead to
> > a rare but nasty condition where a significant amount of physical
> > free space is available, but the majority of actual usable blocks
> > consist of in-use allocbt blocks. We have a record of a (~12TB, 32
> > AG) filesystem with multiple AGs in a state with ~2.5GB or so free
> > blocks tracked across ~300 total allocbt blocks, but effectively at
> > 100% full because the free space is entirely consumed by
> > refcountbt perag reservation.
> > 
> > Such a large perag reservation is by design on large filesystems.
> > The problem is that because the free space is so fragmented, this AG
> > contributes the 300 or so allocbt blocks to the global counters as
> > free space. If this pattern repeats across enough AGs, the
> > filesystem lands in a state where global block reservation can
> > outrun physical block availability. For example, a streaming
> > buffered write on the affected filesystem continues to allow delayed
> > allocation beyond the point where writeback starts to fail due to
> > physical block allocation failures. The expected behavior is for the
> > delalloc block reservation to fail gracefully with -ENOSPC before
> > physical block allocation failure is a possibility.
> 
> *nod*
> 
> > To address this problem, introduce a percpu counter to track the sum
> > of the allocbt block counters already tracked in the AGF. Use the
> > new counter to set these blocks aside at reservation time and thus
> > ensure they cannot be allocated until truly available. Since this is
> > only necessary when large reflink perag reservations are in place
> > and the counter requires a read of each AGF to fully populate, only
> > enforce on reflink enabled filesystems. This allows initialization
> > of the counter at ->pagf_init time because the refcountbt perag
> > reservation init code reads each AGF at mount time.
> 
> Ok, so the mechanism sounds ok, but a per-cpu counter seems like
> premature optimisation. How often are we really updating btree block
> counts? An atomic counter is good for at least a million updates a
> second across a 2 socket 32p machine, and I highly doubt we're
> incrementing/decrementing btree block counts that often on such a
> machine. 
> 
> While per-cpu counters have a fast write side, they come with
> additional algorithmic complexity. Hence if the update rate of the
> counter is not fast enough to need per-cpu counters, we should avoid
> them. Just because other free space counters use per-cpu counters,
> it doesn't mean everything in that path needs to use them...
> 

The use of the percpu counter was more for the read side than the write
side. I think of it more as an abstraction that avoids having to open
code and define a new spin lock just for this. I actually waffled a bit
on just setting a batch count of 0 to get roughly equivalent behavior,
but didn't think it would make much difference.
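The batch semantics under discussion can be modeled in user space: each CPU accumulates a local delta and only folds it into the shared count once the delta reaches the batch size, so a batch of 0 folds on every update, which is roughly the atomic-counter behavior mentioned above. A toy single-threaded sketch (not the kernel's percpu_counter implementation; the real one protects the fold with a spinlock):

```c
#include <stdint.h>

#define NR_CPUS 4

/* Toy model of a batched per-CPU counter. Per-CPU deltas are folded
 * into the shared count once they reach the batch threshold; in the
 * kernel the fold takes the counter's internal spinlock. */
struct pcp_counter {
	int64_t count;		/* shared count (lock-protected in kernel) */
	int32_t batch;		/* fold threshold */
	int32_t pcp[NR_CPUS];	/* per-CPU deltas, lock-free fast path */
};

static void pcp_add(struct pcp_counter *c, int cpu, int32_t amount)
{
	int32_t d = c->pcp[cpu] + amount;

	if (d >= c->batch || d <= -c->batch) {
		c->count += d;	/* slow path: fold into shared count */
		c->pcp[cpu] = 0;
	} else {
		c->pcp[cpu] = d;/* fast path: stays CPU-local */
	}
}

/* Approximate read of the shared count only; may be stale by up to
 * NR_CPUS * (batch - 1), analogous to percpu_counter_read(). */
static int64_t pcp_read(const struct pcp_counter *c)
{
	return c->count;
}
```

With a small batch (e.g. 4) the shared count lags each CPU by at most 3 blocks, which is why the reservation path never needs to lock and sum; with batch 0 every update hits the shared count.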

> > Note that the counter uses a small percpu batch size to allow the
> > allocation paths to keep the primary count accurate enough that the
> > reservation path doesn't ever need to lock and sum the counter.
> > Absolute accuracy is not required here, just that the counter
> > reflects the majority of unavailable blocks so the reservation path
> > fails first.
> 
> And this makes the per-cpu counter scale almost no better than an
> simple atomic counter, because a spinlock requires two atomic
> operations (lock and unlock). Hence a batch count of 4 only reduces
> the atomic op count by half but introduces a lot of extra
> complexity. It won't make a difference to the scalability of
> workloads that hammer the btree block count because contention on
> the internal counter spinlock will occur at close to the same
> concurrency rate as would occur on an atomic counter.
> 

Right, but percpu_counter_read_positive() allows a fast read in the
xfs_mod_fdblocks() path. I didn't use an atomic because I was concerned
about introducing overhead in that path. If we're OK with whatever
overhead an atomic read might introduce (a spin lock in the worst case
on some arches), then I don't mind switching over to that. I also don't
mind defining a new spin lock and explicitly implementing the lockless
read in xfs_mod_fdblocks(); I just thought that was extra code for
little benefit over the percpu counter. Preference?

Brian

> Hence a per-cpu counter used in this manner seems like a premature
> optimisation to me...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


Thread overview: 7+ messages
2021-02-17 13:23 [PATCH] xfs: set aside allocation btree blocks from block reservation Brian Foster
2021-02-18  0:34 ` Dave Chinner
2021-02-18 13:25   ` Brian Foster [this message]
2021-02-19  2:24     ` Dave Chinner
2021-02-19 14:09       ` Brian Foster
2021-02-19 21:28         ` Dave Chinner
2021-02-18  7:51 ` Chandan Babu R
