From: Dave Chinner <david@fromorbit.com>
To: linux-xfs@vger.kernel.org
Subject: [RFC] [PATCH 00/50] xfs: per-ag centric allocation alogrithms
Date: Sat, 11 Jun 2022 11:26:09 +1000 [thread overview]
Message-ID: <20220611012659.3418072-1-david@fromorbit.com> (raw)
Hi folks,
This is "heads up" at this point so that people can see what is
coming down the line and make early comments, not a request to
consider these for merging soon. I may cherry pick some of the
initial AGI/AGF cleanup patches patches for this cycle, but I'll
send them separately if I do. The patch series is based on a
5.19-rc1 kernel.
This series continues the work towards making shrinking a filesystem
possible. We need to be able to stop operations from taking place
on AGs that need to be removed by a shrink, so before shrink can be
implemented we need to have the infrastructure in place to prevent
incursion into AGs that are going to be, or are in the process, of
being removed from active duty.
The focus of this is making operations that depend on access to AGs
use the perag to access and pin the AG in active use, thereby
creating a barrier we can use to delay shrink until all active uses
have been drained and new uses are prevented.
This series starts by driving the perag down into the AGI, AGF and
AGFL access routines and unifies the perag structure initialisation
with the high level AG header read functions. This largely replaces
the xfs_mount/agno pair that is passed to all these functions with a
perag, and in most places we already have a perag ready to pass in.
There are a few places where perags need to be grabbed before
reading the AG header buffers - some of these will need to be driven
to higher layers to ensure we can run operations on AGs without
getting stuck part way through waiting on a perag reference.
The next section of this patchset moves some of the AG geometry
information from the xfs_mount to the xfs_perag, and starts
converting code that requires geometry validation to use a perag
instead of a mount and having to extract the AGNO from the object
location. This also allows us to store the AG size in the perag and
then we can stop having to compare the agno against sb_agcount to
determine if the AG is the last AG and so has a runt size. This
greatly simplifies some of the type validity checking we do and
substantially reduces the CPU overhead of type validity checking. It
also cuts over 1.2kB out of the binary size.
The series then starts converting the code to use active references.
Active reference counts are used by high level code that needs to
prevent the AG from being taken out from under it by a shrink
operation. The high level code needs to be able to handle not
getting an active reference gracefully, and the shrink code will
need to wait for active references to drain before continuing.
Active references are implemented just as reference counts right now
- an active reference is taken at perag init during mount, and all
other active references are dependent on the active reference count
being greater than zero. This gives us an initial method of stopping
new active references without needing other infrastructure; just
drop the reference taken at filesystem mount time and when the
refcount then falls to zero no new references can be taken.
In future, this will need to take into account AG control state
(e.g. offline, no alloc, etc) as well as the reference count, but
right now we can implement a basic barrier for shrink with just
reference count manipulations. There are patches to convert the
perag state to atomic opstate fields similar to the xfs_mount and
xlog opstate fields in preparation for this.
The first target for active reference conversion is the
for_each_perag*() iterators. This captures a lot of high level code
that should skip offline AGs, and introduces the ability to
differentiate between a lookup that didn't have an online AG and the
end of the AG iteration range.
From there, the inode allocation AG selection is converted to active
references, and the perag is driven deeper into the inode allocation
and btree code to replace the xfs_mount. Most of the inode
allocation code operates on a single AG once it is selected, hence
it should pass the perag as the primary referenced object around for
allocation, not the xfs_mount. There is a bit of churn here, but it
emphasises that inode allocation is inherently an allocation group
based operation.
Next the bmap/alloc interface undergoes a major untangling,
reworking xfs_bmap_btalloc() into separate allocation operations for
different contexts and failure handling behaviours. This then allows
us to completely remove the xfs_alloc_vextent() layer via
restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
set of realtively simple helper function that describe the
allocation that they are doing. e.g. xfs_alloc_vextent_exact_bno().
This allows the requirements for accessing AGs to be allocation
context dependent. The allocations that require operation on a
single AG generally can't tolerate failure after the allocation
method and AG has been decided on, and hence the caller needs to
manage the active references to ensure the allocation does not race
with shrink removing the selected AG for the duration of the
operation that requires access to that allocation group.
Other allocations iterate AGs and so the first AG is just a hint -
these do not need to pin a perag first as they can tolerate not
being able to access an AG by simply skipping over it. These require
new perag iteration functions that can start at arbitrary AGs and
wrap around at arbitrary AGs, hence a new set for
for_each_perag_wrap*() helpers to do this.
Next is the rework of the filestreams allocator. This doesn't change
any functionality, but gets rid of the unnecessary multi-pass
selection algorithm when the selected AG is not available. It
currently does a lookup pass which might iterate all AGs to select
an AG, then checks if the AG is acceptible and if not does a "new
AG" pass that is essentially identical to the lookup pass. Both of
these scans also do the same "longest extent in AG" check before
selecting an AG as is done after the AG is selected.
IOWs, the filestreams algorithm can be greatly simplified into a
single new AG selection pass if the there is no current association
or the currently associated AG doesn't have enough contiguous free
space for the allocation to proceed. With this simplification of
the filestreams allocator, it's then trivial to convert it to use
for_each_ag_wrap() for the AG scan algorithm.
This actually passes auto group fstests with rmapbt=1 with only one
regression - xfs/294 gets ENOSPC earlier and that makes unexpected
output noise. The last patch in the series is needed to fix a AGF
ABBA locking deadlock in g/476 - I only just worked this one out,
and I strongly suspect that it's a pre-existing bug that leaves an
AGF locked after failing to allocate anything from the AG.
This series currently ends at the xfs_bmap_btalloc ->allocator
conversion. There still more to be done here before we can start
disabling AGs for shrink:
- the bmapi layer needs to handle active AG references for exact and
near allocation
- converting the allocation "firstblock" restrictions to hold an
actively referenced perag, not a filesystem block address.
- inode cache lookups need to converted to active references
- audits needed to find and convert all the places that we use
bp->b_pag instead of active references passed from high level
code.
- addition of a "going offline" opstate and state machine to use for
rejecting new active references as well as blocking shrink from
making progress until all active references are gone
- ioctls for changing AG state from userspace
- audit of the freeing code to determine whether it can use passive
references to allow freeing of blocks (which may require
allocation!) whilst new allocations are prevented from being run
on "going offline" AGs. This will allow userspace to stop new
allocations in AGs to be shrunk before it starts emptying them and
freeing the space that they have in use.
Cheers,
Dave.
next reply other threads:[~2022-06-11 1:27 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-06-11 1:26 Dave Chinner [this message]
2022-06-11 1:26 ` [PATCH 01/50] xfs: make last AG grow/shrink perag centric Dave Chinner
2022-06-16 7:30 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 02/50] xfs: kill xfs_ialloc_pagi_init() Dave Chinner
2022-06-16 7:32 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 03/50] xfs: pass perag to xfs_ialloc_read_agi() Dave Chinner
2022-06-16 7:34 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 04/50] xfs: kill xfs_alloc_pagf_init() Dave Chinner
2022-06-16 7:35 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 05/50] xfs: pass perag to xfs_alloc_read_agf() Dave Chinner
2022-06-11 2:37 ` kernel test robot
2022-06-11 12:04 ` kernel test robot
2022-06-11 13:46 ` kernel test robot
2022-06-14 12:17 ` kernel test robot
2022-06-16 7:38 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 06/50] xfs: pass perag to xfs_read_agi Dave Chinner
2022-06-16 7:39 ` Christoph Hellwig
2022-06-16 7:39 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 07/50] xfs: pass perag to xfs_read_agf Dave Chinner
2022-06-16 7:40 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 08/50] xfs: pass perag to xfs_alloc_get_freelist Dave Chinner
2022-06-16 7:40 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 09/50] xfs: pass perag to xfs_alloc_put_freelist Dave Chinner
2022-06-16 7:40 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 10/50] xfs: pass perag to xfs_alloc_read_agfl Dave Chinner
2022-06-16 7:41 ` Christoph Hellwig
2022-06-11 1:26 ` [PATCH 11/50] xfs: Pre-calculate per-AG agbno geometry Dave Chinner
2022-06-11 1:26 ` [PATCH 12/50] xfs: Pre-calculate per-AG agino geometry Dave Chinner
2022-06-11 3:08 ` kernel test robot
2022-06-11 1:26 ` [PATCH 13/50] xfs: replace xfs_ag_block_count() with perag accesses Dave Chinner
2022-06-11 1:26 ` [PATCH 14/50] xfs: make is_log_ag() a first class helper Dave Chinner
2022-06-11 1:26 ` [PATCH 15/50] xfs: active perag reference counting Dave Chinner
2022-06-11 1:26 ` [PATCH 16/50] xfs: rework the perag trace points to be perag centric Dave Chinner
2022-06-11 1:26 ` [PATCH 17/50] xfs: convert xfs_imap() to take a perag Dave Chinner
2022-06-11 1:26 ` [PATCH 18/50] xfs: use active perag references for inode allocation Dave Chinner
2022-06-11 1:26 ` [PATCH 19/50] xfs: inobt can use perags in many more places than it does Dave Chinner
2022-06-11 1:26 ` [PATCH 20/50] xfs: convert xfs_ialloc_next_ag() to an atomic Dave Chinner
2022-06-11 1:26 ` [PATCH 21/50] xfs: perags need atomic operational state Dave Chinner
2022-06-11 1:26 ` [PATCH 22/50] xfs: introduce xfs_for_each_perag_wrap() Dave Chinner
2022-06-11 1:26 ` [PATCH 23/50] xfs: rework xfs_alloc_vextent() Dave Chinner
2022-06-11 1:26 ` [PATCH 24/50] xfs: use xfs_alloc_vextent_this_ag() in _iterate_ags() Dave Chinner
2022-06-11 1:26 ` [PATCH 25/50] xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent Dave Chinner
2022-06-11 1:26 ` [PATCH 26/50] xfs: use xfs_alloc_vextent_this_ag() where appropriate Dave Chinner
2022-06-11 1:26 ` [PATCH 27/50] xfs: factor xfs_bmap_btalloc() Dave Chinner
2022-06-11 1:26 ` [PATCH 28/50] xfs: use xfs_alloc_vextent_first_ag() where appropriate Dave Chinner
2022-06-11 1:26 ` [PATCH 29/50] xfs: use xfs_alloc_vextent_start_bno() " Dave Chinner
2022-06-11 1:26 ` [PATCH 30/50] xfs: introduce xfs_alloc_vextent_near_bno() Dave Chinner
2022-06-11 1:26 ` [PATCH 31/50] xfs: introduce xfs_alloc_vextent_exact_bno() Dave Chinner
2022-06-11 1:26 ` [PATCH 32/50] xfs: introduce xfs_alloc_vextent_prepare() Dave Chinner
2022-06-11 1:26 ` [PATCH 33/50] xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno() Dave Chinner
2022-06-11 1:26 ` [PATCH 34/50] xfs: fold xfs_alloc_ag_vextent() into callers Dave Chinner
2022-06-11 1:26 ` [PATCH 35/50] xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker Dave Chinner
2022-06-11 1:26 ` [PATCH 36/50] xfs: convert trim to use for_each_perag_range Dave Chinner
2022-06-11 1:26 ` [PATCH 37/50] xfs: factor out filestreams from xfs_bmap_btalloc_nullfb Dave Chinner
2022-06-11 1:26 ` [PATCH 38/50] xfs: get rid of notinit from xfs_bmap_longest_free_extent Dave Chinner
2022-06-11 1:26 ` [PATCH 39/50] xfs: use xfs_bmap_longest_free_extent() in filestreams Dave Chinner
2022-06-11 1:26 ` [PATCH 40/50] xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c Dave Chinner
2022-06-11 1:26 ` [PATCH 41/50] xfs: merge filestream AG lookup into xfs_filestream_select_ag() Dave Chinner
2022-06-11 1:26 ` [PATCH 42/50] xfs: merge new filestream AG selection " Dave Chinner
2022-06-11 1:26 ` [PATCH 43/50] xfs: remove xfs_filestream_select_ag() longest extent check Dave Chinner
2022-06-11 1:26 ` [PATCH 44/50] xfs: factor out MRU hit case in xfs_filestream_select_ag Dave Chinner
2022-06-11 1:26 ` [PATCH 45/50] xfs: track an active perag reference in filestreams Dave Chinner
2022-06-11 1:26 ` [PATCH 46/50] xfs: use for_each_perag_wrap in xfs_filestream_pick_ag Dave Chinner
2022-06-11 1:26 ` [PATCH 47/50] xfs: pass perag to filestreams tracing Dave Chinner
2022-06-11 1:26 ` [PATCH 48/50] xfs: return a referenced perag from filestreams allocator Dave Chinner
2022-06-11 1:26 ` [PATCH 49/50] xfs: refactor the filestreams allocator pick functions Dave Chinner
2022-06-11 1:26 ` [PATCH 50/50] xfs: fix low space alloc deadlock Dave Chinner
2022-06-16 12:01 ` [RFC] [PATCH 00/50] xfs: per-ag centric allocation alogrithms Christoph Hellwig
2022-06-21 2:08 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220611012659.3418072-1-david@fromorbit.com \
--to=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox