* [GIT PULL] xfs: per-ag centric allocation algorithms
@ 2023-02-15 1:11 Dave Chinner
2023-02-16 21:43 ` Darrick J. Wong
0 siblings, 1 reply; 2+ messages in thread
From: Dave Chinner @ 2023-02-15 1:11 UTC (permalink / raw)
To: linux-xfs; +Cc: djwong
Hi Darrick,
As you requested yesterday on #xfs, please find below a pull request
for the allocator rework series. It is unchanged from the last
version I posted except for updates to RVB tags and a rebase on the
current linux-xfs/for-next branch. Details of the changes made in
the series are captured in the tag identifying the branch for you to
pull.
PLease let me know if I've screwed anything up in the tree or
commit metadata and I'll get them sorted ASAP.
Cheers,
Dave.
---
The following changes since commit dd07bb8b6baf2389caff221f043d9188ce6bab8c:
xfs: revert commit 8954c44ff477 (2023-02-10 09:06:06 -0800)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-alloc-perag-conversion
for you to fetch changes up to bd4f5d09cc93c8ca51e4efea86ac90a4bb553d6e:
xfs: refactor the filestreams allocator pick functions (2023-02-13 09:14:56 +1100)
----------------------------------------------------------------
xfs: per-ag centric allocation alogrithms
This series continues the work towards making shrinking a filesystem
possible. We need to be able to stop operations from taking place
on AGs that need to be removed by a shrink, so before shrink can be
implemented we need to have the infrastructure in place to prevent
incursion into AGs that are going to be, or are in the process, of
being removed from active duty.
The focus of this is making operations that depend on access to AGs
use the perag to access and pin the AG in active use, thereby
creating a barrier we can use to delay shrink until all active uses
of an AG have been drained and new uses are prevented.
This series starts by fixing some existing issues that are exposed
by changes later in the series. They stand alone, so can be picked
up independently of the rest of this patchset.
The most complex of these fixes is cleaning up the mess that is the
AGF deadlock avoidance algorithm. This algorithm stores the first
block that is allocated in a transaction in tp->t_firstblock, then
uses this to try to limit future allocations within the transaction
to AGs at or higher than the filesystem block stored in
tp->t_firstblock. This depends on one of the initial bug fixes in
the series to move the deadlock avoidance checks to
xfs_alloc_vextent(), and then builds on it to relax the constraints
of the avoidance algorithm to only be active when a deadlock is
possible.
We also update the algorithm to record allocations from higher AGs
that are allocated from, because we when we need to lock more than
two AGs we still have to ensure lock order is correct. Therefore we
can't lock AGs in the order 1, 3, 2, even though tp->t_firstblock
indicates that we've allocated from AG 1 and so AG is valid to lock.
It's not valid, because we already hold AG 3 locked, and so
tp->t-first_block should actually point at AG 3, not AG 1 in this
situation.
It should now be obvious that the deadlock avoidance algorithm
should record AGs, not filesystem blocks. So the series then changes
the transaction to store the highest AG we've allocated in rather
than a filesystem block we allocated. This makes it obvious what
the constraints are, and trivial to update as we lock and allocate
from various AGs.
With all the bug fixes out of the way, the series then starts
converting the code to use active references. Active reference
counts are used by high level code that needs to prevent the AG from
being taken out from under it by a shrink operation. The high level
code needs to be able to handle not getting an active reference
gracefully, and the shrink code will need to wait for active
references to drain before continuing.
Active references are implemented just as reference counts right now
- an active reference is taken at perag init during mount, and all
other active references are dependent on the active reference count
being greater than zero. This gives us an initial method of stopping
new active references without needing other infrastructure; just
drop the reference taken at filesystem mount time and when the
refcount then falls to zero no new references can be taken.
In future, this will need to take into account AG control state
(e.g. offline, no alloc, etc) as well as the reference count, but
right now we can implement a basic barrier for shrink with just
reference count manipulations. As such, patches to convert the perag
state to atomic opstate fields similar to the xfs_mount and xlog
opstate fields follow the initial active perag reference counting
patches.
The first target for active reference conversion is the
for_each_perag*() iterators. This captures a lot of high level code
that should skip offline AGs, and introduces the ability to
differentiate between a lookup that didn't have an online AG and the
end of the AG iteration range.
From there, the inode allocation AG selection is converted to active
references, and the perag is driven deeper into the inode allocation
and btree code to replace the xfs_mount. Most of the inode
allocation code operates on a single AG once it is selected, hence
it should pass the perag as the primary referenced object around for
allocation, not the xfs_mount. There is a bit of churn here, but it
emphasises that inode allocation is inherently an allocation group
based operation.
Next the bmap/alloc interface undergoes a major untangling,
reworking xfs_bmap_btalloc() into separate allocation operations for
different contexts and failure handling behaviours. This then allows
us to completely remove the xfs_alloc_vextent() layer via
restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
set of realtively simple helper function that describe the
allocation that they are doing. e.g. xfs_alloc_vextent_exact_bno().
This allows the requirements for accessing AGs to be allocation
context dependent. The allocations that require operation on a
single AG generally can't tolerate failure after the allocation
method and AG has been decided on, and hence the caller needs to
manage the active references to ensure the allocation does not race
with shrink removing the selected AG for the duration of the
operation that requires access to that allocation group.
Other allocations iterate AGs and so the first AG is just a hint -
these do not need to pin a perag first as they can tolerate not
being able to access an AG by simply skipping over it. These require
new perag iteration functions that can start at arbitrary AGs and
wrap around at arbitrary AGs, hence a new set for
for_each_perag_wrap*() helpers to do this.
Next is the rework of the filestreams allocator. This doesn't change
any functionality, but gets rid of the unnecessary multi-pass
selection algorithm when the selected AG is not available. It
currently does a lookup pass which might iterate all AGs to select
an AG, then checks if the AG is acceptible and if not does a "new
AG" pass that is essentially identical to the lookup pass. Both of
these scans also do the same "longest extent in AG" check before
selecting an AG as is done after the AG is selected.
IOWs, the filestreams algorithm can be greatly simplified into a
single new AG selection pass if the there is no current association
or the currently associated AG doesn't have enough contiguous free
space for the allocation to proceed. With this simplification of
the filestreams allocator, it's then trivial to convert it to use
for_each_perag_wrap() for the AG scan algorithm.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
----------------------------------------------------------------
Dave Chinner (42):
xfs: fix low space alloc deadlock
xfs: prefer free inodes at ENOSPC over chunk allocation
xfs: block reservation too large for minleft allocation
xfs: drop firstblock constraints from allocation setup
xfs: t_firstblock is tracking AGs not blocks
xfs: don't assert fail on transaction cancel with deferred ops
xfs: active perag reference counting
xfs: rework the perag trace points to be perag centric
xfs: convert xfs_imap() to take a perag
xfs: use active perag references for inode allocation
xfs: inobt can use perags in many more places than it does
xfs: convert xfs_ialloc_next_ag() to an atomic
xfs: perags need atomic operational state
xfs: introduce xfs_for_each_perag_wrap()
xfs: rework xfs_alloc_vextent()
xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()
xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent
xfs: use xfs_alloc_vextent_this_ag() where appropriate
xfs: factor xfs_bmap_btalloc()
xfs: use xfs_alloc_vextent_first_ag() where appropriate
xfs: use xfs_alloc_vextent_start_bno() where appropriate
xfs: introduce xfs_alloc_vextent_near_bno()
xfs: introduce xfs_alloc_vextent_exact_bno()
xfs: introduce xfs_alloc_vextent_prepare()
xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
xfs: fold xfs_alloc_ag_vextent() into callers
xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
xfs: convert trim to use for_each_perag_range
xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
xfs: get rid of notinit from xfs_bmap_longest_free_extent
xfs: use xfs_bmap_longest_free_extent() in filestreams
xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
xfs: merge filestream AG lookup into xfs_filestream_select_ag()
xfs: merge new filestream AG selection into xfs_filestream_select_ag()
xfs: remove xfs_filestream_select_ag() longest extent check
xfs: factor out MRU hit case in xfs_filestream_select_ag
xfs: track an active perag reference in filestreams
xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
xfs: pass perag to filestreams tracing
xfs: return a referenced perag from filestreams allocator
xfs: refactor the filestreams allocator pick functions
fs/xfs/libxfs/xfs_ag.c | 93 ++++-
fs/xfs/libxfs/xfs_ag.h | 111 +++++-
fs/xfs/libxfs/xfs_ag_resv.c | 2 +-
fs/xfs/libxfs/xfs_alloc.c | 683 +++++++++++++++++++++++--------------
fs/xfs/libxfs/xfs_alloc.h | 61 ++--
fs/xfs/libxfs/xfs_alloc_btree.c | 2 +-
fs/xfs/libxfs/xfs_bmap.c | 664 +++++++++++++++++------------------
fs/xfs/libxfs/xfs_bmap.h | 7 +
fs/xfs/libxfs/xfs_bmap_btree.c | 64 ++--
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/libxfs/xfs_ialloc.c | 241 ++++++-------
fs/xfs/libxfs/xfs_ialloc.h | 5 +-
fs/xfs/libxfs/xfs_ialloc_btree.c | 47 ++-
fs/xfs/libxfs/xfs_ialloc_btree.h | 20 +-
fs/xfs/libxfs/xfs_refcount_btree.c | 10 +-
fs/xfs/libxfs/xfs_rmap_btree.c | 2 +-
fs/xfs/libxfs/xfs_sb.c | 3 +-
fs/xfs/scrub/agheader_repair.c | 35 +-
fs/xfs/scrub/bmap.c | 2 +-
fs/xfs/scrub/common.c | 21 +-
fs/xfs/scrub/fscounters.c | 13 +-
fs/xfs/scrub/repair.c | 7 +-
fs/xfs/xfs_bmap_util.c | 2 +-
fs/xfs/xfs_discard.c | 50 ++-
fs/xfs/xfs_filestream.c | 455 ++++++++++++------------
fs/xfs/xfs_filestream.h | 6 +-
fs/xfs/xfs_fsmap.c | 4 +-
fs/xfs/xfs_icache.c | 8 +-
fs/xfs/xfs_inode.c | 2 +-
fs/xfs/xfs_iwalk.c | 10 +-
fs/xfs/xfs_mount.h | 3 +-
fs/xfs/xfs_reflink.c | 4 +-
fs/xfs/xfs_super.c | 47 +--
fs/xfs/xfs_trace.h | 81 ++---
fs/xfs/xfs_trans.c | 8 +-
fs/xfs/xfs_trans.h | 2 +-
36 files changed, 1534 insertions(+), 1243 deletions(-)
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 2+ messages in thread* Re: [GIT PULL] xfs: per-ag centric allocation algorithms
2023-02-15 1:11 [GIT PULL] xfs: per-ag centric allocation algorithms Dave Chinner
@ 2023-02-16 21:43 ` Darrick J. Wong
0 siblings, 0 replies; 2+ messages in thread
From: Darrick J. Wong @ 2023-02-16 21:43 UTC (permalink / raw)
To: Dave Chinner; +Cc: linux-xfs
On Wed, Feb 15, 2023 at 12:11:09PM +1100, Dave Chinner wrote:
> Hi Darrick,
>
> As you requested yesterday on #xfs, please find below a pull request
> for the allocator rework series. It is unchanged from the last
> version I posted except for updates to RVB tags and a rebase on the
> current linux-xfs/for-next branch. Details of the changes made in
> the series are captured in the tag identifying the branch for you to
> pull.
>
> PLease let me know if I've screwed anything up in the tree or
> commit metadata and I'll get them sorted ASAP.
Pulled, thanks.
--D
>
> Cheers,
>
> Dave.
>
> ---
>
> The following changes since commit dd07bb8b6baf2389caff221f043d9188ce6bab8c:
>
> xfs: revert commit 8954c44ff477 (2023-02-10 09:06:06 -0800)
>
> are available in the Git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs tags/xfs-alloc-perag-conversion
>
> for you to fetch changes up to bd4f5d09cc93c8ca51e4efea86ac90a4bb553d6e:
>
> xfs: refactor the filestreams allocator pick functions (2023-02-13 09:14:56 +1100)
>
> ----------------------------------------------------------------
> xfs: per-ag centric allocation alogrithms
>
> This series continues the work towards making shrinking a filesystem
> possible. We need to be able to stop operations from taking place
> on AGs that need to be removed by a shrink, so before shrink can be
> implemented we need to have the infrastructure in place to prevent
> incursion into AGs that are going to be, or are in the process, of
> being removed from active duty.
>
> The focus of this is making operations that depend on access to AGs
> use the perag to access and pin the AG in active use, thereby
> creating a barrier we can use to delay shrink until all active uses
> of an AG have been drained and new uses are prevented.
>
> This series starts by fixing some existing issues that are exposed
> by changes later in the series. They stand alone, so can be picked
> up independently of the rest of this patchset.
>
> The most complex of these fixes is cleaning up the mess that is the
> AGF deadlock avoidance algorithm. This algorithm stores the first
> block that is allocated in a transaction in tp->t_firstblock, then
> uses this to try to limit future allocations within the transaction
> to AGs at or higher than the filesystem block stored in
> tp->t_firstblock. This depends on one of the initial bug fixes in
> the series to move the deadlock avoidance checks to
> xfs_alloc_vextent(), and then builds on it to relax the constraints
> of the avoidance algorithm to only be active when a deadlock is
> possible.
>
> We also update the algorithm to record allocations from higher AGs
> that are allocated from, because we when we need to lock more than
> two AGs we still have to ensure lock order is correct. Therefore we
> can't lock AGs in the order 1, 3, 2, even though tp->t_firstblock
> indicates that we've allocated from AG 1 and so AG is valid to lock.
> It's not valid, because we already hold AG 3 locked, and so
> tp->t-first_block should actually point at AG 3, not AG 1 in this
> situation.
>
> It should now be obvious that the deadlock avoidance algorithm
> should record AGs, not filesystem blocks. So the series then changes
> the transaction to store the highest AG we've allocated in rather
> than a filesystem block we allocated. This makes it obvious what
> the constraints are, and trivial to update as we lock and allocate
> from various AGs.
>
> With all the bug fixes out of the way, the series then starts
> converting the code to use active references. Active reference
> counts are used by high level code that needs to prevent the AG from
> being taken out from under it by a shrink operation. The high level
> code needs to be able to handle not getting an active reference
> gracefully, and the shrink code will need to wait for active
> references to drain before continuing.
>
> Active references are implemented just as reference counts right now
> - an active reference is taken at perag init during mount, and all
> other active references are dependent on the active reference count
> being greater than zero. This gives us an initial method of stopping
> new active references without needing other infrastructure; just
> drop the reference taken at filesystem mount time and when the
> refcount then falls to zero no new references can be taken.
>
> In future, this will need to take into account AG control state
> (e.g. offline, no alloc, etc) as well as the reference count, but
> right now we can implement a basic barrier for shrink with just
> reference count manipulations. As such, patches to convert the perag
> state to atomic opstate fields similar to the xfs_mount and xlog
> opstate fields follow the initial active perag reference counting
> patches.
>
> The first target for active reference conversion is the
> for_each_perag*() iterators. This captures a lot of high level code
> that should skip offline AGs, and introduces the ability to
> differentiate between a lookup that didn't have an online AG and the
> end of the AG iteration range.
>
> From there, the inode allocation AG selection is converted to active
> references, and the perag is driven deeper into the inode allocation
> and btree code to replace the xfs_mount. Most of the inode
> allocation code operates on a single AG once it is selected, hence
> it should pass the perag as the primary referenced object around for
> allocation, not the xfs_mount. There is a bit of churn here, but it
> emphasises that inode allocation is inherently an allocation group
> based operation.
>
> Next the bmap/alloc interface undergoes a major untangling,
> reworking xfs_bmap_btalloc() into separate allocation operations for
> different contexts and failure handling behaviours. This then allows
> us to completely remove the xfs_alloc_vextent() layer via
> restructuring the xfs_alloc_vextent/xfs_alloc_ag_vextent() into a
> set of realtively simple helper function that describe the
> allocation that they are doing. e.g. xfs_alloc_vextent_exact_bno().
>
> This allows the requirements for accessing AGs to be allocation
> context dependent. The allocations that require operation on a
> single AG generally can't tolerate failure after the allocation
> method and AG has been decided on, and hence the caller needs to
> manage the active references to ensure the allocation does not race
> with shrink removing the selected AG for the duration of the
> operation that requires access to that allocation group.
>
> Other allocations iterate AGs and so the first AG is just a hint -
> these do not need to pin a perag first as they can tolerate not
> being able to access an AG by simply skipping over it. These require
> new perag iteration functions that can start at arbitrary AGs and
> wrap around at arbitrary AGs, hence a new set for
> for_each_perag_wrap*() helpers to do this.
>
> Next is the rework of the filestreams allocator. This doesn't change
> any functionality, but gets rid of the unnecessary multi-pass
> selection algorithm when the selected AG is not available. It
> currently does a lookup pass which might iterate all AGs to select
> an AG, then checks if the AG is acceptible and if not does a "new
> AG" pass that is essentially identical to the lookup pass. Both of
> these scans also do the same "longest extent in AG" check before
> selecting an AG as is done after the AG is selected.
>
> IOWs, the filestreams algorithm can be greatly simplified into a
> single new AG selection pass if the there is no current association
> or the currently associated AG doesn't have enough contiguous free
> space for the allocation to proceed. With this simplification of
> the filestreams allocator, it's then trivial to convert it to use
> for_each_perag_wrap() for the AG scan algorithm.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> ----------------------------------------------------------------
> Dave Chinner (42):
> xfs: fix low space alloc deadlock
> xfs: prefer free inodes at ENOSPC over chunk allocation
> xfs: block reservation too large for minleft allocation
> xfs: drop firstblock constraints from allocation setup
> xfs: t_firstblock is tracking AGs not blocks
> xfs: don't assert fail on transaction cancel with deferred ops
> xfs: active perag reference counting
> xfs: rework the perag trace points to be perag centric
> xfs: convert xfs_imap() to take a perag
> xfs: use active perag references for inode allocation
> xfs: inobt can use perags in many more places than it does
> xfs: convert xfs_ialloc_next_ag() to an atomic
> xfs: perags need atomic operational state
> xfs: introduce xfs_for_each_perag_wrap()
> xfs: rework xfs_alloc_vextent()
> xfs: factor xfs_alloc_vextent_this_ag() for _iterate_ags()
> xfs: combine __xfs_alloc_vextent_this_ag and xfs_alloc_ag_vextent
> xfs: use xfs_alloc_vextent_this_ag() where appropriate
> xfs: factor xfs_bmap_btalloc()
> xfs: use xfs_alloc_vextent_first_ag() where appropriate
> xfs: use xfs_alloc_vextent_start_bno() where appropriate
> xfs: introduce xfs_alloc_vextent_near_bno()
> xfs: introduce xfs_alloc_vextent_exact_bno()
> xfs: introduce xfs_alloc_vextent_prepare()
> xfs: move allocation accounting to xfs_alloc_vextent_set_fsbno()
> xfs: fold xfs_alloc_ag_vextent() into callers
> xfs: move the minimum agno checks into xfs_alloc_vextent_check_args
> xfs: convert xfs_alloc_vextent_iterate_ags() to use perag walker
> xfs: convert trim to use for_each_perag_range
> xfs: factor out filestreams from xfs_bmap_btalloc_nullfb
> xfs: get rid of notinit from xfs_bmap_longest_free_extent
> xfs: use xfs_bmap_longest_free_extent() in filestreams
> xfs: move xfs_bmap_btalloc_filestreams() to xfs_filestreams.c
> xfs: merge filestream AG lookup into xfs_filestream_select_ag()
> xfs: merge new filestream AG selection into xfs_filestream_select_ag()
> xfs: remove xfs_filestream_select_ag() longest extent check
> xfs: factor out MRU hit case in xfs_filestream_select_ag
> xfs: track an active perag reference in filestreams
> xfs: use for_each_perag_wrap in xfs_filestream_pick_ag
> xfs: pass perag to filestreams tracing
> xfs: return a referenced perag from filestreams allocator
> xfs: refactor the filestreams allocator pick functions
>
> fs/xfs/libxfs/xfs_ag.c | 93 ++++-
> fs/xfs/libxfs/xfs_ag.h | 111 +++++-
> fs/xfs/libxfs/xfs_ag_resv.c | 2 +-
> fs/xfs/libxfs/xfs_alloc.c | 683 +++++++++++++++++++++++--------------
> fs/xfs/libxfs/xfs_alloc.h | 61 ++--
> fs/xfs/libxfs/xfs_alloc_btree.c | 2 +-
> fs/xfs/libxfs/xfs_bmap.c | 664 +++++++++++++++++------------------
> fs/xfs/libxfs/xfs_bmap.h | 7 +
> fs/xfs/libxfs/xfs_bmap_btree.c | 64 ++--
> fs/xfs/libxfs/xfs_btree.c | 2 +-
> fs/xfs/libxfs/xfs_ialloc.c | 241 ++++++-------
> fs/xfs/libxfs/xfs_ialloc.h | 5 +-
> fs/xfs/libxfs/xfs_ialloc_btree.c | 47 ++-
> fs/xfs/libxfs/xfs_ialloc_btree.h | 20 +-
> fs/xfs/libxfs/xfs_refcount_btree.c | 10 +-
> fs/xfs/libxfs/xfs_rmap_btree.c | 2 +-
> fs/xfs/libxfs/xfs_sb.c | 3 +-
> fs/xfs/scrub/agheader_repair.c | 35 +-
> fs/xfs/scrub/bmap.c | 2 +-
> fs/xfs/scrub/common.c | 21 +-
> fs/xfs/scrub/fscounters.c | 13 +-
> fs/xfs/scrub/repair.c | 7 +-
> fs/xfs/xfs_bmap_util.c | 2 +-
> fs/xfs/xfs_discard.c | 50 ++-
> fs/xfs/xfs_filestream.c | 455 ++++++++++++------------
> fs/xfs/xfs_filestream.h | 6 +-
> fs/xfs/xfs_fsmap.c | 4 +-
> fs/xfs/xfs_icache.c | 8 +-
> fs/xfs/xfs_inode.c | 2 +-
> fs/xfs/xfs_iwalk.c | 10 +-
> fs/xfs/xfs_mount.h | 3 +-
> fs/xfs/xfs_reflink.c | 4 +-
> fs/xfs/xfs_super.c | 47 +--
> fs/xfs/xfs_trace.h | 81 ++---
> fs/xfs/xfs_trans.c | 8 +-
> fs/xfs/xfs_trans.h | 2 +-
> 36 files changed, 1534 insertions(+), 1243 deletions(-)
> --
> Dave Chinner
> david@fromorbit.com
^ permalink raw reply [flat|nested] 2+ messages in thread
end of thread, other threads:[~2023-02-16 21:43 UTC | newest]
Thread overview: 2+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-02-15 1:11 [GIT PULL] xfs: per-ag centric allocation algorithms Dave Chinner
2023-02-16 21:43 ` Darrick J. Wong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox