* [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs
@ 2025-10-20 15:43 Nirjhar Roy (IBM)
2025-10-20 15:43 ` [RFC V3 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group Nirjhar Roy (IBM)
` (3 more replies)
0 siblings, 4 replies; 16+ messages in thread
From: Nirjhar Roy (IBM) @ 2025-10-20 15:43 UTC (permalink / raw)
To: linux-xfs
Cc: nirjhar.roy.lists, ritesh.list, ojaswin, djwong, bfoster, david,
hsiangkao
This work is based on a previous RFC[1] by Gao Xiang and various ideas
proposed by Dave Chinner in the RFC[1].
Currently the functionality of shrink is limited to shrinking the last
AG partially but not beyond that. This patch extends the functionality
to support shrinking beyond 1 AG. However the AGs that we will be remove
have to empty in order to prevent any loss of data.
The patch begins with the re-introduction of some of the data
structures that were removed, some code refactoring and
finally the patch that implements the multi AG shrink design.
The final patch has all the details including the definition of the
terminologies and the overall design.
fstests are in [3].
[rfc_v2] --> v3
1) Function/macro renamings:
1.a xfs_ag_is_empty() -> xfs_perag_is_empty()
1.b xfs_ag_is_active() -> xfs_perag_is_active()
1.c xfs_shrinkfs_stablize_ags() -> xfs_shrinkfs_quiesce_ags()
1.d for_each_perag_range_reverse -> for_each_agno_range_reverse
2) Modified the commit messages for patch 3/3
2.a Modified the definition of empty AG
2.b Slightly changed the description of some of the steps in ag
quiesce/stablization and ag deactivation.
3) Design changes:
3.a In function xfs_growfs_data_private() - call
xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, delta) instead of
manually restoring the fdblock incore counters(which were reserved
during AG deactivation) if the AG count is reducing during shrink.
3.b Introduced a new state XFS_OPSTATE_SHRINKING. This flag will be set
during start of the shrink (in xfs_growfs_data_private())
and will be cleared after the shrink process finishes/aborts.
Now, using the function xfs_is_shrinking(), we turn off the
following check in xfs_validate_ag_length():
if (bp->b_pag && seqno != mp->m_sb.sb_agcount - 1)
return __this_address;
We do the above in the following way:
if (!xfs_is_shrinking(mp) &&
bp->b_pag && seqno != mp->m_sb.sb_agcount - 1)
return __this_address;
Shrinking is a rare operation and hence the above logic makes
sense.
3.c In function xfs_perag_deactivate() - Returning int instead of bool
and replacing wait_event() with wait_event_killable() so that the
shrink process can be safely killed by an user. If the wait is
interrupted, the offlined AGs (if any) will be re-activated.
[rfc_v1] --> v2
1) Function renamings:
1.a xfs_activate_ag() -> xfs_perag_activate()
1.b xfs_deactivate_ag() -> xfs_perag_deactivate()
1.c xfs_pag_populate_cached_bufs() -> xfs_buf_cache_grab_all()
1.d xfs_buf_offline_perag_rele_cached() -> xfs_buf_cache_invalidate()
1.e xfs_extent_busy_wait_range() -> xfs_extent_busy_wait_ags()
1.f xfs_growfs_get_delta() -> xfs_growfs_compute_delta()
2) Fixed several coding style fixes and typos in the code and
commit messages.
3) Introduced for_each_perag_range_reverse() macro and used in
instead of using for loops directly.
4) Design changes:
4.a In function xfs_ag_is_empty() - Removed the
ASSERT(!xfs_ag_contains_log(mp, pag_agno(pag)));
4.b In function xfs_shrinkfs_reactivate_ags() - Replaced
if (nagcount >= oagcount) return; with ASSERT(nagcount < oagcount);
4.c In function xfs_perag_deactivate() - Add one extra step where
we manually reduce/reserve (pagf_freeblks + pagf_flcount) worth of
free datablocks from the global counters. This is necessary
in order to prevent a race where, some AGs have been temporarily
offlined but the delayed allocator has already promised some bytes
and later the real extent/block allocation is failing due to
the AG(s) being offline.
4.d In function xfs_perag_activate() - Add one extra step where
we restore the global free block counter which we reduced in
xfs_perag_deactivate.
4.e In function xfs_shrinkfs_deactivate_ags() -
1. Flushing the xfs_discard_wq after the log force/flush.
2. Removed the direct usage of xfs_log_quiesce(). The reason
is that xfs_log_quiesce() is expected to be called when the
caller has made sure that the log/filesystem is idle but
for shrink, we don't necessarily need the log/filesystem
to be idle.
However, we still need the checkpointing to take place,
so we are doing a xfs_sync_sb+AIL flush twice - something
similar that is being done in xfs_log_cover().
More details are in the patch.
3. Moved the entire code of ag stabilization (after ag
offlining) into a separate function -
xfs_shrinkfs_stabilize_ags().
4.f Fixed a bug where if the size of the new tail AG was less than
XFS_MIN_AG_BLOCKS, then shrink was passing - the correct behavior
is to fail with -EINVAL. Thank you Ritesh[2] for pointing this out.
5) Added RBs from Darrick in patch 1/3 and patch 2/3 (after addressing his
comments).
[1] https://lore.kernel.org/all/20210414195240.1802221-1-hsiangkao@redhat.com/
[2] https://lore.kernel.org/all/875xfas2f6.fsf@gmail.com/
[3] https://lore.kernel.org/all/cover.1758035262.git.nirjhar.roy.lists@gmail.com/
[rfc_v1] https://lore.kernel.org/all/cover.1752746805.git.nirjhar.roy.lists@gmail.com/
[rfc_v2] https://lore.kernel.org/linux-xfs/cover.1758034274.git.nirjhar.roy.lists@gmail.com/
Nirjhar Roy (IBM) (3):
xfs: Re-introduce xg_active_wq field in struct xfs_group
xfs: Refactoring the nagcount and delta calculation
xfs: Add support to shrink multiple empty AGs
fs/xfs/libxfs/xfs_ag.c | 191 ++++++++++++++++-
fs/xfs/libxfs/xfs_ag.h | 17 ++
fs/xfs/libxfs/xfs_alloc.c | 10 +-
fs/xfs/libxfs/xfs_group.c | 4 +-
fs/xfs/libxfs/xfs_group.h | 2 +
fs/xfs/xfs_buf.c | 78 +++++++
fs/xfs/xfs_buf.h | 1 +
fs/xfs/xfs_buf_item_recover.c | 37 ++--
fs/xfs/xfs_extent_busy.c | 30 +++
fs/xfs/xfs_extent_busy.h | 2 +
fs/xfs/xfs_fsops.c | 379 +++++++++++++++++++++++++++++++---
fs/xfs/xfs_mount.h | 3 +
fs/xfs/xfs_trans.c | 1 -
13 files changed, 701 insertions(+), 54 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 16+ messages in thread* [RFC V3 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group 2025-10-20 15:43 [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) @ 2025-10-20 15:43 ` Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation Nirjhar Roy (IBM) ` (2 subsequent siblings) 3 siblings, 0 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-10-20 15:43 UTC (permalink / raw) To: linux-xfs Cc: nirjhar.roy.lists, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao pag_active_wq was removed in commit 9943b4573290 ("xfs: remove the unused pag_active_wq field in struct xfs_perag") because it was not waited upon. Re-introducing this in struct xfs_group. This patch also replaces atomic_dec() in xfs_group_rele() with if (atomic_dec_and_test(&xg->xg_active_ref)) wake_up(&xg->xg_active_wq); The reason for this change is that the online shrink code will wait for all the active references to come down to zero before actually starting the shrink process (only if the number of blocks that we are trying to remove is worth 1 or more AGs). Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_group.c | 4 +++- fs/xfs/libxfs/xfs_group.h | 2 ++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_group.c b/fs/xfs/libxfs/xfs_group.c index 792f76d2e2a0..51ef9dd9d1ed 100644 --- a/fs/xfs/libxfs/xfs_group.c +++ b/fs/xfs/libxfs/xfs_group.c @@ -147,7 +147,8 @@ xfs_group_rele( struct xfs_group *xg) { trace_xfs_group_rele(xg, _RET_IP_); - atomic_dec(&xg->xg_active_ref); + if (atomic_dec_and_test(&xg->xg_active_ref)) + wake_up(&xg->xg_active_wq); } void @@ -202,6 +203,7 @@ xfs_group_insert( xfs_defer_drain_init(&xg->xg_intents_drain); /* Active ref owned by mount indicates group is online. */ + init_waitqueue_head(&xg->xg_active_wq); atomic_set(&xg->xg_active_ref, 1); error = xa_insert(&mp->m_groups[type].xa, index, xg, GFP_KERNEL); diff --git a/fs/xfs/libxfs/xfs_group.h b/fs/xfs/libxfs/xfs_group.h index 4423932a2313..21361508a5b7 100644 --- a/fs/xfs/libxfs/xfs_group.h +++ b/fs/xfs/libxfs/xfs_group.h @@ -11,6 +11,8 @@ struct xfs_group { enum xfs_group_type xg_type; atomic_t xg_ref; /* passive reference count */ atomic_t xg_active_ref; /* active reference count */ + /* woken up when xg_active_ref falls to zero */ + wait_queue_head_t xg_active_wq; /* Precalculated geometry info */ uint32_t xg_block_count; /* max usable gbno */ -- 2.43.5 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation 2025-10-20 15:43 [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group Nirjhar Roy (IBM) @ 2025-10-20 15:43 ` Nirjhar Roy (IBM) 2026-02-02 14:15 ` Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 3/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) 2025-10-22 7:17 ` [RFC V3 0/3] " Christoph Hellwig 3 siblings, 1 reply; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-10-20 15:43 UTC (permalink / raw) To: linux-xfs Cc: nirjhar.roy.lists, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao Introduce xfs_growfs_compute_delta() to calculate the nagcount and delta blocks and refactor the code from xfs_growfs_data_private(). No functional changes. Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> --- fs/xfs/libxfs/xfs_ag.c | 28 ++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_ag.h | 3 +++ fs/xfs/xfs_fsops.c | 17 ++--------------- 3 files changed, 33 insertions(+), 15 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index e6ba914f6d06..f2b35d59d51e 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -872,6 +872,34 @@ xfs_ag_shrink_space( return err2; } +void +xfs_growfs_compute_deltas( + struct xfs_mount *mp, + xfs_rfsblock_t nb, + int64_t *deltap, + xfs_agnumber_t *nagcountp) +{ + xfs_rfsblock_t nb_div, nb_mod; + int64_t delta; + xfs_agnumber_t nagcount; + + nb_div = nb; + nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); + if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) + nb_div++; + else if (nb_mod) + nb = nb_div * mp->m_sb.sb_agblocks; + + if (nb_div > XFS_MAX_AGNUMBER + 1) { + nb_div = XFS_MAX_AGNUMBER + 1; + nb = nb_div * mp->m_sb.sb_agblocks; + } + nagcount = nb_div; + delta = nb - mp->m_sb.sb_dblocks; + *deltap = delta; + *nagcountp = nagcount; +} + /* * Extent the AG indicated by the @id by the length passed in */ diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index 1f24cfa27321..f7b56d486468 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -331,6 +331,9 @@ struct aghdr_init_data { int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, xfs_extlen_t delta); +void +xfs_growfs_compute_deltas(struct xfs_mount *mp, xfs_rfsblock_t nb, + int64_t *deltap, xfs_agnumber_t *nagcountp); int xfs_ag_extend_space(struct xfs_perag *pag, struct xfs_trans *tp, xfs_extlen_t len); int xfs_ag_get_geometry(struct xfs_perag *pag, struct xfs_ag_geometry *ageo); diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 0ada73569394..8353e2f186f6 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -92,18 +92,17 @@ xfs_growfs_data_private( struct xfs_growfs_data *in) /* growfs data input struct */ { xfs_agnumber_t oagcount = mp->m_sb.sb_agcount; + xfs_rfsblock_t nb = in->newblocks; struct xfs_buf *bp; int error; xfs_agnumber_t nagcount; xfs_agnumber_t nagimax = 0; - xfs_rfsblock_t nb, nb_div, nb_mod; int64_t delta; bool lastag_extended = false; struct xfs_trans *tp; struct aghdr_init_data id = {}; struct xfs_perag *last_pag; - nb = in->newblocks; error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); if (error) return error; @@ -122,20 +121,8 @@ xfs_growfs_data_private( mp->m_sb.sb_rextsize); if (error) return error; + xfs_growfs_compute_deltas(mp, nb, &delta, &nagcount); - nb_div = nb; - nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); - if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) - nb_div++; - else if (nb_mod) - nb = nb_div * mp->m_sb.sb_agblocks; - - if (nb_div > XFS_MAX_AGNUMBER + 1) { - nb_div = XFS_MAX_AGNUMBER + 1; - nb = nb_div * mp->m_sb.sb_agblocks; - } - nagcount = nb_div; - delta = nb - mp->m_sb.sb_dblocks; /* * Reject filesystems with a single AG because they are not * supported, and reject a shrink operation that would cause a -- 2.43.5 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation 2025-10-20 15:43 ` [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation Nirjhar Roy (IBM) @ 2026-02-02 14:15 ` Nirjhar Roy (IBM) 2026-02-02 14:38 ` Christoph Hellwig 2026-02-02 16:50 ` Carlos Maiolino 0 siblings, 2 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-02 14:15 UTC (permalink / raw) To: linux-xfs, Carlos Maiolino Cc: ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao On 10/20/25 21:13, Nirjhar Roy (IBM) wrote: > Introduce xfs_growfs_compute_delta() to calculate the nagcount > and delta blocks and refactor the code from xfs_growfs_data_private(). > No functional changes. > > Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Hi Carlos, Darrick, Can this be picked up? This is quite independent of the rest of the patches in this series. --NR > --- > fs/xfs/libxfs/xfs_ag.c | 28 ++++++++++++++++++++++++++++ > fs/xfs/libxfs/xfs_ag.h | 3 +++ > fs/xfs/xfs_fsops.c | 17 ++--------------- > 3 files changed, 33 insertions(+), 15 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c > index e6ba914f6d06..f2b35d59d51e 100644 > --- a/fs/xfs/libxfs/xfs_ag.c > +++ b/fs/xfs/libxfs/xfs_ag.c > @@ -872,6 +872,34 @@ xfs_ag_shrink_space( > return err2; > } > > +void > +xfs_growfs_compute_deltas( > + struct xfs_mount *mp, > + xfs_rfsblock_t nb, > + int64_t *deltap, > + xfs_agnumber_t *nagcountp) > +{ > + xfs_rfsblock_t nb_div, nb_mod; > + int64_t delta; > + xfs_agnumber_t nagcount; > + > + nb_div = nb; > + nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); > + if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) > + nb_div++; > + else if (nb_mod) > + nb = nb_div * mp->m_sb.sb_agblocks; > + > + if (nb_div > XFS_MAX_AGNUMBER + 1) { > + nb_div = XFS_MAX_AGNUMBER + 1; > + nb = nb_div * mp->m_sb.sb_agblocks; > + } > + nagcount = nb_div; > + delta = nb - mp->m_sb.sb_dblocks; > + *deltap = delta; > + *nagcountp = nagcount; > +} > + > /* > * Extent the AG indicated by the @id by the length passed in > */ > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h > index 1f24cfa27321..f7b56d486468 100644 > --- a/fs/xfs/libxfs/xfs_ag.h > +++ b/fs/xfs/libxfs/xfs_ag.h > @@ -331,6 +331,9 @@ struct aghdr_init_data { > int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); > int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, > xfs_extlen_t delta); > +void > +xfs_growfs_compute_deltas(struct xfs_mount *mp, xfs_rfsblock_t nb, > + int64_t *deltap, xfs_agnumber_t *nagcountp); > int xfs_ag_extend_space(struct xfs_perag *pag, struct xfs_trans *tp, > xfs_extlen_t len); > int xfs_ag_get_geometry(struct xfs_perag *pag, struct xfs_ag_geometry *ageo); > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c > index 0ada73569394..8353e2f186f6 100644 > --- a/fs/xfs/xfs_fsops.c > +++ b/fs/xfs/xfs_fsops.c > @@ -92,18 +92,17 @@ xfs_growfs_data_private( > struct xfs_growfs_data *in) /* growfs data input struct */ > { > xfs_agnumber_t oagcount = mp->m_sb.sb_agcount; > + xfs_rfsblock_t nb = in->newblocks; > struct xfs_buf *bp; > int error; > xfs_agnumber_t nagcount; > xfs_agnumber_t nagimax = 0; > - xfs_rfsblock_t nb, nb_div, nb_mod; > int64_t delta; > bool lastag_extended = false; > struct xfs_trans *tp; > struct aghdr_init_data id = {}; > struct xfs_perag *last_pag; > > - nb = in->newblocks; > error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); > if (error) > return error; > @@ -122,20 +121,8 @@ xfs_growfs_data_private( > mp->m_sb.sb_rextsize); > if (error) > return error; > + xfs_growfs_compute_deltas(mp, nb, &delta, &nagcount); > > - nb_div = nb; > - nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); > - if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) > - nb_div++; > - else if (nb_mod) > - nb = nb_div * mp->m_sb.sb_agblocks; > - > - if (nb_div > XFS_MAX_AGNUMBER + 1) { > - nb_div = XFS_MAX_AGNUMBER + 1; > - nb = nb_div * mp->m_sb.sb_agblocks; > - } > - nagcount = nb_div; > - delta = nb - mp->m_sb.sb_dblocks; > /* > * Reject filesystems with a single AG because they are not > * supported, and reject a shrink operation that would cause a -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation 2026-02-02 14:15 ` Nirjhar Roy (IBM) @ 2026-02-02 14:38 ` Christoph Hellwig 2026-02-02 16:50 ` Carlos Maiolino 1 sibling, 0 replies; 16+ messages in thread From: Christoph Hellwig @ 2026-02-02 14:38 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: linux-xfs, Carlos Maiolino, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao On Mon, Feb 02, 2026 at 07:45:56PM +0530, Nirjhar Roy (IBM) wrote: > > On 10/20/25 21:13, Nirjhar Roy (IBM) wrote: > > Introduce xfs_growfs_compute_delta() to calculate the nagcount > > and delta blocks and refactor the code from xfs_growfs_data_private(). > > No functional changes. > > > > Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > > Hi Carlos, Darrick, > > Can this be picked up? This is quite independent of the rest of the patches > in this series. For a 2 1/2 month patch picked from a larger series, a resend is probably a better idea. But the last days of the merge window might not be the right time for a pure refactoring without urgency anyway. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation 2026-02-02 14:15 ` Nirjhar Roy (IBM) 2026-02-02 14:38 ` Christoph Hellwig @ 2026-02-02 16:50 ` Carlos Maiolino 2026-02-02 16:53 ` Nirjhar Roy (IBM) 1 sibling, 1 reply; 16+ messages in thread From: Carlos Maiolino @ 2026-02-02 16:50 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: linux-xfs, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao Hi! On Mon, Feb 02, 2026 at 07:45:56PM +0530, Nirjhar Roy (IBM) wrote: > > On 10/20/25 21:13, Nirjhar Roy (IBM) wrote: > > Introduce xfs_growfs_compute_delta() to calculate the nagcount > > and delta blocks and refactor the code from xfs_growfs_data_private(). > > No functional changes. > > > > Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> > > Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> > > Hi Carlos, Darrick, > > Can this be picked up? This is quite independent of the rest of the patches > in this series. If you tag a series as RFC, don't expect the maintainer to pick it up. Please, re-send it again without the RFC tag. We don't have more time for this merge window though, I'll pick it up for the next. Cheers. > > --NR > > > --- > > fs/xfs/libxfs/xfs_ag.c | 28 ++++++++++++++++++++++++++++ > > fs/xfs/libxfs/xfs_ag.h | 3 +++ > > fs/xfs/xfs_fsops.c | 17 ++--------------- > > 3 files changed, 33 insertions(+), 15 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c > > index e6ba914f6d06..f2b35d59d51e 100644 > > --- a/fs/xfs/libxfs/xfs_ag.c > > +++ b/fs/xfs/libxfs/xfs_ag.c > > @@ -872,6 +872,34 @@ xfs_ag_shrink_space( > > return err2; > > } > > +void > > +xfs_growfs_compute_deltas( > > + struct xfs_mount *mp, > > + xfs_rfsblock_t nb, > > + int64_t *deltap, > > + xfs_agnumber_t *nagcountp) > > +{ > > + xfs_rfsblock_t nb_div, nb_mod; > > + int64_t delta; > > + xfs_agnumber_t nagcount; > > + > > + nb_div = nb; > > + nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); > > + if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) > > + nb_div++; > > + else if (nb_mod) > > + nb = nb_div * mp->m_sb.sb_agblocks; > > + > > + if (nb_div > XFS_MAX_AGNUMBER + 1) { > > + nb_div = XFS_MAX_AGNUMBER + 1; > > + nb = nb_div * mp->m_sb.sb_agblocks; > > + } > > + nagcount = nb_div; > > + delta = nb - mp->m_sb.sb_dblocks; > > + *deltap = delta; > > + *nagcountp = nagcount; > > +} > > + > > /* > > * Extent the AG indicated by the @id by the length passed in > > */ > > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h > > index 1f24cfa27321..f7b56d486468 100644 > > --- a/fs/xfs/libxfs/xfs_ag.h > > +++ b/fs/xfs/libxfs/xfs_ag.h > > @@ -331,6 +331,9 @@ struct aghdr_init_data { > > int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); > > int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, > > xfs_extlen_t delta); > > +void > > +xfs_growfs_compute_deltas(struct xfs_mount *mp, xfs_rfsblock_t nb, > > + int64_t *deltap, xfs_agnumber_t *nagcountp); > > int xfs_ag_extend_space(struct xfs_perag *pag, struct xfs_trans *tp, > > xfs_extlen_t len); > > int xfs_ag_get_geometry(struct xfs_perag *pag, struct xfs_ag_geometry *ageo); > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c > > index 0ada73569394..8353e2f186f6 100644 > > --- a/fs/xfs/xfs_fsops.c > > +++ b/fs/xfs/xfs_fsops.c > > @@ -92,18 +92,17 @@ xfs_growfs_data_private( > > struct xfs_growfs_data *in) /* growfs data input struct */ > > { > > xfs_agnumber_t oagcount = mp->m_sb.sb_agcount; > > + xfs_rfsblock_t nb = in->newblocks; > > struct xfs_buf *bp; > > int error; > > xfs_agnumber_t nagcount; > > xfs_agnumber_t nagimax = 0; > > - xfs_rfsblock_t nb, nb_div, nb_mod; > > int64_t delta; > > bool lastag_extended = false; > > struct xfs_trans *tp; > > struct aghdr_init_data id = {}; > > struct xfs_perag *last_pag; > > - nb = in->newblocks; > > error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); > > if (error) > > return error; > > @@ -122,20 +121,8 @@ xfs_growfs_data_private( > > mp->m_sb.sb_rextsize); > > if (error) > > return error; > > + xfs_growfs_compute_deltas(mp, nb, &delta, &nagcount); > > - nb_div = nb; > > - nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); > > - if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) > > - nb_div++; > > - else if (nb_mod) > > - nb = nb_div * mp->m_sb.sb_agblocks; > > - > > - if (nb_div > XFS_MAX_AGNUMBER + 1) { > > - nb_div = XFS_MAX_AGNUMBER + 1; > > - nb = nb_div * mp->m_sb.sb_agblocks; > > - } > > - nagcount = nb_div; > > - delta = nb - mp->m_sb.sb_dblocks; > > /* > > * Reject filesystems with a single AG because they are not > > * supported, and reject a shrink operation that would cause a > > -- > Nirjhar Roy > Linux Kernel Developer > IBM, Bangalore > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation 2026-02-02 16:50 ` Carlos Maiolino @ 2026-02-02 16:53 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2026-02-02 16:53 UTC (permalink / raw) To: Carlos Maiolino Cc: linux-xfs, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao On 2/2/26 22:20, Carlos Maiolino wrote: > Hi! > > On Mon, Feb 02, 2026 at 07:45:56PM +0530, Nirjhar Roy (IBM) wrote: >> On 10/20/25 21:13, Nirjhar Roy (IBM) wrote: >>> Introduce xfs_growfs_compute_delta() to calculate the nagcount >>> and delta blocks and refactor the code from xfs_growfs_data_private(). >>> No functional changes. >>> >>> Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> >>> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> >> Hi Carlos, Darrick, >> >> Can this be picked up? This is quite independent of the rest of the patches >> in this series. > If you tag a series as RFC, don't expect the maintainer to pick it up. > > Please, re-send it again without the RFC tag. We don't have more > time for this merge window though, I'll pick it up for the next. Sure, I will re-send it without the RFC tag. Thank you. --NR > > Cheers. > >> --NR >> >>> --- >>> fs/xfs/libxfs/xfs_ag.c | 28 ++++++++++++++++++++++++++++ >>> fs/xfs/libxfs/xfs_ag.h | 3 +++ >>> fs/xfs/xfs_fsops.c | 17 ++--------------- >>> 3 files changed, 33 insertions(+), 15 deletions(-) >>> >>> diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c >>> index e6ba914f6d06..f2b35d59d51e 100644 >>> --- a/fs/xfs/libxfs/xfs_ag.c >>> +++ b/fs/xfs/libxfs/xfs_ag.c >>> @@ -872,6 +872,34 @@ xfs_ag_shrink_space( >>> return err2; >>> } >>> +void >>> +xfs_growfs_compute_deltas( >>> + struct xfs_mount *mp, >>> + xfs_rfsblock_t nb, >>> + int64_t *deltap, >>> + xfs_agnumber_t *nagcountp) >>> +{ >>> + xfs_rfsblock_t nb_div, nb_mod; >>> + int64_t delta; >>> + xfs_agnumber_t nagcount; >>> + >>> + nb_div = nb; >>> + nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); >>> + if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) >>> + nb_div++; >>> + else if (nb_mod) >>> + nb = nb_div * mp->m_sb.sb_agblocks; >>> + >>> + if (nb_div > XFS_MAX_AGNUMBER + 1) { >>> + nb_div = XFS_MAX_AGNUMBER + 1; >>> + nb = nb_div * mp->m_sb.sb_agblocks; >>> + } >>> + nagcount = nb_div; >>> + delta = nb - mp->m_sb.sb_dblocks; >>> + *deltap = delta; >>> + *nagcountp = nagcount; >>> +} >>> + >>> /* >>> * Extent the AG indicated by the @id by the length passed in >>> */ >>> diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h >>> index 1f24cfa27321..f7b56d486468 100644 >>> --- a/fs/xfs/libxfs/xfs_ag.h >>> +++ b/fs/xfs/libxfs/xfs_ag.h >>> @@ -331,6 +331,9 @@ struct aghdr_init_data { >>> int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); >>> int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, >>> xfs_extlen_t delta); >>> +void >>> +xfs_growfs_compute_deltas(struct xfs_mount *mp, xfs_rfsblock_t nb, >>> + int64_t *deltap, xfs_agnumber_t *nagcountp); >>> int xfs_ag_extend_space(struct xfs_perag *pag, struct xfs_trans *tp, >>> xfs_extlen_t len); >>> int xfs_ag_get_geometry(struct xfs_perag *pag, struct xfs_ag_geometry *ageo); >>> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c >>> index 0ada73569394..8353e2f186f6 100644 >>> --- a/fs/xfs/xfs_fsops.c >>> +++ b/fs/xfs/xfs_fsops.c >>> @@ -92,18 +92,17 @@ xfs_growfs_data_private( >>> struct xfs_growfs_data *in) /* growfs data input struct */ >>> { >>> xfs_agnumber_t oagcount = mp->m_sb.sb_agcount; >>> + xfs_rfsblock_t nb = in->newblocks; >>> struct xfs_buf *bp; >>> int error; >>> xfs_agnumber_t nagcount; >>> xfs_agnumber_t nagimax = 0; >>> - xfs_rfsblock_t nb, nb_div, nb_mod; >>> int64_t delta; >>> bool lastag_extended = false; >>> struct xfs_trans *tp; >>> struct aghdr_init_data id = {}; >>> struct xfs_perag *last_pag; >>> - nb = in->newblocks; >>> error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); >>> if (error) >>> return error; >>> @@ -122,20 +121,8 @@ xfs_growfs_data_private( >>> mp->m_sb.sb_rextsize); >>> if (error) >>> return error; >>> + xfs_growfs_compute_deltas(mp, nb, &delta, &nagcount); >>> - nb_div = nb; >>> - nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); >>> - if (nb_mod && nb_mod >= XFS_MIN_AG_BLOCKS) >>> - nb_div++; >>> - else if (nb_mod) >>> - nb = nb_div * mp->m_sb.sb_agblocks; >>> - >>> - if (nb_div > XFS_MAX_AGNUMBER + 1) { >>> - nb_div = XFS_MAX_AGNUMBER + 1; >>> - nb = nb_div * mp->m_sb.sb_agblocks; >>> - } >>> - nagcount = nb_div; >>> - delta = nb - mp->m_sb.sb_dblocks; >>> /* >>> * Reject filesystems with a single AG because they are not >>> * supported, and reject a shrink operation that would cause a >> -- >> Nirjhar Roy >> Linux Kernel Developer >> IBM, Bangalore >> >> -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RFC V3 3/3] xfs: Add support to shrink multiple empty AGs 2025-10-20 15:43 [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation Nirjhar Roy (IBM) @ 2025-10-20 15:43 ` Nirjhar Roy (IBM) 2025-10-22 7:17 ` [RFC V3 0/3] " Christoph Hellwig 3 siblings, 0 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-10-20 15:43 UTC (permalink / raw) To: linux-xfs Cc: nirjhar.roy.lists, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao This patch is based on a previous RFC[1] by Gao Xiang and various ideas proposed by Dave Chinner in the RFC[1]. This patch adds the functionality to shrink the filesystem beyond 1 AG. We can remove only empty AGs in order to prevent loss of data. Before I summarize the overall steps of the shrink process, I would like to introduce some of the terminologies: 1. Empty AG - An AG with no allocated space other than AG headers, empty AG btree root blocks, and AGFL reserved blocks. Removal of this AG will not result in any data loss. 2. Active/Online AG - Online AG and active AG will be used interchangebly. An AG is active or online when all the regular operations can be done on it. When we mount a filesystem, all the AGs are by default online/active. In terms of implementation, an online AG will have number of active references greater than 0 (default is 1 i.e, an AG by default is online/active). 3. AG offlining/deactivation - AG offlining and AG deactivation will be used interchangebly. An AG is said to be offlined/deactivated when no new high level operation can be started on the AG. This is implemented with the help of active references. When the active reference count of an AG is 0, the AG is said to be deactivated. No new active reference can be taken if the present active reference count is 0. This way a barrier is formed from preventing new high level operations to get started on an already offlined AG. 4. Reactivating an AG - If we try to remove an offlined AG but for some reason, we can't, then we reactivate the AG i.e, the AG will once more be in an usable state i.e, the active reference count will be set to 1. All the high level operations can now be performed on this AG. In terms of implementation, in order to activate an AG, we atomically set the active reference count to 1. 5. AG removal - This means that an AG no longer exists in the filesystem. It will be reflected in the usable/total size of the device too (using tools like df). 6. New tail AG - This refers to the last AG that will be formed after the removal of 1 or more AGs. For example, if there are 4 AGs, each with 32 blocks, then there are total of 4 * 32 = 128 blocks. Now, if we remove 40 blocks, AG 3(indexed at 0) will be completely removed (32 blocks) and from AG 2, we will remove 8 blocks. So AG 2 will be the new tail AG. 7. Old tail AG - This is the last AG before the start of the shrink process. If the number of blocks removed is less than the AG size, then the old tail AG will be the same as the new tail AG. 8. AG stabilization - This simply means that the in-memory contents are synched to the disk. The overall steps for the removal of AG(s) are as follows: PHASE 1: Preparing the AGs for removal 1. Deactivate the AGs to be removed completely - This is done by the function xfs_shrinkfs_deactivate_ags(). The steps to deactivate an AG are as follows(function is xfs_perag_deactivate()): 1.a Manually reserve/reduce from the global fdblock free counters the perag pagf_freeblks + pagf_flcount. This is done in order to prevent a race where, some AGs have been offlined but the delayed allocator has already promised some bytes and the real extent/block allocation is failing due to the AG(s) being offline. So shrink operation reserves to the shrink transaction the space to be removed from the incore fdblocks and either commits that change to the ondisk fdblocks (shrink succeeds) or gives it back (shrink fails). 1.b Wait for the active reference to come to 0. This is done so that no other entity is racing while the removal is in progress i.e, no new high level operation can start on that AG while we are trying to remove the AG. AG deactivation will fail if the AG is non-empty at the time of deactivation. 2. Once we have waited for the active references to come down to 0, we make sure that all the pending operations on that AG are completed and the in-core and on-disk structures are in synch i.e, the AG is stabilized on to the disk. The steps to stablize the AG onto the disk are as follows: 2.a We need to flush and empty the logs and wait for all the pending I/Os to complete - for this, perform a log force+ail push by calling xfs_ail_push_all_sync(). This also ensures that none of the future logged transactions will refer to these AGs during log recovery in case if sudden shutdown/crash happens while we are trying to remove these AGs. We also sync the superblock with the disk. 2.b Wait for all the busy extents for the target AGs to be resolved (done by the function xfs_extent_busy_wait_ags()) 2.c Flush the xfs_discard_wq workqueue 3. Once the AG is deactivated and stabilized on to the disk, we check if all the target AGs are empty, and if not, we fail the shrink process. We are not supporting partial shrink i.e, the shrink will either completely fail or completely succeed. PHASE 2: Shrink new tail group, punch out totally empty groups 4. Once the preparation phase is over, we start the actual removal process. This is done in the function xfs_shrinkfs_remove_ags(). Here we first remove the blocks, then update the metadata of new last tail AG and then remove the AGs (and their associated data structures) one by one (in function xfs_shrinkfs_remove_ag()). 5. In the end we log the changes and commit the transaction. Removal of each incore AG structure is done by the function xfs_shrinkfs_remove_ag(). The steps can be outlined as follows: 1. Free the per AG reservation - this will result in correct free space/used space information. 2. Freeing the intents drain queue. 3. Freeing busy extents list. 4. Remove the perag cached buffers and then the buffer cache. 5. Freeing the struct xfs_group pointer - Before this is done, we assert that all the active and passive references are down to 0. We remove all the cached buffers associated with the offlined AGs to be removed - this releases the passive references of the AGs consumed by the cached buffers. [1] https://lore.kernel.org/all/20210414195240.1802221-1-hsiangkao@redhat.com/ Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com> Inspired-by: Gao Xiang <hsiangkao@linux.alibaba.com> Suggested-by: Dave Chinner <david@fromorbit.com> --- fs/xfs/libxfs/xfs_ag.c | 163 ++++++++++++++- fs/xfs/libxfs/xfs_ag.h | 14 ++ fs/xfs/libxfs/xfs_alloc.c | 10 +- fs/xfs/xfs_buf.c | 78 ++++++++ fs/xfs/xfs_buf.h | 1 + fs/xfs/xfs_buf_item_recover.c | 37 ++-- fs/xfs/xfs_extent_busy.c | 30 +++ fs/xfs/xfs_extent_busy.h | 2 + fs/xfs/xfs_fsops.c | 364 ++++++++++++++++++++++++++++++++-- fs/xfs/xfs_mount.h | 3 + fs/xfs/xfs_trans.c | 1 - 11 files changed, 664 insertions(+), 39 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index f2b35d59d51e..d7f955d48c5c 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -193,20 +193,32 @@ xfs_agino_range( } /* - * Update the perag of the previous tail AG if it has been changed during - * recovery (i.e. recovery of a growfs). + * This function does the following: + * - Updates the previous perag tail if prev_agcount < current agcount i.e, the + * filesystem has grown OR + * - Updates the current tail AG when prev_agcount > current agcount i.e, the + * filesystem has shrunk beyond 1 AG OR + * - Updates the current tail AG when only the last AG was shrunk or grown i.e, + * prev_agcount == mp->m_sb.sb_agcount. */ int xfs_update_last_ag_size( struct xfs_mount *mp, xfs_agnumber_t prev_agcount) { - struct xfs_perag *pag = xfs_perag_grab(mp, prev_agcount - 1); + xfs_agnumber_t agno; + struct xfs_perag *pag; + if (prev_agcount >= mp->m_sb.sb_agcount) + agno = mp->m_sb.sb_agcount - 1; + else + agno = prev_agcount - 1; + + pag = xfs_perag_grab(mp, agno); if (!pag) return -EFSCORRUPTED; - pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp, - prev_agcount - 1, mp->m_sb.sb_agcount, + pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp, agno, + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks); __xfs_agino_range(mp, pag_group(pag)->xg_block_count, &pag->agino_min, &pag->agino_max); @@ -290,6 +302,48 @@ xfs_initialize_perag( return error; } +void +xfs_perag_activate(struct xfs_perag *pag) +{ + ASSERT(!xfs_perag_is_active(pag)); + init_waitqueue_head(&pag_group(pag)->xg_active_wq); + atomic_set(&pag_group(pag)->xg_active_ref, 1); + xfs_add_fdblocks(pag_mount(pag), pag->pagf_freeblks + + pag->pagf_flcount); +} + +int +xfs_perag_deactivate(struct xfs_perag *pag) +{ + int error = 0; + + ASSERT(xfs_perag_is_active(pag)); + if (!xfs_perag_is_empty(pag)) + return -ENOTEMPTY; + /* + * Manually reduce/reserve (pagf_freeblks + pagf_flcount) worth of + * free datablocks from the global counters. This is necessary + * in order to prevent a race where, some AGs have been temporarily + * offlined but the delayed allocator has already promised some bytes + * and later the real extent/block allocation is failing due to + * the AG(s) being offline. + * If the overall shrink fails, we will restore the values. + */ + error = xfs_dec_fdblocks(pag_mount(pag), + pag->pagf_freeblks + pag->pagf_flcount, false); + if (error) + return error; + xfs_perag_rele(pag); + do { + error = wait_event_killable(pag_group(pag)->xg_active_wq, + !xfs_perag_is_active(pag)); + if (error == -ERESTARTSYS) + return error; + + } while (xfs_perag_is_active(pag)); + return 0; +} + static int xfs_get_aghdr_buf( struct xfs_mount *mp, @@ -758,7 +812,6 @@ xfs_ag_shrink_space( xfs_agblock_t aglen; int error, err2; - ASSERT(pag_agno(pag) == mp->m_sb.sb_agcount - 1); error = xfs_ialloc_read_agi(pag, *tpp, 0, &agibp); if (error) return error; @@ -872,6 +925,104 @@ xfs_ag_shrink_space( return err2; } +/* + * This function checks whether an AG is empty. An AG is eligible to be + * removed if it is empty. + */ +bool +xfs_perag_is_empty(struct xfs_perag *pag) +{ + struct xfs_buf *agfbp = NULL; + struct xfs_mount *mp = pag_mount(pag); + bool is_empty = false; + int error = 0; + struct xfs_agf *agf = NULL; + + /* + * Read the on-disk data structures to get the correct length of the AG. + * All the AGs have the same length except the last AG. + */ + error = xfs_alloc_read_agf(pag, NULL, 0, &agfbp); + if (!error) { + agf = agfbp->b_addr; + /* + * We don't need to check if the log blocks belong here since + * the log blocks are taken from the number of free blocks, and + * if the given AG has log blocks, then those many number of + * blocks will be consumed from the number of free blocks and + * the AG empty condition will not hold true. + */ + if (pag->pagf_freeblks + pag->pagf_flcount + + mp->m_ag_prealloc_blocks == be32_to_cpu(agf->agf_length)) + is_empty = true; + xfs_buf_relse(agfbp); + } + return is_empty; +} + +/* + * This function removes an entire empty AG. Before removing the struct + * xfs_perag reference, it removes the associated data structures. Before + * removing an AG, the caller must ensure that the AG has been deactivated with + * no active references and it has been fully stabilized on the disk. + */ +void +xfs_shrinkfs_remove_ag( + struct xfs_mount *mp, + xfs_agnumber_t agno) +{ + struct xfs_group *xg = NULL; + struct xfs_perag *cur_pag = NULL; + + /* + * Number of AGs can't be less than 2 + */ + ASSERT(agno >= 2); + xg = xa_erase(&mp->m_groups[XG_TYPE_AG].xa, agno); + cur_pag = to_perag(xg); + + ASSERT(!xfs_perag_is_active(cur_pag)); + /* + * Since we are freeing the AG, we should clear the perag reservations + * for the corresponding AGs. + */ + xfs_ag_resv_free(cur_pag); + /* + * We have already ensured in the AG preparation phase that all intents + * for the offlined AGs have been resolved. So it safe to free it here. + */ + xfs_defer_drain_free(&xg->xg_intents_drain); + /* + * We have already ensured in the AG preparation phase that all busy + * extents for the offlined AGs have been resolved. So it safe to free + * it here. + */ + kfree(xg->xg_busy_extents); + cancel_delayed_work_sync(&cur_pag->pag_blockgc_work); + + /* + * Remove all the cached buffers for the given AG. + */ + xfs_buf_cache_invalidate(cur_pag); + /* + * Now that the cached buffers have been released, remove the + * cache/hashtable itself. We should not change the order of the buffer + * removal and cache removal. + */ + xfs_buf_cache_destroy(&cur_pag->pag_bcache); + /* + * One final assert, before we remove the xg. Since the cached buffers + * for the offlined AGs are already removed, their passive references + * should be 0. Also, the active references are 0 too, so no new + * operation can start and race and get new references. + */ + XFS_IS_CORRUPT(mp, atomic_read(&pag_group(cur_pag)->xg_ref) != 0); + /* + * Finally free the struct xfs_perag of the AG. + */ + kfree_rcu_mightsleep(xg); +} + void xfs_growfs_compute_deltas( struct xfs_mount *mp, diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h index f7b56d486468..c91698b96702 100644 --- a/fs/xfs/libxfs/xfs_ag.h +++ b/fs/xfs/libxfs/xfs_ag.h @@ -112,6 +112,11 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag) return pag->pag_group.xg_gno; } +static inline bool xfs_perag_is_active(struct xfs_perag *pag) +{ + return atomic_read(&pag_group(pag)->xg_active_ref) > 0; +} + /* * Per-AG operational state. These are atomic flag bits. */ @@ -140,6 +145,7 @@ void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno, xfs_agnumber_t end_agno); int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno); int xfs_update_last_ag_size(struct xfs_mount *mp, xfs_agnumber_t prev_agcount); +bool xfs_perag_is_empty(struct xfs_perag *pag); /* Passive AG references */ static inline struct xfs_perag * @@ -263,6 +269,9 @@ xfs_ag_contains_log(struct xfs_mount *mp, xfs_agnumber_t agno) agno == XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart); } +void xfs_perag_activate(struct xfs_perag *pag); +int xfs_perag_deactivate(struct xfs_perag *pag); + static inline struct xfs_perag * xfs_perag_next_wrap( struct xfs_perag *pag, @@ -290,6 +299,10 @@ xfs_perag_next_wrap( return NULL; } +#define for_each_agno_range_reverse(agno, oagcount, nagcount) \ + for ((agno) = ((oagcount) - 1); (typeof(oagcount))(agno) >= \ + ((typeof(oagcount))(nagcount) - 1); (agno)--) + /* * Iterate all AGs from start_agno through wrap_agno, then restart_agno through * (start_agno - 1). @@ -331,6 +344,7 @@ struct aghdr_init_data { int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id); int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp, xfs_extlen_t delta); +void xfs_shrinkfs_remove_ag(struct xfs_mount *mp, xfs_agnumber_t agno); void xfs_growfs_compute_deltas(struct xfs_mount *mp, xfs_rfsblock_t nb, int64_t *deltap, xfs_agnumber_t *nagcountp); diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c index ad381c73abc4..878a236735d3 100644 --- a/fs/xfs/libxfs/xfs_alloc.c +++ b/fs/xfs/libxfs/xfs_alloc.c @@ -3209,10 +3209,14 @@ xfs_validate_ag_length( if (length != mp->m_sb.sb_agblocks) { /* * During growfs, the new last AG can get here before we - * have updated the superblock. Give it a pass on the seqno - * check. + * have updated the superblock. During shrink, the new last AG + * will be updated and the AGs from newag to old AG will be + * removed. So seqno here maybe not be equal to + * mp->m_sb.sb_agcount - 1 since the super block is not yet + * updated globally. */ - if (bp->b_pag && seqno != mp->m_sb.sb_agcount - 1) + if (!xfs_is_shrinking(mp) && + bp->b_pag && seqno != mp->m_sb.sb_agcount - 1) return __this_address; if (length < XFS_MIN_AG_BLOCKS) return __this_address; diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index 773d959965dc..225e527a47af 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -948,6 +948,84 @@ xfs_buf_rele( xfs_buf_rele_cached(bp); } +/* + * This function populates a list of all the cached buffers of the given AG + * in the to_be_free list head. + */ +static void +xfs_buf_cache_grab_all( + struct xfs_perag *pag, + struct list_head *to_be_freed) +{ + struct xfs_buf *bp; + struct rhashtable_iter iter; + + rhashtable_walk_enter(&pag->pag_bcache.bc_hash, &iter); + do { + rhashtable_walk_start(&iter); + while ((bp = rhashtable_walk_next(&iter)) && !IS_ERR(bp)) { + ASSERT(list_empty(&bp->b_list)); + ASSERT(list_empty(&bp->b_li_list)); + list_add_tail(&bp->b_list, to_be_freed); + } + rhashtable_walk_stop(&iter); + } while (cond_resched(), bp == ERR_PTR(-EAGAIN)); + rhashtable_walk_exit(&iter); +} + +/* + * This function frees all the cached buffers (struct xfs_buf) associated with + * the given offline AG. The caller must ensure that the AG which is passed + * is offline and completely stabilized on the disk. Also, the caller should + * ensure that all the cached buffers are not queued for any pending i/o + * i.e, the b_list for all the cached buffers are empty - since we will be using + * b_list to get list of all the bufs that need to be freed. + */ +void +xfs_buf_cache_invalidate(struct xfs_perag *pag) +{ + /* + * First get the list of buffers we want to free. + * We need to populate to_be_freed list and cannot directly free + * the buffers during the hashtable walk. rhashtable_walk_start() takes + * an RCU and xfs_buf_rele eventually calls xfs_buf_free (for + * cached buffers). xfs_buf_free() might sleep (depending on the + * whether the buffer was allocated using vmalloc or kmalloc) and + * cannot be called within an RCU context. Hence we first populate + * the buffers within an RCU context and free them outside it. + */ + struct list_head to_be_freed; + struct xfs_buf *bp, *tmp; + + ASSERT(!xfs_perag_is_active(pag)); + + INIT_LIST_HEAD(&to_be_freed); + + xfs_buf_cache_grab_all(pag, &to_be_freed); + list_for_each_entry_safe(bp, tmp, &to_be_freed, b_list) { + list_del(&bp->b_list); + spin_lock(&bp->b_lock); + ASSERT(bp->b_pag == pag); + ASSERT(!xfs_buf_is_uncached(bp)); + /* + * Since we have made sure that this is being called on an + * AG with active refcount = 0, the b_hold value of any cached + * buffer should not exceed 1 (i.e, the default value) and hence + * can be safely removed. Hence, it should also be in an + * unlocked state. + */ + ASSERT(bp->b_hold == 1); + ASSERT(!xfs_buf_islocked(bp)); + /* + * We should set b_lru_ref to 0 so that it gets deleted from + * the lru during the call to xfs_buf_rele. + */ + atomic_set(&bp->b_lru_ref, 0); + spin_unlock(&bp->b_lock); + xfs_buf_rele(bp); + } +} + /* * Lock a buffer object, if it is not already locked. * diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h index 8fa7bdf59c91..210d7f904a68 100644 --- a/fs/xfs/xfs_buf.h +++ b/fs/xfs/xfs_buf.h @@ -282,6 +282,7 @@ void xfs_buf_hold(struct xfs_buf *bp); /* Releasing Buffers */ extern void xfs_buf_rele(struct xfs_buf *); +void xfs_buf_cache_invalidate(struct xfs_perag *pag); /* Locking and Unlocking Buffers */ extern int xfs_buf_trylock(struct xfs_buf *); diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c index e4c8af873632..f9aadfe071d5 100644 --- a/fs/xfs/xfs_buf_item_recover.c +++ b/fs/xfs/xfs_buf_item_recover.c @@ -747,8 +747,7 @@ xlog_recover_do_primary_sb_buffer( } if (mp->m_sb.sb_agcount < orig_agcount) { - xfs_alert(mp, "Shrinking AG count in log recovery not supported"); - return -EFSCORRUPTED; + xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK); } if (mp->m_sb.sb_rgcount < orig_rgcount) { xfs_warn(mp, @@ -774,18 +773,28 @@ xlog_recover_do_primary_sb_buffer( if (error) return error; } - - /* - * Initialize the new perags, and also update various block and inode - * allocator setting based off the number of AGs or total blocks. - * Because of the latter this also needs to happen if the agcount did - * not change. - */ - error = xfs_initialize_perag(mp, orig_agcount, mp->m_sb.sb_agcount, - mp->m_sb.sb_dblocks, &mp->m_maxagi); - if (error) { - xfs_warn(mp, "Failed recovery per-ag init: %d", error); - return error; + if (orig_agcount > mp->m_sb.sb_agcount) { + /* + * Remove the old AGs that were removed previously by a growfs + */ + xfs_free_perag_range(mp, mp->m_sb.sb_agcount, orig_agcount); + mp->m_maxagi = xfs_set_inode_alloc(mp, mp->m_sb.sb_agcount); + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp); + } else { + /* + * Initialize the new perags, and also the update various block + * and inode allocator setting based off the number of AGs or + * total blocks. + * Because of the latter, this also needs to happen if the + * agcount did not change. + */ + error = xfs_initialize_perag(mp, orig_agcount, + mp->m_sb.sb_agcount, + mp->m_sb.sb_dblocks, &mp->m_maxagi); + if (error) { + xfs_warn(mp, "Failed recovery per-ag init: %d", error); + return error; + } } mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index da3161572735..6e02632022d2 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -676,6 +676,36 @@ xfs_extent_busy_wait_all( xfs_extent_busy_wait_group(rtg_group(rtg)); } +/* + * Similar to xfs_extent_busy_wait_all() - It waits for all the busy extents to + * get resolved for the range of AGs provided. For now, this function is + * introduced to be used in online shrink process. Unlike + * xfs_extent_busy_wait_all(), this takes a passive reference, because this + * function is expected to be called for the AGs whose active reference has + * been reduced to 0 i.e, offline AGs. + * + * @mp - The xfs mount point + * @first_agno - The 0 based AG index of the range of AGs from which we will + * start. + * @end_agno - The 0 based AG index of the range of AGs from till which we will + * traverse. + */ +void +xfs_extent_busy_wait_ags( + struct xfs_mount *mp, + xfs_agnumber_t first_agno, + xfs_agnumber_t end_agno) +{ + xfs_agnumber_t agno; + struct xfs_perag *pag = NULL; + + for_each_agno_range_reverse(agno, end_agno + 1, first_agno + 1) { + pag = xfs_perag_get(mp, agno); + xfs_extent_busy_wait_group(pag_group(pag)); + xfs_perag_put(pag); + } +} + /* * Callback for list_sort to sort busy extents by the group they reside in. */ diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h index 3e6e019b6146..6fcab714be07 100644 --- a/fs/xfs/xfs_extent_busy.h +++ b/fs/xfs/xfs_extent_busy.h @@ -57,6 +57,8 @@ bool xfs_extent_busy_trim(struct xfs_group *xg, xfs_extlen_t minlen, unsigned *busy_gen); int xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_group *xg, unsigned busy_gen, uint32_t alloc_flags); +void xfs_extent_busy_wait_ags(struct xfs_mount *mp, xfs_agnumber_t first_agno, + xfs_agnumber_t end_agno); void xfs_extent_busy_wait_all(struct xfs_mount *mp); bool xfs_extent_busy_list_empty(struct xfs_group *xg, unsigned int *busy_gen); struct xfs_extent_busy_tree *xfs_extent_busy_alloc(void); diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c index 8353e2f186f6..faac1b3c8daa 100644 --- a/fs/xfs/xfs_fsops.c +++ b/fs/xfs/xfs_fsops.c @@ -25,6 +25,7 @@ #include "xfs_rtrmap_btree.h" #include "xfs_rtrefcount_btree.h" #include "xfs_metafile.h" +#include "xfs_trans_priv.h" /* * Write new AG headers to disk. Non-transactional, but need to be @@ -83,6 +84,297 @@ xfs_resizefs_init_new_ags( return error; } +static int +xfs_shrinkfs_quiesce_ags( + struct xfs_mount *mp, + xfs_agnumber_t oagcount, + xfs_agnumber_t nagcount) +{ + int error = 0; + int count = 0; + + /* + * We should wait for the log to be empty and all the pending I/Os to + * be completed so that the AGs are completely stabilized before we + * start tearing them down. Flushing the AIL and synching the superblock + * here ensures that none of the future logged transactions will refer + * to these AGs during log recovery in case if sudden shutdown/crash + * happens while we are trying to remove these AGs. + * The following code is similar to xfs_log_quiesce() and xfs_log_cover. + * + * We are doing a xfs_sync_sb_buf + AIL flush twice. The first + * xfs_sync_sb_buf writes a checkpoint, then the first AIL flush makes + * the first checkpoint stable. The second set of xfs_sync_sb_buf + AIL + * flush synchs the on-disk LSN with the in-core LSN. + * Unlike xfs_log_cover(), we don't necessarily want the background + * filesytem activity/log activity to stop (like in case of unmount + * or freeze). + */ + cancel_delayed_work_sync(&mp->m_log->l_work); + error = xfs_log_force(mp, XFS_LOG_SYNC); + if (error) + goto out; + + error = xfs_sync_sb_buf(mp, false); + if (error) + goto out; + + xfs_ail_push_all_sync(mp->m_ail); + xfs_buftarg_wait(mp->m_ddev_targp); + xfs_buf_lock(mp->m_sb_bp); + xfs_buf_unlock(mp->m_sb_bp); + + /* + * The first xfs_sync_sb serves as a reference for the in-core tail + * pointer and the second one updates the on-disk tail with the in-core + * lsn. This is similar to what is being done in xfs_log_cover, however + * here we are explicitly doing this twice in order to ensure forward + * progress as, during shrink the filesystem is active. + */ + for (count = 0; count < 2; count++) { + error = xfs_sync_sb(mp, true); + if (error) + goto out; + xfs_ail_push_all_sync(mp->m_ail); + } + + /* + * Wait for all the busy extents to get resolved along with pending trim + * ops for all the offlined AGs. + */ + xfs_extent_busy_wait_ags(mp, nagcount, oagcount - 1); + flush_workqueue(xfs_discard_wq); +out: + xfs_log_work_queue(mp); + return error; +} + +/* + * Get new active references for all the AGs. This might be called when + * shrinkage process encounters a failure at an intermediate stage after the + * active references of all/some of the target AGs have become 0. + */ +static void +xfs_shrinkfs_reactivate_ags( + struct xfs_mount *mp, + xfs_agnumber_t oagcount, + xfs_agnumber_t nagcount) +{ + struct xfs_perag *pag = NULL; + xfs_agnumber_t agno; + + ASSERT(nagcount < oagcount); + + for_each_agno_range_reverse(agno, oagcount, nagcount + 1) { + pag = xfs_perag_get(mp, agno); + xfs_perag_activate(pag); + xfs_perag_put(pag); + } +} + +/* + * The function deactivates or puts the AGs to an offline mode. AG deactivation + * or AG offlining means that no new operation can be started on that AG. The AG + * still exists, however no new high level operation (like extent allocation) + * can be started. In terms of implementation, an AG is taken offline or is + * deactivated when xg_active_ref of the struct xfs_perag is 0 i.e, the number + * of active references becomes 0. + * Since active references act as a form of barrier, so once the active + * reference of an AG is 0, no new entity can get an active reference and in + * this way we ensure that once an AG is offline (i.e, active reference count is + * 0), no one will be able to start a new operation in it unless the active + * reference count is explicitly set to 1 i.e, the AG is made online/activated. + */ +static int +xfs_shrinkfs_deactivate_ags( + struct xfs_mount *mp, + xfs_agnumber_t oagcount, + xfs_agnumber_t nagcount) +{ + int error = 0; + struct xfs_perag *pag = NULL; + xfs_agnumber_t agno; + + ASSERT(nagcount < oagcount); + + /* + * If we are removing 1 or more entire AGs, we only need to take those + * AGs offline which we are planning to remove completely. The new tail + * AG which will be partially shrunk should not be taken offline - since + * we will be doing an online operation on them, just like any other + * high level operation. For complete AG removal, we need to take them + * offline since we cannot start any new operation on them as they will + * be removed eventually. + * + * However, if the number of blocks that we are trying to remove is + * an exact multiple of the AG size (in blocks), then the new tail AG + * will not be shrunk at all. + */ + for_each_agno_range_reverse(agno, oagcount, nagcount + 1) { + pag = xfs_perag_get(mp, agno); + error = xfs_perag_deactivate(pag); + if (error) { + xfs_perag_put(pag); + if (agno < oagcount - 1) + xfs_shrinkfs_reactivate_ags(mp, oagcount, + agno + 1); + xfs_clear_shrinking(mp); + error = error == -ERESTARTSYS ? + -ERESTARTSYS : -ENOTEMPTY; + return error; + } + xfs_perag_put(pag); + } + /* + * Now that we have deactivated/offlined the AGs, we need to make sure + * that all the pending operations are completed and the in-core and + * the on disk contents are completely in synch i.e, AGs are stablized + * on to the disk. + */ + error = xfs_shrinkfs_quiesce_ags(mp, oagcount, nagcount); + if (error) { + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); + xfs_clear_shrinking(mp); + return error; + } + + return 0; +} + +/* + * This function does 3 things: + * 1. Deactivate the AGs i.e, wait for all the active references to come to 0. + * 2. Checks whether all the AGs that shrink process needs to remove are empty. + * If at least one of the target AGs is non-empty, shrink fails and + * xfs_shrinkfs_reactivate_ags() is called. + * 3. Calculates the total number of fdblocks (free data blocks) that will be + * removed and stores in id->nfree. + * Please look into the individual functions for more details and the definition + * of the terminologies. + */ +static int +xfs_shrinkfs_prepare_ags( + struct xfs_mount *mp, + xfs_agnumber_t oagcount, + xfs_agnumber_t nagcount, + struct aghdr_init_data *id) +{ + + struct xfs_perag *pag = NULL; + xfs_agnumber_t agno; + int error = 0; + + ASSERT(nagcount < oagcount); + + /* + * Deactivating/offlining the AGs i.e waiting for the active references + * to come down to 0. + */ + error = xfs_shrinkfs_deactivate_ags(mp, oagcount, nagcount); + if (error) + return error; + /* + * At this point the AGs have been deactivated/offlined and the in-core + * and the on-disk are synch. So now we need to check whether all the + * AGs that we are trying to remove/delete are empty. Since we are not + * supporting partial shrink success (i.e, the entire requested size + * will be removed or none), we will bail out with a failure code even + * if 1 AG is non-empty. + */ + for_each_agno_range_reverse(agno, oagcount, nagcount + 1) { + pag = xfs_perag_get(mp, agno); + if (!xfs_perag_is_empty(pag)) { + /* Error out even if one AG is non-empty */ + xfs_perag_put(pag); + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); + xfs_clear_shrinking(mp); + return -ENOTEMPTY; + } + /* + * Since these are removed, these free blocks should also be + * subtracted from the total list of free blocks. + */ + id->nfree += (pag->pagf_freeblks + pag->pagf_flcount); + xfs_perag_put(pag); + } + return 0; +} + +/* + * This function does the job of fully removing the blocks and empty AGs ( + * depending of the values of oagcount and nagcount). By removal it means, + * removal of all the perag data structures, other data structures associated + * with it and all the perag cached buffers (when AGs are removed). Once this + * function succeeds, the AGs/blocks will no longer exist. + * The overall steps are as follows (details are in the function): + * - calculate the number of blocks that will be removed from the new tail AG + * i.e, the AG that will be shrunk partially. + * - call xfs_shrinkfs_remove_ag() that removes the perag cached buffers, + * then frees the perag reservation, other associated datastructures and + * finally the in-memory perag group instance. + */ +static int +xfs_shrinkfs_remove_ags( + struct xfs_mount *mp, + struct xfs_trans **tp, + xfs_agnumber_t oagcount, + xfs_agnumber_t nagcount, + int64_t delta_rem, + xfs_agnumber_t *nagmax) +{ + xfs_agnumber_t agno; + int error = 0; + struct xfs_perag *cur_pag = NULL; + + /* + * This loop is calculating the number of blocks that needs to be + * removed from the new tail AG. If delta_rem is 0 after the loop exits, + * then it means that the number of blocks we want to remove is a + * multiple of AG size (in blocks). + */ + for_each_agno_range_reverse(agno, oagcount, nagcount + 1) { + cur_pag = xfs_perag_get(mp, agno); + delta_rem -= xfs_ag_block_count(mp, agno); + xfs_perag_put(cur_pag); + } + /* + * We are first removing blocks from the AG that will form the new tail + * AG. The reason is that, if we encounter an error here, we can simply + * reactivate the AGs (by calling xfs_shrinkfs_reactivate_ags()). + * Removal of complete empty AGs always succeed anyway. However if we + * remove the empty AGs first (which will succeed) and then the new + * last AG shrink fails, then we will again have to re-initialize the + * removed AGs. Hence the former approach seems more efficient to me. + */ + if (delta_rem) { + /* + * Remove delta_rem blocks from the AG that will form the new + * tail AG after the AGs are removed. If the number of blocks to + * be removed is a multiple of AG size, then nothing is done + * here. + */ + cur_pag = xfs_perag_get(mp, nagcount - 1); + error = xfs_ag_shrink_space(cur_pag, tp, delta_rem); + xfs_perag_put(cur_pag); + if (error) { + if (nagcount < oagcount) + xfs_shrinkfs_reactivate_ags(mp, oagcount, + nagcount); + xfs_clear_shrinking(mp); + return error; + } + } + /* + * Now, in this final step we remove the perag instance and the + * associated datastructures and cached buffers. This fully removes the + * AG. + */ + for_each_agno_range_reverse(agno, oagcount, nagcount + 1) + xfs_shrinkfs_remove_ag(mp, agno); + *nagmax = xfs_set_inode_alloc(mp, nagcount); + return error; +} + /* * growfs operations */ @@ -98,10 +390,11 @@ xfs_growfs_data_private( xfs_agnumber_t nagcount; xfs_agnumber_t nagimax = 0; int64_t delta; + xfs_rfsblock_t nb_div, nb_mod; bool lastag_extended = false; struct xfs_trans *tp; struct aghdr_init_data id = {}; - struct xfs_perag *last_pag; + struct xfs_perag *last_pag = NULL; error = xfs_sb_validate_fsb_count(&mp->m_sb, nb); if (error) @@ -122,6 +415,13 @@ xfs_growfs_data_private( if (error) return error; xfs_growfs_compute_deltas(mp, nb, &delta, &nagcount); + /* + * Fail if the new tail AG length is < XFS_MIN_AG_BLOCKS during shrink + */ + nb_div = nb; + nb_mod = do_div(nb_div, mp->m_sb.sb_agblocks); + if (delta < 0 && nb_mod && nb_mod < XFS_MIN_AG_BLOCKS) + return -EINVAL; /* * Reject filesystems with a single AG because they are not @@ -135,14 +435,23 @@ xfs_growfs_data_private( if (delta == 0) return 0; - /* TODO: shrinking the entire AGs hasn't yet completed */ - if (nagcount < oagcount) - return -EINVAL; + if (delta < 0) + xfs_set_shrinking(mp); + + if (nagcount < oagcount) { + error = xfs_shrinkfs_prepare_ags(mp, oagcount, nagcount, &id); + if (error) + return error; + } /* allocate the new per-ag structures */ error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax); - if (error) + if (error) { + if (nagcount < oagcount) + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); + xfs_clear_shrinking(mp); return error; + } if (delta > 0) error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, @@ -151,18 +460,22 @@ xfs_growfs_data_private( else error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata, -delta, 0, 0, &tp); - if (error) + if (error) { + if (nagcount < oagcount) + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); goto out_free_unused_perag; + } - last_pag = xfs_perag_get(mp, oagcount - 1); if (delta > 0) { + last_pag = xfs_perag_get(mp, oagcount - 1); error = xfs_resizefs_init_new_ags(tp, &id, oagcount, nagcount, delta, last_pag, &lastag_extended); + xfs_perag_put(last_pag); } else { xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK); - error = xfs_ag_shrink_space(last_pag, &tp, -delta); + error = xfs_shrinkfs_remove_ags(mp, &tp, oagcount, nagcount, + -delta, &nagimax); } - xfs_perag_put(last_pag); if (error) goto out_trans_cancel; @@ -171,12 +484,26 @@ xfs_growfs_data_private( * seen by the rest of the world until the transaction commit applies * them atomically to the superblock. */ - if (nagcount > oagcount) - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount); + if (nagcount != oagcount) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, + (int64_t)nagcount - (int64_t)oagcount); if (delta) xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta); - if (id.nfree) - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree); + if (id.nfree) { + /* + * If nagcount < oagcount, then the AG deactivation step + * of the shrink process has already reserved the free + * datablocks and subtracted it from the incore fdblock + * counters. So now all we need to do is subtract 400 from the + * ondisk fdblocks counters. + */ + if (nagcount < oagcount) + xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, + delta > 0 ? id.nfree : (int64_t)-id.nfree); + else + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, + delta > 0 ? id.nfree : (int64_t)-id.nfree); + } /* * Sync sb counters now to reflect the updated values. This is @@ -188,12 +515,18 @@ xfs_growfs_data_private( xfs_trans_set_sync(tp); error = xfs_trans_commit(tp); - if (error) + if (error) { + if (nagcount < oagcount) + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount); + xfs_clear_shrinking(mp); return error; + } /* New allocation groups fully initialized, so update mount struct */ if (nagimax) mp->m_maxagi = nagimax; + if (nagcount < oagcount) + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp); xfs_set_low_space_thresholds(mp); mp->m_alloc_set_aside = xfs_alloc_set_aside(mp); @@ -222,7 +555,7 @@ xfs_growfs_data_private( xfs_rtrmapbt_compute_maxlevels(mp); xfs_rtrefcountbt_compute_maxlevels(mp); } - + xfs_clear_shrinking(mp); return error; out_trans_cancel: @@ -230,6 +563,7 @@ xfs_growfs_data_private( out_free_unused_perag: if (nagcount > oagcount) xfs_free_perag_range(mp, oagcount, nagcount); + xfs_clear_shrinking(mp); return error; } diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index f046d1215b04..3ee893e77ab7 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -578,6 +578,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID) #define XFS_OPSTATE_WARNED_ZONED 19 /* (Zoned) GC is in progress */ #define XFS_OPSTATE_ZONEGC_RUNNING 20 +/* filesystem is shrinking */ +#define XFS_OPSTATE_SHRINKING 21 #define __XFS_IS_OPSTATE(name, NAME) \ static inline bool xfs_is_ ## name (struct xfs_mount *mp) \ @@ -600,6 +602,7 @@ __XFS_IS_OPSTATE(inode32, INODE32) __XFS_IS_OPSTATE(readonly, READONLY) __XFS_IS_OPSTATE(inodegc_enabled, INODEGC_ENABLED) __XFS_IS_OPSTATE(blockgc_enabled, BLOCKGC_ENABLED) +__XFS_IS_OPSTATE(shrinking, SHRINKING) #ifdef CONFIG_XFS_QUOTA __XFS_IS_OPSTATE(quotacheck_running, QUOTACHECK_RUNNING) __XFS_IS_OPSTATE(resuming_quotaon, RESUMING_QUOTAON) diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 474f5a04ec63..986b15e2b5d0 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -409,7 +409,6 @@ xfs_trans_mod_sb( tp->t_dblocks_delta += delta; break; case XFS_TRANS_SB_AGCOUNT: - ASSERT(delta > 0); tp->t_agcount_delta += delta; break; case XFS_TRANS_SB_IMAXPCT: -- 2.43.5 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-10-20 15:43 [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) ` (2 preceding siblings ...) 2025-10-20 15:43 ` [RFC V3 3/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) @ 2025-10-22 7:17 ` Christoph Hellwig 2025-10-22 16:05 ` Darrick J. Wong 3 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2025-10-22 7:17 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: linux-xfs, ritesh.list, ojaswin, djwong, bfoster, david, hsiangkao On Mon, Oct 20, 2025 at 09:13:41PM +0530, Nirjhar Roy (IBM) wrote: > This work is based on a previous RFC[1] by Gao Xiang and various ideas > proposed by Dave Chinner in the RFC[1]. > > Currently the functionality of shrink is limited to shrinking the last > AG partially but not beyond that. This patch extends the functionality > to support shrinking beyond 1 AG. However the AGs that we will be remove > have to empty in order to prevent any loss of data. > > The patch begins with the re-introduction of some of the data > structures that were removed, some code refactoring and > finally the patch that implements the multi AG shrink design. > The final patch has all the details including the definition of the > terminologies and the overall design. I'm still missing what the overall plan is here. For "normal" XFS setups you'll always have inodes that we can't migrate. Do you plan to use this with inode32 only? Also it would be nice to extent this to rtgroups, as we are guaranteed to not have non-migratable metadata there and things will actually just work. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-10-22 7:17 ` [RFC V3 0/3] " Christoph Hellwig @ 2025-10-22 16:05 ` Darrick J. Wong 2025-10-23 5:40 ` Nirjhar Roy (IBM) 2025-10-23 6:34 ` Christoph Hellwig 0 siblings, 2 replies; 16+ messages in thread From: Darrick J. Wong @ 2025-10-22 16:05 UTC (permalink / raw) To: Christoph Hellwig Cc: Nirjhar Roy (IBM), linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On Wed, Oct 22, 2025 at 12:17:27AM -0700, Christoph Hellwig wrote: > On Mon, Oct 20, 2025 at 09:13:41PM +0530, Nirjhar Roy (IBM) wrote: > > This work is based on a previous RFC[1] by Gao Xiang and various ideas > > proposed by Dave Chinner in the RFC[1]. > > > > Currently the functionality of shrink is limited to shrinking the last > > AG partially but not beyond that. This patch extends the functionality > > to support shrinking beyond 1 AG. However the AGs that we will be remove > > have to empty in order to prevent any loss of data. > > > > The patch begins with the re-introduction of some of the data > > structures that were removed, some code refactoring and > > finally the patch that implements the multi AG shrink design. > > The final patch has all the details including the definition of the > > terminologies and the overall design. > > I'm still missing what the overall plan is here. For "normal" XFS > setups you'll always have inodes that we can't migrate. Do you plan > to use this with inode32 only? ...or resurrect xfs_reno? Data/attr extent migration might not be too hard if we can repurpose xfs_zonegc for relocations. I think moving inodes is going to be very very difficult because there's no way to atomically update all the parents. (Not to mention whatever happens when the inumber abruptly changes) > Also it would be nice to extent this > to rtgroups, as we are guaranteed to not have non-migratable metadata > there and things will actually just work. Seconded. --D ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-10-22 16:05 ` Darrick J. Wong @ 2025-10-23 5:40 ` Nirjhar Roy (IBM) 2025-10-23 6:34 ` Christoph Hellwig 1 sibling, 0 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-10-23 5:40 UTC (permalink / raw) To: Darrick J. Wong, Christoph Hellwig Cc: linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On 10/22/25 21:35, Darrick J. Wong wrote: > On Wed, Oct 22, 2025 at 12:17:27AM -0700, Christoph Hellwig wrote: >> On Mon, Oct 20, 2025 at 09:13:41PM +0530, Nirjhar Roy (IBM) wrote: >>> This work is based on a previous RFC[1] by Gao Xiang and various ideas >>> proposed by Dave Chinner in the RFC[1]. >>> >>> Currently the functionality of shrink is limited to shrinking the last >>> AG partially but not beyond that. This patch extends the functionality >>> to support shrinking beyond 1 AG. However the AGs that we will be remove >>> have to empty in order to prevent any loss of data. >>> >>> The patch begins with the re-introduction of some of the data >>> structures that were removed, some code refactoring and >>> finally the patch that implements the multi AG shrink design. >>> The final patch has all the details including the definition of the >>> terminologies and the overall design. >> I'm still missing what the overall plan is here. For "normal" XFS >> setups you'll always have inodes that we can't migrate. Do you plan >> to use this with inode32 only? > ...or resurrect xfs_reno? > > Data/attr extent migration might not be too hard if we can repurpose > xfs_zonegc for relocations. I think moving inodes is going to be very > very difficult because there's no way to atomically update all the > parents. > > (Not to mention whatever happens when the inumber abruptly changes) > >> Also it would be nice to extent this >> to rtgroups, as we are guaranteed to not have non-migratable metadata >> there and things will actually just work. > Seconded. Yes, I plan to extend this to rtgroups too in a separate series. For this patch series, I have targeted only the non realtime groups. --NR > > --D -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-10-22 16:05 ` Darrick J. Wong 2025-10-23 5:40 ` Nirjhar Roy (IBM) @ 2025-10-23 6:34 ` Christoph Hellwig 2025-11-05 7:56 ` Nirjhar Roy (IBM) 1 sibling, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2025-10-23 6:34 UTC (permalink / raw) To: Darrick J. Wong Cc: Christoph Hellwig, Nirjhar Roy (IBM), linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On Wed, Oct 22, 2025 at 09:05:32AM -0700, Darrick J. Wong wrote: > > I'm still missing what the overall plan is here. For "normal" XFS > > setups you'll always have inodes that we can't migrate. Do you plan > > to use this with inode32 only? > > ...or resurrect xfs_reno? That only brings up some vague memories. But anything in userspace would not be transactional safe anyway. > > Data/attr extent migration might not be too hard if we can repurpose > xfs_zonegc for relocations. The zonegc code is very heavily dependent on not having to deal with freespace fragmentation for writes, so I don't think the code is directly reusable. But the overall idea applies, yes. > I think moving inodes is going to be very > very difficult because there's no way to atomically update all the > parents. > > (Not to mention whatever happens when the inumber abruptly changes) Yes. That's my big issues with all the shrink plans, what are we going to do about inodes? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-10-23 6:34 ` Christoph Hellwig @ 2025-11-05 7:56 ` Nirjhar Roy (IBM) 2025-11-05 13:12 ` Christoph Hellwig 0 siblings, 1 reply; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-11-05 7:56 UTC (permalink / raw) To: Christoph Hellwig, Darrick J. Wong Cc: linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On 10/23/25 12:04, Christoph Hellwig wrote: > On Wed, Oct 22, 2025 at 09:05:32AM -0700, Darrick J. Wong wrote: >>> I'm still missing what the overall plan is here. For "normal" XFS >>> setups you'll always have inodes that we can't migrate. Do you plan >>> to use this with inode32 only? >> ...or resurrect xfs_reno? > That only brings up some vague memories. But anything in userspace > would not be transactional safe anyway. > >> Data/attr extent migration might not be too hard if we can repurpose >> xfs_zonegc for relocations. > The zonegc code is very heavily dependent on not having to deal with > freespace fragmentation for writes, so I don't think the code is > directly reusable. But the overall idea applies, yes. > >> I think moving inodes is going to be very >> very difficult because there's no way to atomically update all the >> parents. >> >> (Not to mention whatever happens when the inumber abruptly changes) > Yes. That's my big issues with all the shrink plans, what are we > going to do about inodes? Hi Christoph and Darrick, Sorry for the delayed response. So, my initial plan was to get the the shrink work only for empty AGs for now (since we already have the last AG partial shrink merged). Do you think this will be helpful for users? Regarding the data/inode movement, can you please give me some ideas/pointers as to how can we move the inodes. I can in parallel start exploring those areas and work incrementally. --NR -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-11-05 7:56 ` Nirjhar Roy (IBM) @ 2025-11-05 13:12 ` Christoph Hellwig 2025-11-05 19:13 ` Darrick J. Wong 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2025-11-05 13:12 UTC (permalink / raw) To: Nirjhar Roy (IBM) Cc: Christoph Hellwig, Darrick J. Wong, linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On Wed, Nov 05, 2025 at 01:26:50PM +0530, Nirjhar Roy (IBM) wrote: > Sorry for the delayed response. So, my initial plan was to get the the > shrink work only for empty AGs for now (since we already have the last AG > partial shrink merged). For normal XFS file systems that isn't really very useful, as the last AG will typically have inodes as well. Unless we decide and actively promoted inode32 for uses cases that want shrinking. Which reminds me that we really should look into maybe promoting metadata primary AGs - on SSDs that will most likely give us better I/O patterns to the device, or at least none that are any worse without it. > Do you think this will be helpful for users? > Regarding the data/inode movement, can you please give me some > ideas/pointers as to how can we move the inodes. I can in parallel start > exploring those areas and work incrementally. I don't really have a really good idea except for having either a new btree or a major modification to the inobt provide the inode number to disk location mapping. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-11-05 13:12 ` Christoph Hellwig @ 2025-11-05 19:13 ` Darrick J. Wong 2025-11-10 7:05 ` Nirjhar Roy (IBM) 0 siblings, 1 reply; 16+ messages in thread From: Darrick J. Wong @ 2025-11-05 19:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Nirjhar Roy (IBM), linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On Wed, Nov 05, 2025 at 05:12:55AM -0800, Christoph Hellwig wrote: > On Wed, Nov 05, 2025 at 01:26:50PM +0530, Nirjhar Roy (IBM) wrote: > > Sorry for the delayed response. So, my initial plan was to get the the > > shrink work only for empty AGs for now (since we already have the last AG > > partial shrink merged). > > For normal XFS file systems that isn't really very useful, as the last > AG will typically have inodes as well. > > Unless we decide and actively promoted inode32 for uses cases that want > shrinking. Which reminds me that we really should look into maybe > promoting metadata primary AGs - on SSDs that will most likely give us > better I/O patterns to the device, or at least none that are any worse > without it. I don't think we quite want inode32 per se -- I think what would be more useful for these shrink cases is constraining inode allocations between AG 0 and whichever AG the log is in (since you also can't move the log), and only expanding the allowed AGs if we hit ENOSPC. (Or as hch suggested, porting to rtgroups would at least strengthen the justification for merging because there are no inodes to get in the way on the realtime volume.) > > Do you think this will be helpful for users? > > Regarding the data/inode movement, can you please give me some > > ideas/pointers as to how can we move the inodes. I can in parallel start > > exploring those areas and work incrementally. > > I don't really have a really good idea except for having either a new > btree or a major modification to the inobt provide the inode number to > disk location mapping. Storing the inode cores in the inobt itself, which would be uuuuugly. --D ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs 2025-11-05 19:13 ` Darrick J. Wong @ 2025-11-10 7:05 ` Nirjhar Roy (IBM) 0 siblings, 0 replies; 16+ messages in thread From: Nirjhar Roy (IBM) @ 2025-11-10 7:05 UTC (permalink / raw) To: Darrick J. Wong, Christoph Hellwig Cc: linux-xfs, ritesh.list, ojaswin, bfoster, david, hsiangkao On 11/6/25 00:43, Darrick J. Wong wrote: > On Wed, Nov 05, 2025 at 05:12:55AM -0800, Christoph Hellwig wrote: >> On Wed, Nov 05, 2025 at 01:26:50PM +0530, Nirjhar Roy (IBM) wrote: >>> Sorry for the delayed response. So, my initial plan was to get the the >>> shrink work only for empty AGs for now (since we already have the last AG >>> partial shrink merged). >> For normal XFS file systems that isn't really very useful, as the last >> AG will typically have inodes as well. >> >> Unless we decide and actively promoted inode32 for uses cases that want >> shrinking. Which reminds me that we really should look into maybe >> promoting metadata primary AGs - on SSDs that will most likely give us >> better I/O patterns to the device, or at least none that are any worse >> without it. > I don't think we quite want inode32 per se -- I think what would be more > useful for these shrink cases is constraining inode allocations between > AG 0 and whichever AG the log is in (since you also can't move the log), > and only expanding the allowed AGs if we hit ENOSPC. Makes sense. > > (Or as hch suggested, porting to rtgroups would at least strengthen the > justification for merging because there are no inodes to get in the way > on the realtime volume.) Okay - in that case, I will start working on expanding this present patch series to be working with real time groups as well. Thank you for your suggestions. --NR > >>> Do you think this will be helpful for users? >>> Regarding the data/inode movement, can you please give me some >>> ideas/pointers as to how can we move the inodes. I can in parallel start >>> exploring those areas and work incrementally. >> I don't really have a really good idea except for having either a new >> btree or a major modification to the inobt provide the inode number to >> disk location mapping. > Storing the inode cores in the inobt itself, which would be uuuuugly. > > --D -- Nirjhar Roy Linux Kernel Developer IBM, Bangalore ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2026-02-02 16:53 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-10-20 15:43 [RFC V3 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 2/3] xfs: Refactoring the nagcount and delta calculation Nirjhar Roy (IBM) 2026-02-02 14:15 ` Nirjhar Roy (IBM) 2026-02-02 14:38 ` Christoph Hellwig 2026-02-02 16:50 ` Carlos Maiolino 2026-02-02 16:53 ` Nirjhar Roy (IBM) 2025-10-20 15:43 ` [RFC V3 3/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM) 2025-10-22 7:17 ` [RFC V3 0/3] " Christoph Hellwig 2025-10-22 16:05 ` Darrick J. Wong 2025-10-23 5:40 ` Nirjhar Roy (IBM) 2025-10-23 6:34 ` Christoph Hellwig 2025-11-05 7:56 ` Nirjhar Roy (IBM) 2025-11-05 13:12 ` Christoph Hellwig 2025-11-05 19:13 ` Darrick J. Wong 2025-11-10 7:05 ` Nirjhar Roy (IBM)
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox