From: "Darrick J. Wong" <djwong@kernel.org>
To: "Nirjhar Roy (IBM)" <nirjhar.roy.lists@gmail.com>
Cc: linux-xfs@vger.kernel.org, ritesh.list@gmail.com,
ojaswin@linux.ibm.com, bfoster@redhat.com, david@fromorbit.com,
hsiangkao@linux.alibaba.com
Subject: Re: [RFC 3/3] xfs: Add support to shrink multiple empty AGs
Date: Thu, 14 Aug 2025 11:31:39 -0700 [thread overview]
Message-ID: <20250814183139.GE7981@frogsfrogsfrogs> (raw)
In-Reply-To: <878dcefe-944a-44ad-8975-890a3973c959@gmail.com>
On Wed, Aug 13, 2025 at 01:34:59PM +0530, Nirjhar Roy (IBM) wrote:
>
> On 8/13/25 00:37, Darrick J. Wong wrote:
> > On Thu, Jul 31, 2025 at 12:06:11AM +0530, Nirjhar Roy (IBM) wrote:
> > > On 7/30/25 02:35, Darrick J. Wong wrote:
> > > > On Thu, Jul 17, 2025 at 04:00:45PM +0530, Nirjhar Roy (IBM) wrote:
> > > > > This patch is based on a previous RFC[1] by Gao Xiang and various
> > > > > ideas proposed by Dave Chinner in the RFC[1].
> > > > >
> > > > > This patch adds the functionality to shrink the filesystem beyond
> > > > > 1 AG. We can remove only empty AGs to prevent loss of data.
> > > > > Before I summarize the overall steps of the shrink process, I
> > > > > would like to introduce some of the terminologies:
> > > > >
> > > > > 1. Empty AG - An AG that is completely used, and no block
> > > > > is being used/allocated for data or metadata and no
> > > > > log blocks are allocated here. This ensures that
> > > > > removal of this AG doesn't result in data loss.
> > > > >
> > > > > 2. Active/Online AG - Online AG and active AG will be used
> > > > > interchangebly. An AG is active or online when all the regular
> > > > > operations can be done on it. When we mount a filesystem, all
> > > > > the AGs are by default online/active. In terms of implementation,
> > > > > an online AG will have number of active references greater than 1
> > > > > (default is 1 i.e, an AG by default is online/active).
> > > > >
> > > > > 3. AG offlining/deactivation - AG Offlining and AG deactivation will
> > > > > be used interchangebly. An AG is said to be offline/deactivated
> > > > > when no new high level operation can be started on the AG. This is
> > > > > implemented with the help of active references. When the active
> > > > > reference count of an AG is 0, the AG is said to be deactivated.
> > > > > No new active reference can be taken if the present active reference
> > > > > count is 0. This way a barrier is formed from preventing new high
> > > > > level operations to get started on an already offlined AG.
> > > > >
> > > > > 4. Reactivating an AG - If we try to remove an offline AG but for some
> > > > > reason, we can't, then we reactivate the AG i.e, the AG will once
> > > > > more be in an usable state i.e, the active reference count will be
> > > > > set to 1. All the high level operations can now be performed on this
> > > > > AG. In terms of implementation, in order to activate an AG, we
> > > > > atomically set the active reference count to 1.
> > > > >
> > > > > 5. AG removal - This means that AG no longer exists in the filesystem.
> > > > > It will be reflected in the usable/total size of the device too
> > > > > (using tools like df).
> > > > >
> > > > > 6. New tail AG - This refers to the last AG that will be formed after
> > > > > removal of 1 or more AGs. For example if there 4 AGs and each
> > > > > with 32 blocks, so there are total of 4 * 32 = 128 blocks. Now,
> > > > > if we remove 40 blocks, AG 3(indexed at 0 will be completely
> > > > > removed (32 blocks) and from AG 2, we will remove 8 blocks.
> > > > > So AG 2, is the new tail AG.
> > > > > 7. Old tail AG - This is the last AG before the start of the shrink
> > > > > process.
> > > > >
> > > > > 8. AG stabilization - This simply means that the in-memory contents
> > > > > are synched to the disk.
> > > > >
> > > > > The overall steps for shrinking AGs is as follows:
> > > > > PHASE 1: Preparing the AGs for removal
> > > > > 1. Deactivate the AGs to be removed completely - This is done
> > > > > by the function xfs_shrinkfs_deactivate_ags(). This step involves
> > > > > waiting for the active references for target AGs to go come down
> > > > > to 0.
> > > > > This is done so that no other entity is racing while the removal
> > > > > is in progress i.e, no new high level operation can start on that
> > > > > AG while we are trying to remove the AG.
> > > > > 2. Once we have waited for the active references to come down to 0,
> > > > > we make sure that all the pending operations on that AG are completed
> > > > > and the in-core and on-disk structures are in synch i.e, the AG is
> > > > > stablized on to the disk.
> > > > stabilized
> > > Noted.
> > > > > The steps to stablize the AG onto the disk are as follows:
> > > > > 2.a Wait for all the busy extents for the target AGs to be resolved
> > > > > (done by the function xfs_extent_busy_wait_range())
> > > > > 2.b Flush the xfs_discard_wq workqueue
> > > > > 2.c We need to flush and empty the logs and wait for all the pending
> > > > > I/Os to complete - for this perform a log quiesce by calling
> > > > > xfs_log_quiesce(). This also ensures that none of the future
> > > > > logged transactions will refer to these AGs during log
> > > > > recovery in case if sudden shutdown/crash happens while we
> > > > > are trying to remove these AGs.
> > > > Doesn't flushing the log put a bunch of busy extents on xfs_discard_wq?
> > > > So doesn't 2c have to happen before 2b?
> > > Can you please tell me how flushing of logs will add more busy extents? My
> > > understanding is that busy extents get created during extent free
> > > operations(like file delete/truncate etc). Are you saying that once the logs
> > > are flushed, the log infrastructure will free the extents it was
> > > using(during reservation) and that would add to the list of busy extents?
> > IIRC, forcing the log to disk writes any open iclogs to disk, and the IO
> > completion from the iclog kicks off bdev discards of everything on the
> > busy extent list. When the discard bios complete, they get put on
> > xfs_discard_wq for ioend processing, which removes the busy extent list
> > entry.
> >
> > In short, I think you need to wait for the log cover to hit the disk and
> > then for all the discard bios to complete, so that there really is no
> > more IO going to the blocks in the doomed AG(s) before we get to step 3.
> Okay. Thank you for the explanation. I will dig deeper into this based on
> your explanation to have a better understanding.
> >
> > > > > 3. Once the AG is deactivated and stabilized on to the disk, we check if
> > > > > all the target AGs are empty, and if not, we fail the shrink process.
> > > > > We are not supporting partial shrink support i.e, the shrink will
> > > > > either completely fail or completely succeed.
> > > > >
> > > > > PHASE 2: Actual removal of the AGs
> > > > So in phase 2, we know that all the AGs can be freed and the new tail AG
> > > > shrunk?
> > > After phase 1 is completed, and we are in the beginning of phase 2, we know
> > > that all the AGs (which we are trying to remove COMPLETELY) can be
> > > removed/freed. The tail/new-tail AG is NOT shrunk in phase 1. It will be
> > > shrunk in phase 2. Phase 2 does the complete AG removal + tail/new-tail AG
> > > shrink. Does that answer your question?
> > >
> > > Maybe I can rephrase it to "PHASE 2: Actual removal of the AGs +
> > > tail/new-tail AG shrink" - what do you think?
> > "PHASE 2: Shrink new tail group, punch out totally empty groups" ?
> Yeah, this sounds good.
> >
> > > > > 4. Once the preparation phase is over, we start the actual removal
> > > > > process. This is done in the function xfs_shrink_start(). Here we
> > > > > first remove the blocks, then update the metadata of new last tail
> > > > > AG and then remove the AGs (and their associated data structures)
> > > > > one by one (in function xfs_shrink_remove_empty_ag()).
> > > > > 5. In the end we log the changes and commit the transaction.
> > > > >
> > > > > Removal of each AG is done by the function xfs_shrink_remove_empty_ag().
> > > > > The steps can be outlined as follows:
> > > > > 1. Free the per AG reservation - this will result in correct free
> > > > > space/used space information.
> > > > > 2. Freeing the intents drain queue.
> > > > > 3. Freeing busy extents list.
> > > > > 4. Remove the perag cached buffers and then the buffer cache.
> > > > > 5. Freeing the struct xfs_group pointer - Before this is done, we
> > > > > assert that all the active and passive references are down to 0.
> > > > > We remove all the cached buffers associated with the offlined AGs
> > > > > to be removed - this releases the passive references of the AGs
> > > > > consumed by the cached buffers.
> > > > Migrating data out of the AG is left as an exercise to userspace, I
> > > > assume?
> > > That is correct. For now, I am targeting to remove only empty regular
> > > (non-realtime) AGs.
> > > > > [1] https://lore.kernel.org/all/20210414195240.1802221-1-hsiangkao@redhat.com/
> > > > >
> > > > > Signed-off-by: Nirjhar Roy (IBM) <nirjhar.roy.lists@gmail.com>
> > > > > Inspired-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > > > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > > > ---
> > > > > fs/xfs/libxfs/xfs_ag.c | 135 ++++++++++++++++++-
> > > > > fs/xfs/libxfs/xfs_ag.h | 10 ++
> > > > > fs/xfs/libxfs/xfs_alloc.c | 9 +-
> > > > > fs/xfs/xfs_buf.c | 76 +++++++++++
> > > > > fs/xfs/xfs_buf.h | 1 +
> > > > > fs/xfs/xfs_buf_item_recover.c | 37 ++++--
> > > > > fs/xfs/xfs_extent_busy.c | 28 ++++
> > > > > fs/xfs/xfs_extent_busy.h | 2 +
> > > > > fs/xfs/xfs_fsops.c | 241 ++++++++++++++++++++++++++++++++--
> > > > > fs/xfs/xfs_trans.c | 1 -
> > > > > 10 files changed, 502 insertions(+), 38 deletions(-)
> > > > >
> > > > > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > > > > index dcaf5683028e..f7fa7f53f2ec 100644
> > > > > --- a/fs/xfs/libxfs/xfs_ag.c
> > > > > +++ b/fs/xfs/libxfs/xfs_ag.c
> > > > > @@ -193,21 +193,32 @@ xfs_agino_range(
> > > > > }
> > > > > /*
> > > > > - * Update the perag of the previous tail AG if it has been changed during
> > > > > - * recovery (i.e. recovery of a growfs).
> > > > > + * This function does the following:
> > > > > + * - Updates the previous perag tail if prev_agcount < current agcount i.e, the
> > > > > + * filesystem has grown OR
> > > > > + * - Updates the current tail AG when prev_agcount > current agcount i.e, the
> > > > > + * filesystem has shrunk beyond 1 AG OR
> > > > > + * - Updates the current tail AG when only the last AG was shrunk or grown i.e,
> > > > > + * prev_agcount == mp->m_sb.sb_agcount.
> > > > > */
> > > > > int
> > > > > xfs_update_last_ag_size(
> > > > > struct xfs_mount *mp,
> > > > > xfs_agnumber_t prev_agcount)
> > > > > {
> > > > > - struct xfs_perag *pag = xfs_perag_grab(mp, prev_agcount - 1);
> > > > > + xfs_agnumber_t agno;
> > > > > +
> > > > > + if (prev_agcount >= mp->m_sb.sb_agcount)
> > > > > + agno = mp->m_sb.sb_agcount - 1;
> > > > > + else
> > > > > + agno = prev_agcount - 1;
> > > > > +
> > > > > + struct xfs_perag *pag = xfs_perag_grab(mp, agno);
> > > > Can we please keep the variable declarations at the top?
> > > >
> > > > I know I'm an old C89 dinosaur.
> > > Noted.
> > > > > if (!pag)
> > > > > return -EFSCORRUPTED;
> > > > > - pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp,
> > > > > - prev_agcount - 1, mp->m_sb.sb_agcount,
> > > > > - mp->m_sb.sb_dblocks);
> > > > > + pag_group(pag)->xg_block_count = __xfs_ag_block_count(mp, agno,
> > > > > + mp->m_sb.sb_agcount, mp->m_sb.sb_dblocks);
> > > > and please keep the two level indent for the continuation
> > > Noted.
> > > > > __xfs_agino_range(mp, pag_group(pag)->xg_block_count, &pag->agino_min,
> > > > > &pag->agino_max);
> > > > > xfs_perag_rele(pag);
> > > > > @@ -290,6 +301,22 @@ xfs_initialize_perag(
> > > > > return error;
> > > > > }
> > > > > +void
> > > > > +xfs_activate_ag(struct xfs_perag *pag)
> > > > xfs_perag_activate() ?
> > > >
> > > > Functions that act upon an xfs_perag object tend to have "xfs_perag" in
> > > > their name, not xfs_ag.
> > > Okay. Noted.
> > > > > +{
> > > > > + ASSERT(!xfs_ag_is_active(pag));
> > > > > + init_waitqueue_head(&pag_group(pag)->xg_active_wq);
> > > > > + atomic_set(&pag_group(pag)->xg_active_ref, 1);
> > > > > +}
> > > > > +
> > > > > +void
> > > > > +xfs_deactivate_ag(struct xfs_perag *pag)
> > > > > +{
> > > > > + ASSERT(xfs_ag_is_active(pag));
> > > > > + xfs_perag_rele(pag);
> > > > > + wait_event(pag_group(pag)->xg_active_wq, !xfs_ag_is_active(pag));
> > > > > +}
> > > > > +
> > > > > static int
> > > > > xfs_get_aghdr_buf(
> > > > > struct xfs_mount *mp,
> > > > > @@ -758,7 +785,6 @@ xfs_ag_shrink_space(
> > > > > xfs_agblock_t aglen;
> > > > > int error, err2;
> > > > > - ASSERT(pag_agno(pag) == mp->m_sb.sb_agcount - 1);
> > > > > error = xfs_ialloc_read_agi(pag, *tpp, 0, &agibp);
> > > > > if (error)
> > > > > return error;
> > > > > @@ -872,6 +898,101 @@ xfs_ag_shrink_space(
> > > > > return err2;
> > > > > }
> > > > > +/*
> > > > > + * This function checks whether an AG is empty. An AG is eligbible to be
> > > > eligible
> > > > 1
> > > Noted.
> > > > > + * removed if it empty.
> > > > > + */
> > > > > +bool
> > > > > +xfs_ag_is_empty(struct xfs_perag *pag)
> > > > > +{
> > > > > + struct xfs_buf *agfbp = NULL;
> > > > > + struct xfs_mount *mp = pag_mount(pag);
> > > > > + bool is_empty = false;
> > > > > + int error = 0;
> > > > (More indenting/style problems)
> > > Noted. I will fix all the style/typos in the next revision.
> > > > > +
> > > > > + /*
> > > > > + * Read the on-disk data structures to get the correct length of the AG.
> > > > > + * All the AGs have the same length except the last AG.
> > > > > + */
> > > > > + error = xfs_alloc_read_agf(pag, NULL, 0, &agfbp);
> > > > > + if (!error) {
> > > > > + struct xfs_agf *agf = agfbp->b_addr;
> > > > > + /*
> > > > > + * We don't need to check if log blocks belong here since the
> > > > > + * log blocks are taken from the number of free blocks, and if
> > > > > + * the given AG has log blocks, then those many number of
> > > > > + * blocks will be consumed from the number of free blocks and
> > > > > + * the AG empty condition will not hold true.
> > > > > + */
> > > > > + if (pag->pagf_freeblks + pag->pagf_flcount +
> > > > > + mp->m_ag_prealloc_blocks ==
> > > > > + be32_to_cpu(agf->agf_length)) {
> > > > > + ASSERT(!xfs_ag_contains_log(mp, pag_agno(pag)));
> > > > Why not just check for this at the top?
> > > >
> > > > if (xfs_ag_contains_log(mp, pag_agno(pag)))
> > > > return false;
> > > "if (pag->pagf_freeblks + pag->pagf_flcount + mp->m_ag_prealloc_blocks ==
> > > be32_to_cpu(agf->agf_length))" already covers this case. Or are you saying
> > > that by checking for the logs first saves us some time from calling
> > > xfs_alloc_read_agf() - if the AG indeed contains log blocks? My motive for
> > > the ASSERTion is mostly out of paranoia. Do you feel this ASSERTion is
> > > redundant?
> > I think I was reading too fast. But maybe the assertion isn't necessary
> > at all? The free space btrees will not have an entry for the log space,
> > so we should never be able to delete the log's space anyway.
> Okay.
> >
> > What happens if you want to shrink the fs so that the last block of the
> > fs is the last block of the log?
>
> If we're shrinking such that the last fsblock of the filesystem = last
> fsblock of the log, then the shrink fails with -ENOSPC. The code that fails
> is the following (in xfs_ag_shrink_space()):
>
> /* internal log shouldn't also show up in the free space btrees */
> error = xfs_alloc_vextent_exact_bno(&args,
> xfs_agbno_to_fsb(pag, aglen - delta));
> if (!error && args.agbno == NULLAGBLOCK) {
> error = -ENOSPC;
> }
>
> For example, the empty fs I tried has 16 AGs of size 128m (with 4k block
> size). So total of blocks=524288 and agcount=16, agsize=32768. From xfs_db,
> I found out the logstart and logblocks
>
> logstart = 262151 and logblocks=16384. So, the log blocks are from 262151 to
> 278534. Now I tried
>
> root@AMARPC-vm: /home/ubuntu$ xfs_growfs -D $(( 262151 + 16384 - 1 ))
> /mnt1/scratch
> ...
> [EXPERIMENTAL] try to shrink unused space 278534, old size is 524288
> xfs_growfs: XFS_IOC_FSGROWFSDATA xfsctl failed: No space left on device
Ah ok.
> One interesting thing that I noticed is that, the shrink fails for all the
> fssizes from the last log fsblock to last log fsblock+18 i.e, xfs_growfs -D
> {278534 to 278551} fails with -ENOSPC (same code in xfs_ag_shrink_space()
> mentioned above). I am not sure what these extra blocks are allocated for
> (in an empty fs), maybe some reservation for btree growth or something? Let
> me figure that out.
>
Yeah, that's probably the perag free space reservation code getting in
the way.
--D
> > > > error = xfs_alloc_read_agf()
> > > >
> > > > > + is_empty = true;
> > > > > + }
> > > > > + xfs_buf_relse(agfbp);
> > > > > + }
> > > > > + return is_empty;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function removes an entire empty AG. Before removing the struct
> > > > > + * xfs_perag reference, it removes the associated data structures. Before
> > > > > + * removing an AG, the caller must ensure that the AG has been deactivated with
> > > > > + * no active references and it has been fully stabilized on the disk.
> > > > > + */
> > > > > +void
> > > > > +xfs_shrinkfs_remove_ag(struct xfs_mount *mp, xfs_agnumber_t agno)
> > > > > +{
> > > > > + /*
> > > > > + * Number of AGs can't be less than 2
> > > > > + */
> > > > > + ASSERT(agno >= 2);
> > > > > + struct xfs_group *xg = xa_erase(&mp->m_groups[XG_TYPE_AG].xa, agno);
> > > > > + struct xfs_perag *cur_pag = to_perag(xg);
> > > > > +
> > > > > + ASSERT(!xfs_ag_is_active(cur_pag));
> > > > > + /*
> > > > > + * Since we are freeing the AG, we should clear the perag reservations
> > > > > + * for the corresponding AGs.
> > > > > + */
> > > > > + xfs_ag_resv_free(cur_pag);
> > > > > + /*
> > > > > + * We have already ensured in the AG preparation phase that all intents
> > > > > + * for the offlined AGs have been resolved. So it safe to free it here.
> > > > > + */
> > > > > + xfs_defer_drain_free(&xg->xg_intents_drain);
> > > > > + /*
> > > > > + * We have already ensured in the AG preparation phase that all busy
> > > > > + * extents for the offlined AGs have been resolved. So it safe to free
> > > > > + * it here.
> > > > > + */
> > > > > + kfree(xg->xg_busy_extents);
> > > > > + cancel_delayed_work_sync(&cur_pag->pag_blockgc_work);
> > > > > +
> > > > > + /*
> > > > > + * Remove all the cached buffers for the given AG.
> > > > > + */
> > > > > + xfs_buf_offline_perag_rele_cached(cur_pag);
> > > > > + /*
> > > > > + * Now that the cached buffers have been released, remove the
> > > > > + * cache/hashtable itself. We should not change the order of the buffer
> > > > > + * removal and cache removal.
> > > > > + */
> > > > > + xfs_buf_cache_destroy(&cur_pag->pag_bcache);
> > > > > + /*
> > > > > + * One final assert, before we remove the xg. Since the cached buffers
> > > > > + * for the offlined AGs are already removed, their passive references
> > > > > + * should be 0. Also, the active references are 0 too, so no new
> > > > > + * operation can start and race and get new references.
> > > > > + */
> > > > > + XFS_IS_CORRUPT(mp, atomic_read(&pag_group(cur_pag)->xg_ref) != 0);
> > > > > + /*
> > > > > + * Finally free the struct xfs_perag of the AG.
> > > > > + */
> > > > > + kfree_rcu_mightsleep(xg);
> > > > > +}
> > > > > +
> > > > > void
> > > > > xfs_growfs_get_delta(struct xfs_mount *mp, xfs_rfsblock_t nb,
> > > > > int64_t *deltap, xfs_agnumber_t *nagcountp)
> > > > > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > > > > index 190af11f6941..15886e2b40aa 100644
> > > > > --- a/fs/xfs/libxfs/xfs_ag.h
> > > > > +++ b/fs/xfs/libxfs/xfs_ag.h
> > > > > @@ -112,6 +112,11 @@ static inline xfs_agnumber_t pag_agno(const struct xfs_perag *pag)
> > > > > return pag->pag_group.xg_gno;
> > > > > }
> > > > > +static inline bool xfs_ag_is_active(struct xfs_perag *pag)
> > > > > +{
> > > > > + return atomic_read(&pag_group(pag)->xg_active_ref) > 0;
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Per-AG operational state. These are atomic flag bits.
> > > > > */
> > > > > @@ -140,6 +145,7 @@ void xfs_free_perag_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
> > > > > xfs_agnumber_t end_agno);
> > > > > int xfs_initialize_perag_data(struct xfs_mount *mp, xfs_agnumber_t agno);
> > > > > int xfs_update_last_ag_size(struct xfs_mount *mp, xfs_agnumber_t prev_agcount);
> > > > > +bool xfs_ag_is_empty(struct xfs_perag *pag);
> > > > > /* Passive AG references */
> > > > > static inline struct xfs_perag *
> > > > > @@ -263,6 +269,9 @@ xfs_ag_contains_log(struct xfs_mount *mp, xfs_agnumber_t agno)
> > > > > agno == XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart);
> > > > > }
> > > > > +void xfs_activate_ag(struct xfs_perag *pag);
> > > > > +void xfs_deactivate_ag(struct xfs_perag *pag);
> > > > > +
> > > > > static inline struct xfs_perag *
> > > > > xfs_perag_next_wrap(
> > > > > struct xfs_perag *pag,
> > > > > @@ -331,6 +340,7 @@ struct aghdr_init_data {
> > > > > int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id);
> > > > > int xfs_ag_shrink_space(struct xfs_perag *pag, struct xfs_trans **tpp,
> > > > > xfs_extlen_t delta);
> > > > > +void xfs_shrinkfs_remove_ag(struct xfs_mount *mp, xfs_agnumber_t agno);
> > > > > void
> > > > > xfs_growfs_get_delta(struct xfs_mount *mp, xfs_rfsblock_t nb,
> > > > > int64_t *deltap, xfs_agnumber_t *nagcountp);
> > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > > > index 000cc7f4a3ce..e16803214223 100644
> > > > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > > > @@ -3209,11 +3209,12 @@ xfs_validate_ag_length(
> > > > > if (length != mp->m_sb.sb_agblocks) {
> > > > > /*
> > > > > * During growfs, the new last AG can get here before we
> > > > > - * have updated the superblock. Give it a pass on the seqno
> > > > > - * check.
> > > > > + * have updated the superblock. During shrink, the new last AG
> > > > > + * will be updated and the AGs from newag to old AG will be
> > > > > + * removed. So seqno here maybe not be equal to
> > > > > + * mp->m_sb.sb_agcount - 1 since the super block is not yet
> > > > > + * updated globally.
> > > > > */
> > > > > - if (bp->b_pag && seqno != mp->m_sb.sb_agcount - 1)
> > > > > - return __this_address;
> > > > > if (length < XFS_MIN_AG_BLOCKS)
> > > > > return __this_address;
> > > > > if (length > mp->m_sb.sb_agblocks)
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index ba5bd6031ece..d372f65068a2 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -951,6 +951,82 @@ xfs_buf_rele(
> > > > > xfs_buf_rele_cached(bp);
> > > > > }
> > > > > +/*
> > > > > + * This function populates a list of all the cached buffers of the given AG
> > > > > + * in the to_be_free list head.
> > > > > + */
> > > > > +static void
> > > > > +xfs_pag_populate_cached_bufs(struct xfs_perag *pag,
> > > > Do both of these functions implement a buf_cache invalidation?
> > > Yes. Invalidation and freeing/removal of the xfs_buf_cache along with the
> > > cached buffers inside the xfs_buf_cache.
> > > > This one I think is xfs_buf_cache_grab_all()...
> > > Okay, I can rename this.
> > > > > + struct list_head *to_be_freed)
> > > > > +{
> > > > > + struct xfs_buf *bp;
> > > > > + struct rhashtable_iter iter;
> > > > > +
> > > > > + INIT_LIST_HEAD(to_be_freed);
> > > > Initialize the list where you declare it, please.
> > > Noted.
> > > > > + rhashtable_walk_enter(&pag->pag_bcache.bc_hash, &iter);
> > > > > + do {
> > > > > + rhashtable_walk_start(&iter);
> > > > > + while ((bp = rhashtable_walk_next(&iter)) && !IS_ERR(bp)) {
> > > > > + ASSERT(list_empty(&bp->b_list));
> > > > > + ASSERT(list_empty(&bp->b_li_list));
> > > > > + list_add_tail(&bp->b_list, to_be_freed);
> > > > > + }
> > > > > + rhashtable_walk_stop(&iter);
> > > > > + } while (cond_resched(), bp == ERR_PTR(-EAGAIN));
> > > > > + rhashtable_walk_exit(&iter);
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function frees all the cached buffers (struct xfs_buf) associated with
> > > > > + * the given offline AG. The caller must ensure that the AG which is passed
> > > > > + * is offline and completely stabilized on the disk. Also, the caller should
> > > > > + * ensure that all the cached buffers are not queued for any pending i/o
> > > > > + * i.e, the b_list for all the cached buffers are empty - since we will be using
> > > > > + * b_list to get list of all the bufs that need to be freed.
> > > > > + */
> > > > > +void
> > > > > +xfs_buf_offline_perag_rele_cached(struct xfs_perag *pag)
> > > > ...and this is xfs_buf_cache_invalidate().
> > > Okay, I can rename this. I was thinking to somehow have the word "offline"
> > > in the function name so that it is obvious that we are removing the cached
> > > bufs ONLY from the offline/deactivated AGs (at least right now).
> > "offline" is one of those annoying words that's an adjective and a verb.
> > :)
> Okay. Sure, I will rename the functions based on the above suggestions.
> >
> > > > > +{
> > > > > + ASSERT(!xfs_ag_is_active(pag));
> > > > > + /*
> > > > > + * First get the list of buffers we want to free.
> > > > > + * We need to populate to_be_freed list and cannot directly free
> > > > > + * the buffers during the hashtable walk. rhashtable_walk_start() takes
> > > > > + * an RCU and xfs_buf_rele eventually calls xfs_buf_free (for
> > > > > + * cached buffers). xfs_buf_free() might sleep (depending on the
> > > > > + * whether the buffer was allocated using vmalloc or kmalloc) and
> > > > > + * cannot be called within an RCU context. Hence we first populate
> > > > > + * the buffers within an RCU context and free them outside it.
> > > > > + */
> > > > > + struct list_head to_be_freed;
> > > > > + struct xfs_buf *bp, *tmp;
> > > > > +
> > > > > + xfs_pag_populate_cached_bufs(pag, &to_be_freed);
> > > > > + list_for_each_entry_safe(bp, tmp, &to_be_freed, b_list) {
> > > > > + list_del(&bp->b_list);
> > > > > + spin_lock(&bp->b_lock);
> > > > > + ASSERT(bp->b_pag == pag);
> > > > > + ASSERT(!xfs_buf_is_uncached(bp));
> > > > > + /*
> > > > > + * Since we have made sure that this is being called on an
> > > > > + * AG with active refcount = 0, the b_hold value of any cached
> > > > > + * buffer should not exceed 1 (i.e, the default value) and hence
> > > > > + * can be safely removed. Hence, it should also be in an
> > > > > + * unlocked state.
> > > > > + */
> > > > > + ASSERT(bp->b_hold == 1);
> > > > > + ASSERT(!xfs_buf_islocked(bp));
> > > > > + /*
> > > > > + * We should set b_lru_ref to 0 so that it gets deleted from
> > > > > + * the lru during the call to xfs_buf_rele.
> > > > > + */
> > > > > + atomic_set(&bp->b_lru_ref, 0);
> > > > > + spin_unlock(&bp->b_lock);
> > > > > + xfs_buf_rele(bp);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Lock a buffer object, if it is not already locked.
> > > > > *
> > > > > diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
> > > > > index 15fc56948346..4c7023a5afb2 100644
> > > > > --- a/fs/xfs/xfs_buf.h
> > > > > +++ b/fs/xfs/xfs_buf.h
> > > > > @@ -282,6 +282,7 @@ void xfs_buf_hold(struct xfs_buf *bp);
> > > > > /* Releasing Buffers */
> > > > > extern void xfs_buf_rele(struct xfs_buf *);
> > > > > +void xfs_buf_offline_perag_rele_cached(struct xfs_perag *pag);
> > > > > /* Locking and Unlocking Buffers */
> > > > > extern int xfs_buf_trylock(struct xfs_buf *);
> > > > > diff --git a/fs/xfs/xfs_buf_item_recover.c b/fs/xfs/xfs_buf_item_recover.c
> > > > > index d4c5cef5bc43..088a9b046af1 100644
> > > > > --- a/fs/xfs/xfs_buf_item_recover.c
> > > > > +++ b/fs/xfs/xfs_buf_item_recover.c
> > > > > @@ -737,8 +737,7 @@ xlog_recover_do_primary_sb_buffer(
> > > > > xfs_sb_from_disk(&mp->m_sb, dsb);
> > > > > if (mp->m_sb.sb_agcount < orig_agcount) {
> > > > > - xfs_alert(mp, "Shrinking AG count in log recovery not supported");
> > > > > - return -EFSCORRUPTED;
> > > > > + xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK);
> > > > > }
> > > > > if (mp->m_sb.sb_rgcount < orig_rgcount) {
> > > > > xfs_warn(mp,
> > > > > @@ -764,18 +763,28 @@ xlog_recover_do_primary_sb_buffer(
> > > > > if (error)
> > > > > return error;
> > > > > }
> > > > > -
> > > > > - /*
> > > > > - * Initialize the new perags, and also update various block and inode
> > > > > - * allocator setting based off the number of AGs or total blocks.
> > > > > - * Because of the latter this also needs to happen if the agcount did
> > > > > - * not change.
> > > > > - */
> > > > > - error = xfs_initialize_perag(mp, orig_agcount, mp->m_sb.sb_agcount,
> > > > > - mp->m_sb.sb_dblocks, &mp->m_maxagi);
> > > > > - if (error) {
> > > > > - xfs_warn(mp, "Failed recovery per-ag init: %d", error);
> > > > > - return error;
> > > > > + if (orig_agcount > mp->m_sb.sb_agcount) {
> > > > > + /*
> > > > > + * Remove the old AGs that were removed previously by a growfs
> > > > > + */
> > > > > + xfs_free_perag_range(mp, mp->m_sb.sb_agcount, orig_agcount);
> > > > > + mp->m_maxagi = xfs_set_inode_alloc(mp, mp->m_sb.sb_agcount);
> > > > > + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > > > > + } else {
> > > > > + /*
> > > > > + * Initialize the new perags, and also the update various block
> > > > > + * and inode allocator setting based off the number of AGs or
> > > > > + * total blocks.
> > > > > + * Because of the latter, this also needs to happen if the
> > > > > + * agcount did not change.
> > > > > + */
> > > > > + error = xfs_initialize_perag(mp, orig_agcount,
> > > > > + mp->m_sb.sb_agcount,
> > > > > + mp->m_sb.sb_dblocks, &mp->m_maxagi);
> > > > > + if (error) {
> > > > > + xfs_warn(mp, "Failed recovery per-ag init: %d", error);
> > > > > + return error;
> > > > > + }
> > > > > }
> > > > > mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > index da3161572735..1055681648ba 100644
> > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > @@ -676,6 +676,34 @@ xfs_extent_busy_wait_all(
> > > > > xfs_extent_busy_wait_group(rtg_group(rtg));
> > > > > }
> > > > > +/*
> > > > > + * Similar to xfs_extent_busy_wait_all() - It waits for all the busy extents to
> > > > > + * get resolved for the range of AGs provided. For now, this function is
> > > > > + * introduced to be used in online shrink process. Unlike
> > > > > + * xfs_extent_busy_wait_all(), this takes a passive reference, because this
> > > > > + * function is expected to be called for the AGs whose active reference has
> > > > > + * been reduced to 0 i.e, offline AGs.
> > > > > + *
> > > > > + * @mp - The xfs mount point
> > > > > + * @first_agno - The 0 based AG index of the range of AGs from which we will
> > > > > + * start.
> > > > > + * @end_agno - The 0 based AG index of the range of AGs from till which we will
> > > > > + * traverse.
> > > > > + */
> > > > > +void
> > > > > +xfs_extent_busy_wait_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
> > > > Range of what? Blocks?
> > > Range of AGs, i.e, waiting for the busy extents to get resolved for a range
> > > of AGs (AGs that we are trying to remove completely).
> > > > > + xfs_agnumber_t end_agno)
> > > > > +{
> > > > Perhaps this should be called xfs_extent_busy_wait_ags().
> > > Okay.
> > > > > + xfs_agnumber_t agno;
> > > > > + struct xfs_perag *pag = NULL;
> > > > > +
> > > > > + for (agno = end_agno; agno >= first_agno; agno--) {
> > > > > + pag = xfs_perag_get(mp, agno);
> > > > > + xfs_extent_busy_wait_group(pag_group(pag));
> > > > > + xfs_perag_put(pag);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Callback for list_sort to sort busy extents by the group they reside in.
> > > > > */
> > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > index f069b04e8ea1..69ae78964828 100644
> > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > @@ -57,6 +57,8 @@ bool xfs_extent_busy_trim(struct xfs_group *xg, xfs_extlen_t minlen,
> > > > > unsigned *busy_gen);
> > > > > int xfs_extent_busy_flush(struct xfs_trans *tp, struct xfs_group *xg,
> > > > > unsigned busy_gen, uint32_t alloc_flags);
> > > > > +void xfs_extent_busy_wait_range(struct xfs_mount *mp, xfs_agnumber_t first_agno,
> > > > > + xfs_agnumber_t end_agno);
> > > > > void xfs_extent_busy_wait_all(struct xfs_mount *mp);
> > > > > bool xfs_extent_busy_list_empty(struct xfs_group *xg, unsigned int *busy_gen);
> > > > > struct xfs_extent_busy_tree *xfs_extent_busy_alloc(void);
> > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > index 91da9f733659..d9d33de4e679 100644
> > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > @@ -83,6 +83,219 @@ xfs_resizefs_init_new_ags(
> > > > > return error;
> > > > > }
> > > > > +/*
> > > > > + * Get new active references for all the AGs. This might be called when
> > > > > + * shrinkage process encounters a failure at an intermediate stage after the
> > > > > + * active references of all the target AGs have become 0.
> > > > > + */
> > > > > +static void
> > > > > +xfs_shrinkfs_reactivate_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount,
> > > > > + xfs_agnumber_t nagcount)
> > > > > +{
> > > > > + struct xfs_perag *pag = NULL;
> > > > > +
> > > > > + if (nagcount >= oagcount)
> > > > > + return;
> > > > When would anyone call xfs_shrinkfs_reactivate_ags with n >= o?
> > > n >= o is invalid, that is why I am simply returning with a no-op. Or should
> > > I just have an ASSERTion here ASSERT(nagcount < oagcount);? Which one do you
> > > prefer?
> > I think an assert is fine.
> Okay.
> >
> > > > > + for (xfs_agnumber_t agno = oagcount - 1; agno > nagcount - 1; agno--) {
> > > > (Plz follow XFS coding conventions for consistency with the rest of the
> > > > codebase.)
> > > Yes. I will fix all the style related issues in the next version.
> > > > > + pag = xfs_perag_get(mp, agno);
> > > > > + xfs_activate_ag(pag);
> > > > > + xfs_perag_put(pag);
> > > > > + }
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * The function deactivates or puts the AGs to an offline mode. AG deactivation
> > > > > + * or AG offlining means that no new operation can be started on that AG. The AG
> > > > > + * still exists, however no new high level operation (like extent allocation)
> > > > > + * can be started. In terms of implementation, an AG is taken offline or is
> > > > > + * deactivated when xg_active_ref of the struct xfs_perag is 0 i.e, the number
> > > > > + * of active references becomes 0.
> > > > > + * Since active references act as a form of barrier, so once the active
> > > > > + * reference of an AG is 0, no new entity can get an active reference and in
> > > > > + * this way we ensure that once an AG is offline (i.e, active reference count is
> > > > > + * 0), no one will be able to start a new operation in it unless the active
> > > > > + * reference count is explicitly set to 1 i.e, the AG is made online/activated.
> > > > > + */
> > > > > +static int
> > > > > +xfs_shrinkfs_deactivate_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount,
> > > > > + xfs_agnumber_t nagcount)
> > > > > +{
> > > > > + int error = 0;
> > > > > + struct xfs_perag *pag = NULL;
> > > > > +
> > > > > + if (oagcount <= nagcount)
> > > > > + return 0;
> > > > > + /*
> > > > > + * If we are removing 1 or more entire AGs, we only need to take those
> > > > > + * AGs offline which we are planning to remove completely. The new tail
> > > > > + * AG which will be partially shrunk need not be taken offline - since
> > > > should not be taken offline
> > > Okay.
> > > > > + * we will be doing an online operation on them, just like any other
> > > > > + * high level operation. For complete AG removal, we need to take them
> > > > > + * offline since we cannot start any new operation on them as they will
> > > > > + * be removed eventually.
> > > > > + *
> > > > > + * However, if the number of blocks that we are trying to remove is
> > > > > + * an exact multiple of the AG size (in blocks), then the new tail AG
> > > > > + * will not be shrunk at all.
> > > > > + */
> > > > > + for (xfs_agnumber_t agno = oagcount - 1; agno > nagcount - 1; agno--) {
> > > > maybe this ought to be a for_each_perag_range_reverse() iterator macro?
> > > Okay, I can introduce this macro and use that instead in place where I have
> > > this loop. There are a couple of places I have used this loop. Thank you for
> > > the suggestion.
> > <nod>
> >
> > > > > + pag = xfs_perag_get(mp, agno);
> > > > > + xfs_deactivate_ag(pag);
> > > > > + xfs_perag_put(pag);
> > > > > + }
> > > > > + /*
> > > > > + * Now that we have deactivated/offlined the AGs, we need to make sure
> > > > > + * that all the pending operations are completed and the in-core and
> > > > > + * the on disk contents are completely in synch.
> > > > > + */
> > > > > +
> > > > > + /*
> > > > > + * Wait for all the busy extents to get resolved along with pending trim
> > > > > + * ops for all the offlined AGs.
> > > > > + */
> > > > > + xfs_extent_busy_wait_range(mp, nagcount, oagcount - 1);
> > > > > + flush_workqueue(xfs_discard_wq);
> > > > > + /*
> > > > > + * We should wait for the log to be empty and all the pending I/Os to
> > > > > + * be completed so that the AGs are compeletly stabilized before we
> > > > completely
> > > Noted.
> > > > > + * start tearing them down. xfs_log_quiesce() call here ensures that
> > > > > + * none of the future logged transactions will refer to these AGs
> > > > > + * during log recovery in case if sudden shutdown/crash happens while
> > > > > + * we are trying to remove these AGs.
> > > > > + */
> > > > > + error = xfs_log_quiesce(mp);
> > > > So I guess we're quiescing the log here so that it flushes the log to
> > > > disk, pushes the AIL so everything in the log is written back to the
> > > > filesystem, and then covers the log so that after this point, recovery
> > > > will only see log contents from this point forward -- the log won't
> > > > updates involving the perags that we just deactivated 30 lines ago?
> > > Yes, after this point, the recovery code will not see any log/update
> > > involving the deactivated AGs.
> > > > > + if (error)
> > > > > + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount);
> > > > > + /*
> > > > > + * Reactivate the log work queue which was deactivated in
> > > > > + * xfs_log_quiesce
> > > > > + */
> > > > > + xfs_log_work_queue(mp);
> > > > > + return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function does 3 things:
> > > > > + * 1. Deactivate the AGs i.e, wait for all the active references to come to 0.
> > > > > + * 2. Checks whether all the AGs that shrink process needs to remove are empty.
> > > > > + * If at least one of the target AGs is non-empty, shrink fails and
> > > > > + * xfs_shrinkfs_reactivate_ags() is called.
> > > > > + * 3. Calculates the total number of fdblocks (free data blocks) that will be
> > > > > + * removed and stores in id->nfree.
> > > > > + * Please look into the individual functions for more details and the definition
> > > > > + * of the terminologies.
> > > > > + */
> > > > > +static int
> > > > > +xfs_shrinkfs_prepare_ags(struct xfs_mount *mp, xfs_agnumber_t oagcount,
> > > > > + xfs_agnumber_t nagcount, struct aghdr_init_data *id)
> > > > > +{
> > > > > + ASSERT(nagcount < oagcount);
> > > > > + struct xfs_perag *pag = NULL;
> > > > > + xfs_agnumber_t agno;
> > > > > + int error = 0;
> > > > > + /*
> > > > > + * Deactivating/offlining the AGs i.e waiting for the active references
> > > > > + * to come down to 0.
> > > > > + */
> > > > > + error = xfs_shrinkfs_deactivate_ags(mp, oagcount, nagcount);
> > > > > + if (error)
> > > > > + return error;
> > > > > + /*
> > > > > + * At this point the AGs have been deactivated/offlined and the in-core
> > > > > + * and the on-disk are synch. So now we need to check whether all the
> > > > > + * AGs that we are trying to remove/delete are empty. Since we are not
> > > > > + * supporting partial shrink success (i.e, the entire requested size
> > > > > + * will be removed or none), we will bail out with a failure code even
> > > > > + * if 1 AG is non-empty.
> > > > > + */
> > > > > + for (agno = oagcount - 1; agno > nagcount - 1; agno--) {
> > > > > + pag = xfs_perag_get(mp, agno);
> > > > > + if (!xfs_ag_is_empty(pag)) {
> > > > > + /* Error out even if one AG is non-empty */
> > > > > + error = -ENOTEMPTY;
> > > > > + xfs_perag_put(pag);
> > > > > + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount);
> > > > > + return error;
> > > > > + }
> > > > > + /*
> > > > > + * Since these are removed, these free blocks should also be
> > > > > + * subtracted from the total list of free blocks.
> > > > > + */
> > > > > + id->nfree += (pag->pagf_freeblks + pag->pagf_flcount);
> > > > > + xfs_perag_put(pag);
> > > > > +
> > > > > + }
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function does the job of fully removing the blocks and empty AGs (
> > > > > + * depending of the values of oagcount and nagcount). By removal it means,
> > > > > + * removal of all the perag data structures, other data structures associated
> > > > > + * with it and all the perag cached buffers (when AGs are removed). Once this
> > > > > + * function succeeds, the AGs/blocks will no longer exist.
> > > > > + * The overall steps are as follows (details are in the function):
> > > > > + * - calculate the number of blocks that will be removed from the new tail AG
> > > > > + * i.e, the AG that will be shrunk partially.
> > > > > + * - call xfs_shrinkfs_remove_ag() that removes the perag cached buffers,
> > > > > + * then frees the perag reservation, other associated datastructures and
> > > > > + * finally the in-memory perag group instance.
> > > > > + */
> > > > > +static int
> > > > > +xfs_shrinkfs_remove_ags(struct xfs_mount *mp, struct xfs_trans **tp,
> > > > > + xfs_agnumber_t oagcount, xfs_agnumber_t nagcount,
> > > > > + int64_t delta_rem, xfs_agnumber_t *nagmax)
> > > > > +{
> > > > > + xfs_agnumber_t agno;
> > > > > + int error = 0;
> > > > > + struct xfs_perag *cur_pag = NULL;
> > > > > +
> > > > > + /*
> > > > > + * This loop is calculating the number of blocks that needs to be
> > > > > + * removed from the new tail AG. If delta_rem is 0 after the loop exits,
> > > > > + * then it means that the number of blocks we want to remove is a
> > > > > + * multiple of AG size (in blocks).
> > > > > + */
> > > > > + for (agno = oagcount - 1; agno > nagcount - 1; agno--) {
> > > > > + cur_pag = xfs_perag_get(mp, agno);
> > > > > + delta_rem -= xfs_ag_block_count(mp, agno);
> > > > > + xfs_perag_put(cur_pag);
> > > > > + }
> > > > > + /*
> > > > > + * We are first removing blocks from the AG that will form the new tail
> > > > > + * AG. The reason is that, if we encounter an error here, we can simply
> > > > > + * reactivate the AGs (by calling xfs_shrinkfs_reactivate_ags()).
> > > > > + * Removal of complete empty AGs always succeed anyway. However if we
> > > > > + * remove the empty AGs first (which will succeed) and then the new
> > > > > + * last AG shrink fails, then we will again have to re-initialize the
> > > > > + * removed AGs. Hence the former approach seems more efficient to me.
> > > > > + */
> > > > > + if (delta_rem) {
> > > > > + /*
> > > > > + * Remove delta_rem blocks from the AG that will form the new
> > > > > + * tail AG after the AGs are removed. If the number of blocks to
> > > > > + * be removed is a multiple of AG size, then nothing is done
> > > > > + * here.
> > > > > + */
> > > > > + cur_pag = xfs_perag_get(mp, nagcount - 1);
> > > > > + error = xfs_ag_shrink_space(cur_pag, tp, delta_rem);
> > > > > + xfs_perag_put(cur_pag);
> > > > > + if (error) {
> > > > > + xfs_shrinkfs_reactivate_ags(mp, oagcount, nagcount);
> > > > > + return error;
> > > > > + }
> > > > How do we know that the last @delta_rem fsblocks in the new tail AG are
> > > > free?
> > > The last delta_rem blocks in the tail/new-tail are removed by
> > > xfs_ag_shrink_space(). The xfs_ag_shrink_space() takes care of this. If I
> > > understand it correctly, the following check in xfs_ag_shrink_space() makes
> > > sure that the last delta_rem blocks in the tail/new-tail AG are
> > > free/unallocated:
> > >
> > > error = xfs_alloc_vextent_exact_bno(&args,
> > > xfs_agbno_to_fsb(pag, aglen - delta));
> > > if (!error && args.agbno == NULLAGBLOCK)
> > > error = -ENOSPC;
> > > if we are not able to allocate "aglen - delta"th block(AG relative fsblock
> > > number) in the pag passed to it, then this function fails and the overall
> > > shrink fails. Do you agree with this?
> > Ah yes. Thanks for doing the research. :)
>
> Thank you.
>
> --
>
> >
> > --D
> >
> > > > > + }
> > > > > + /*
> > > > > + * Now, in this final step we remove the perag instance and the
> > > > > + * associated datastructures and cached buffers. This fully removes the
> > > > > + * AG.
> > > > > + */
> > > > > + for (agno = oagcount - 1; agno > nagcount - 1; agno--)
> > > > > + xfs_shrinkfs_remove_ag(mp, agno);
> > > > > + *nagmax = xfs_set_inode_alloc(mp, nagcount);
> > > > > + return error;
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * growfs operations
> > > > > */
> > > > > @@ -101,7 +314,6 @@ xfs_growfs_data_private(
> > > > > bool lastag_extended = false;
> > > > > struct xfs_trans *tp;
> > > > > struct aghdr_init_data id = {};
> > > > > - struct xfs_perag *last_pag;
> > > > > error = xfs_sb_validate_fsb_count(&mp->m_sb, nb);
> > > > > if (error)
> > > > > @@ -122,7 +334,6 @@ xfs_growfs_data_private(
> > > > > if (error)
> > > > > return error;
> > > > > xfs_growfs_get_delta(mp, nb, &delta, &nagcount);
> > > > > -
> > > > > /*
> > > > > * Reject filesystems with a single AG because they are not
> > > > > * supported, and reject a shrink operation that would cause a
> > > > > @@ -135,9 +346,11 @@ xfs_growfs_data_private(
> > > > > if (delta == 0)
> > > > > return 0;
> > > > > - /* TODO: shrinking the entire AGs hasn't yet completed */
> > > > > - if (nagcount < oagcount)
> > > > > - return -EINVAL;
> > > > > + if (nagcount < oagcount) {
> > > > > + error = xfs_shrinkfs_prepare_ags(mp, oagcount, nagcount, &id);
> > > > > + if (error)
> > > > > + return error;
> > > > > + }
> > > > > /* allocate the new per-ag structures */
> > > > > error = xfs_initialize_perag(mp, oagcount, nagcount, nb, &nagimax);
> > > > > @@ -154,15 +367,16 @@ xfs_growfs_data_private(
> > > > > if (error)
> > > > > goto out_free_unused_perag;
> > > > > - last_pag = xfs_perag_get(mp, oagcount - 1);
> > > > > if (delta > 0) {
> > > > > + struct xfs_perag *last_pag = xfs_perag_get(mp, oagcount - 1);
> > > > Blank line between declarations and code please.
> > > Noted.
> > >
> > > --NR
> > >
> > > > --D
> > > >
> > > > > error = xfs_resizefs_init_new_ags(tp, &id, oagcount, nagcount,
> > > > > delta, last_pag, &lastag_extended);
> > > > > + xfs_perag_put(last_pag);
> > > > > } else {
> > > > > xfs_warn_experimental(mp, XFS_EXPERIMENTAL_SHRINK);
> > > > > - error = xfs_ag_shrink_space(last_pag, &tp, -delta);
> > > > > + error = xfs_shrinkfs_remove_ags(mp, &tp, oagcount, nagcount,
> > > > > + -delta, &nagimax);
> > > > > }
> > > > > - xfs_perag_put(last_pag);
> > > > > if (error)
> > > > > goto out_trans_cancel;
> > > > > @@ -171,12 +385,14 @@ xfs_growfs_data_private(
> > > > > * seen by the rest of the world until the transaction commit applies
> > > > > * them atomically to the superblock.
> > > > > */
> > > > > - if (nagcount > oagcount)
> > > > > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> > > > > + if (nagcount != oagcount)
> > > > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT,
> > > > > + (int64_t)nagcount - (int64_t)oagcount);
> > > > > if (delta)
> > > > > xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
> > > > > if (id.nfree)
> > > > > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS, id.nfree);
> > > > > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_FDBLOCKS,
> > > > > + delta > 0 ? id.nfree : (int64_t)-id.nfree);
> > > > > /*
> > > > > * Sync sb counters now to reflect the updated values. This is
> > > > > @@ -190,10 +406,11 @@ xfs_growfs_data_private(
> > > > > error = xfs_trans_commit(tp);
> > > > > if (error)
> > > > > return error;
> > > > > -
> > > > > /* New allocation groups fully initialized, so update mount struct */
> > > > > if (nagimax)
> > > > > mp->m_maxagi = nagimax;
> > > > > + if (nagcount < oagcount)
> > > > > + mp->m_ag_prealloc_blocks = xfs_prealloc_blocks(mp);
> > > > > xfs_set_low_space_thresholds(mp);
> > > > > mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > > index b4a07af513ba..e94d96db5383 100644
> > > > > --- a/fs/xfs/xfs_trans.c
> > > > > +++ b/fs/xfs/xfs_trans.c
> > > > > @@ -438,7 +438,6 @@ xfs_trans_mod_sb(
> > > > > tp->t_dblocks_delta += delta;
> > > > > break;
> > > > > case XFS_TRANS_SB_AGCOUNT:
> > > > > - ASSERT(delta > 0);
> > > > > tp->t_agcount_delta += delta;
> > > > > break;
> > > > > case XFS_TRANS_SB_IMAXPCT:
> > > > > --
> > > > > 2.43.5
> > > > >
> > > > >
> > > --
> > > Nirjhar Roy
> > > Linux Kernel Developer
> > > IBM, Bangalore
> > >
> > >
> --
> Nirjhar Roy
> Linux Kernel Developer
> IBM, Bangalore
>
next prev parent reply other threads:[~2025-08-14 18:31 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-17 10:30 [RFC 0/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM)
2025-07-17 10:30 ` [RFC 1/3] xfs: Re-introduce xg_active_wq field in struct xfs_group Nirjhar Roy (IBM)
2025-07-29 20:26 ` Darrick J. Wong
2025-07-30 6:14 ` Nirjhar Roy (IBM)
2025-07-17 10:30 ` [RFC 2/3] xfs: Refactoring the nagcount and delta calculation Nirjhar Roy (IBM)
2025-07-29 20:24 ` Darrick J. Wong
2025-07-30 5:04 ` Nirjhar Roy (IBM)
2025-07-17 10:30 ` [RFC 3/3] xfs: Add support to shrink multiple empty AGs Nirjhar Roy (IBM)
2025-07-29 21:05 ` Darrick J. Wong
2025-07-30 18:36 ` Nirjhar Roy (IBM)
2025-08-12 19:07 ` Darrick J. Wong
2025-08-13 8:04 ` Nirjhar Roy (IBM)
2025-08-14 18:31 ` Darrick J. Wong [this message]
2025-08-19 5:56 ` Nirjhar Roy (IBM)
2025-07-30 4:55 ` Ritesh Harjani
2025-07-30 11:53 ` Nirjhar Roy (IBM)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250814183139.GE7981@frogsfrogsfrogs \
--to=djwong@kernel.org \
--cc=bfoster@redhat.com \
--cc=david@fromorbit.com \
--cc=hsiangkao@linux.alibaba.com \
--cc=linux-xfs@vger.kernel.org \
--cc=nirjhar.roy.lists@gmail.com \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).