From: Gao Xiang <hsiangkao@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH 4/4] xfs: support shrinking empty AGs
Date: Thu, 15 Apr 2021 13:22:26 +0800 [thread overview]
Message-ID: <20210415052226.GC1864610@xiangao.remote.csb> (raw)
In-Reply-To: <20210415042549.GM63242@dread.disaster.area>
Hi Dave,
On Thu, Apr 15, 2021 at 02:25:49PM +1000, Dave Chinner wrote:
> On Thu, Apr 15, 2021 at 03:52:40AM +0800, Gao Xiang wrote:
> > Roughly, freespace can be shrinked atomicly with the following steps:
> >
> > - make sure the pending-for-discard AGs are all stablized as empty;
> > - a transaction to
> > fix up freespace btrees for the target tail AG;
> > decrease agcount to the target value.
> >
> > In order to make sure such AGs are empty, first to mark them as
> > inactive, then check if these AGs are all empty one-by-one. Also
> > need to log force, complete all discard operations together.
> >
> > Even it's needed to drain out all related cached buffers in case of
> > corrupt fs if growfs again, so ail items are all needed to be pushed
> > out as well.
> >
> > Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
> > ---
> > fs/xfs/libxfs/xfs_ag.c | 1 -
> > fs/xfs/libxfs/xfs_ag.h | 2 +-
> > fs/xfs/xfs_fsops.c | 148 ++++++++++++++++++++++++++++++++++++++---
> > fs/xfs/xfs_mount.c | 87 +++++++++++++++++++-----
> > fs/xfs/xfs_trans.c | 1 -
> > 5 files changed, 210 insertions(+), 29 deletions(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c
> > index ba5702e5c9ad..eb370fbae192 100644
> > --- a/fs/xfs/libxfs/xfs_ag.c
> > +++ b/fs/xfs/libxfs/xfs_ag.c
> > @@ -512,7 +512,6 @@ xfs_ag_shrink_space(
> > struct xfs_agf *agf;
> > int error, err2;
> >
> > - ASSERT(agno == mp->m_sb.sb_agcount - 1);
> > error = xfs_ialloc_read_agi(mp, *tpp, agno, &agibp);
> > if (error)
> > return error;
> > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > index 4535de1d88ea..7031e2c7ef66 100644
> > --- a/fs/xfs/libxfs/xfs_ag.h
> > +++ b/fs/xfs/libxfs/xfs_ag.h
> > @@ -15,7 +15,7 @@ struct aghdr_init_data {
> > xfs_agblock_t agno; /* ag to init */
> > xfs_extlen_t agsize; /* new AG size */
> > struct list_head buffer_list; /* buffer writeback list */
> > - xfs_rfsblock_t nfree; /* cumulative new free space */
> > + int64_t nfree; /* cumulative new free space */
> >
> > /* per header data */
> > xfs_daddr_t daddr; /* header location */
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index b33c894b6cf3..659ca1836bba 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -14,11 +14,14 @@
> > #include "xfs_trans.h"
> > #include "xfs_error.h"
> > #include "xfs_alloc.h"
> > +#include "xfs_ialloc.h"
> > +#include "xfs_extent_busy.h"
> > #include "xfs_fsops.h"
> > #include "xfs_trans_space.h"
> > #include "xfs_log.h"
> > #include "xfs_ag.h"
> > #include "xfs_ag_resv.h"
> > +#include "xfs_trans_priv.h"
> >
> > /*
> > * Write new AG headers to disk. Non-transactional, but need to be
> > @@ -78,6 +81,112 @@ xfs_resizefs_init_new_ags(
> > return error;
> > }
> >
> > +static int
> > +xfs_shrinkfs_deactivate_ags(
> > + struct xfs_mount *mp,
> > + xfs_agnumber_t oagcount,
> > + xfs_agnumber_t nagcount)
> > +{
> > + xfs_agnumber_t agno;
> > + int error;
> > +
> > + /* confirm AGs pending for shrinking are all inactive */
> > + for (agno = nagcount; agno < oagcount; ++agno) {
> > + struct xfs_buf *agfbp, *agibp;
> > + struct xfs_perag *pag = xfs_perag_get(mp, agno);
> > +
> > + down_write(&pag->pag_inactive_rwsem);
> > + /* need to lock agi, agf buffers here to close all races */
> > + error = xfs_read_agi(mp, NULL, agno, &agibp);
> > + if (!error) {
> > + error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agfbp);
> > + if (!error) {
> > + pag->pag_inactive = true;
> > + xfs_buf_relse(agfbp);
> > + }
> > + xfs_buf_relse(agibp);
> > + }
> > + up_write(&pag->pag_inactive_rwsem);
> > + xfs_perag_put(pag);
> > + if (error)
> > + break;
> > + }
> > + return error;
> > +}
>
> Hmmmm. Ok, that's why the first patch had the specific locking
> pattern it had, because once the AGI is locked under the
> inactive_rwsem. This seems ... fragile. It relies on the code
> looking up the perag to check the pag->pag_inactive flag before it
> takes an AGF or AGI lock, but does not allow a caller than has
> an AGI or AGF locked to take the inactive_sem to check if the per-ag
> is inactive or not. It's a one-way locking mechanism...
It guarantees that when AGF, AGI locked, pag_inactive won't be
switched, and before taking AGF or AGI, pag_inactive_sem should
be taken to confirm AGF, AGI can be read. That is the way that
I can think out with much less invasion than touch more XFS
codebase....
>
> I'd much prefer active/passive references here, and the order of
> AG inactivation is highest to lowest...
>
> static void
> xfs_shrinkfs_deactivate_ags(
> struct xfs_mount *mp,
> xfs_agnumber_t oagcount,
> xfs_agnumber_t nagcount)
> {
> xfs_agnumber_t agno;
>
> for (agno = oagcount - 1; agno >= nagcount; agno--) {
> struct xfs_perag *pag = xfs_perag_get(mp, agno);
>
> /* drop active reference */
> xfs_perag_put_active(pag);
>
> /* wait for pag->pag_active_refs to hit zero */
> .....
>
> xfs_perag_put(pag);
> }
> }
>
> At this point, we know there are going to be no new racing accesses
> to the perags...
>
>
> > +static void
> > +xfs_shrinkfs_activate_ags(
> > + struct xfs_mount *mp,
> > + xfs_agnumber_t oagcount,
> > + xfs_agnumber_t nagcount)
> > +{
> > + xfs_agnumber_t agno;
> > +
> > + for (agno = nagcount; agno < oagcount; ++agno) {
> > + struct xfs_perag *pag = xfs_perag_get(mp, agno);
> > +
> > + down_write(&pag->pag_inactive_rwsem);
> > + pag->pag_inactive = false;
> > + up_write(&pag->pag_inactive_rwsem);
> > + }
> > +}
>
> static void
> xfs_shrinkfs_reactivate_ags(
> struct xfs_mount *mp,
> xfs_agnumber_t oagcount,
> xfs_agnumber_t nagcount)
> {
> xfs_agnumber_t agno;
>
> for (agno = oagcount - 1; agno >= nagcount; agno--) {
> struct xfs_perag *pag = xfs_perag_get(mp, agno);
>
> /* get a new active reference for the mount */
> atomic_inc(&pag->pag_active_ref);
>
> xfs_perag_put(pag);
> }
> }
>
>
> > +
> > +static int
> > +xfs_shrinkfs_prepare_ags(
> > + struct xfs_mount *mp,
> > + struct aghdr_init_data *id,
> > + xfs_agnumber_t oagcount,
> > + xfs_agnumber_t nagcount)
> > +{
> > + xfs_agnumber_t agno;
> > + int error;
> > +
> > + error = xfs_shrinkfs_deactivate_ags(mp, oagcount, nagcount);
> > + if (error)
> > + goto err_out;
>
> Waiting for active references to drain means this can't fail.
>
> > +
> > + /* confirm AGs pending for shrinking are all empty */
> > + for (agno = nagcount; agno < oagcount; ++agno) {
>
> Again, needs to work from last AG back to first.
ok.
>
> > + struct xfs_buf *agfbp;
> > + struct xfs_perag *pag;
> > +
> > + error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agfbp);
> > + if (error)
> > + goto err_out;
> > +
> > + pag = agfbp->b_pag;
> > + error = xfs_ag_resv_free(pag);
> > + if (!error) {
> > + error = xfs_ag_is_empty(agfbp);
> > + if (!error) {
> > + ASSERT(!pag->pagf_flcount);
> > + id->nfree -= pag->pagf_freeblks;
> > + }
> > + }
>
> Please don't nest "if (!error)" statements like this. It results in
> excessive indent in the code, and it makes it harder to determine
> what the actual error handling for failure is.
ok.
>
> > + xfs_buf_relse(agfbp);
> > + if (error)
> > + goto err_out;
> > + }
> > + xfs_log_force(mp, XFS_LOG_SYNC);
>
> What does this do,
It just makes sure that no ongoing write transactions before tearing
down AGs. The reason it was here about AGFL drain, but since it's a
sync transaction, so I think it should be raised up.
> and why is it not needed before we try to free
> reservations and determine if the AG is empty?
Yeah, but it can only cause false nagative, anyway, I will raise it
up.
>
> > + /*
> > + * Wait for all busy extents to be freed, including completion of
> > + * any discard operation.
> > + */
> > + xfs_extent_busy_wait_all(mp);
> > + flush_workqueue(xfs_discard_wq);
>
> Shouldn't this happen before we start trying to tear down the AGs?
May I ask what's your suggestted place? since the AGs are already
inactive here.
>
> > +
> > + /*
> > + * Also need to drain out all related cached buffers, at least,
> > + * in case of growfs back later (which uses uncached buffers.)
> > + */
> > + xfs_ail_push_all_sync(mp->m_ail);
> > + xfs_buftarg_drain(mp->m_ddev_targp);
>
> Urk, no, this can livelock on active filesystems.
>
> What you want to do is drain the per-ag buffer cache, not the global
> filesystem LRU. Given that, at this point, all the buffers still
> cached in the per-ag should have zero references to them, a walk of
> the rbtree taking a reference to each buffer, marking it stale and
> then calling xfs_buf_rele() on it should be sufficient to free all
> the buffers in the AG and release all the remaining passive
> references to the struct perag for the AG.
>
I understand the issue, yeah, it'd be much better to use
xfs_buftarg_drain() here. Thanks for your suggestion about this.
> At this point, we can remove the perag from the m_perag radix tree,
> do the final teardown on it, and free if via call_rcu()....
I still think active/passive reference approach causes a lot of
random modification all over the whole XFS codebase since it
assumes current perag won't be removed/freed even reference count
reaches zero, adding new active reference counts in principle
sounds better yet a bit far away from the current XFS codebase
status.
Thanks,
Gao Xiang
>
> > + return error;
> > +err_out:
> > + xfs_shrinkfs_activate_ags(mp, oagcount, nagcount);
> > + return error;
> > +}
> > +
> > /*
> > * growfs operations
> > */
> > @@ -93,7 +202,7 @@ xfs_growfs_data_private(
> > xfs_rfsblock_t nb, nb_div, nb_mod;
> > int64_t delta;
> > bool lastag_extended;
> > - xfs_agnumber_t oagcount;
> > + xfs_agnumber_t oagcount, agno;
> > struct xfs_trans *tp;
> > struct aghdr_init_data id = {};
> >
> > @@ -130,14 +239,13 @@ xfs_growfs_data_private(
> > oagcount = mp->m_sb.sb_agcount;
> >
> > /* allocate the new per-ag structures */
> > - if (nagcount > oagcount) {
> > + if (nagcount > oagcount)
> > error = xfs_initialize_perag(mp, nagcount, &nagimax);
> > - if (error)
> > - return error;
> > - } else if (nagcount < oagcount) {
> > - /* TODO: shrinking the entire AGs hasn't yet completed */
> > - return -EINVAL;
> > - }
> > + else if (nagcount < oagcount)
> > + error = xfs_shrinkfs_prepare_ags(mp, &id, oagcount, nagcount);
> > +
> > + if (error)
> > + return error;
> >
> > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
> > (delta > 0 ? XFS_GROWFS_SPACE_RES(mp) : -delta), 0,
> > @@ -151,13 +259,29 @@ xfs_growfs_data_private(
> > } else {
> > static struct ratelimit_state shrink_warning = \
> > RATELIMIT_STATE_INIT("shrink_warning", 86400 * HZ, 1);
> > + xfs_agblock_t agdelta;
> > +
> > ratelimit_set_flags(&shrink_warning, RATELIMIT_MSG_ON_RELEASE);
> >
> > if (__ratelimit(&shrink_warning))
> > xfs_alert(mp,
> > "EXPERIMENTAL online shrink feature in use. Use at your own risk!");
> >
> > - error = xfs_ag_shrink_space(mp, &tp, nagcount - 1, -delta);
> > + for (agno = nagcount; agno < oagcount; ++agno) {
> > + struct xfs_perag *pag = xfs_perag_get(mp, agno);
> > +
> > + pag->pagf_freeblks = 0;
> > + pag->pagf_longest = 0;
> > + xfs_perag_put(pag);
> > + }
> > +
> > + xfs_trans_agblocks_delta(tp, id.nfree);
> > +
> > + if (nagcount != oagcount)
> > + agdelta = nagcount * mp->m_sb.sb_agblocks - nb;
> > + else
> > + agdelta = -delta;
> > + error = xfs_ag_shrink_space(mp, &tp, nagcount - 1, agdelta);
> > }
> > if (error)
> > goto out_trans_cancel;
> > @@ -167,8 +291,10 @@ xfs_growfs_data_private(
> > * seen by the rest of the world until the transaction commit applies
> > * them atomically to the superblock.
> > */
> > - if (nagcount > oagcount)
> > - xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT, nagcount - oagcount);
> > + if (nagcount != oagcount)
> > + xfs_trans_mod_sb(tp, XFS_TRANS_SB_AGCOUNT,
> > + (int64_t)nagcount - (int64_t)oagcount);
> > +
> > if (delta)
> > xfs_trans_mod_sb(tp, XFS_TRANS_SB_DBLOCKS, delta);
> > if (id.nfree)
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 69a60b5f4a32..ca9513fdc09e 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -172,6 +172,47 @@ xfs_sb_validate_fsb_count(
> > return 0;
> > }
> >
> > +static int
> > +xfs_perag_reset(
> > + struct xfs_perag *pag)
> > +{
> > + int error;
> > +
> > + spin_lock_init(&pag->pag_ici_lock);
> > + INIT_DELAYED_WORK(&pag->pag_blockgc_work, xfs_blockgc_worker);
> > + INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
> > +
> > + error = xfs_buf_hash_init(pag);
> > + if (error)
> > + return error;
> > +
> > + init_waitqueue_head(&pag->pagb_wait);
> > + spin_lock_init(&pag->pagb_lock);
> > + pag->pagb_count = 0;
> > + pag->pagb_tree = RB_ROOT;
> > +
> > + error = xfs_iunlink_init(pag);
> > + if (error) {
> > + xfs_buf_hash_destroy(pag);
> > + return error;
> > + }
> > + spin_lock_init(&pag->pag_state_lock);
> > + return 0;
> > +}
> > +
> > +static int
> > +xfs_perag_inactive_reset(
> > + struct xfs_perag *pag)
> > +{
> > + cancel_delayed_work_sync(&pag->pag_blockgc_work);
> > + xfs_iunlink_destroy(pag);
> > + xfs_buf_hash_destroy(pag);
> > +
> > + memset((char *)pag + offsetof(struct xfs_perag, pag_inactive), 0,
> > + sizeof(*pag) - offsetof(struct xfs_perag, pag_inactive));
> > + return xfs_perag_reset(pag);
> > +}
> > +
> > int
> > xfs_initialize_perag(
> > xfs_mount_t *mp,
> > @@ -180,6 +221,8 @@ xfs_initialize_perag(
> > {
> > xfs_agnumber_t index;
> > xfs_agnumber_t first_initialised = NULLAGNUMBER;
> > + xfs_agnumber_t first_inactive = NULLAGNUMBER;
> > + xfs_agnumber_t last_inactive = NULLAGNUMBER;
> > xfs_perag_t *pag;
> > int error = -ENOMEM;
> >
> > @@ -191,6 +234,20 @@ xfs_initialize_perag(
> > for (index = 0; index < agcount; index++) {
> > pag = xfs_perag_get(mp, index);
> > if (pag) {
> > + down_write(&pag->pag_inactive_rwsem);
> > + if (pag->pag_inactive) {
> > + error = xfs_perag_inactive_reset(pag);
> > + if (error) {
> > + pag->pag_inactive = true;
> > + up_write(&pag->pag_inactive_rwsem);
> > + xfs_perag_put(pag);
> > + goto out_unwind_new_pags;
> > + }
> > + if (first_inactive == NULLAGNUMBER)
> > + first_inactive = index;
> > + last_inactive = index;
> > + }
> > + up_write(&pag->pag_inactive_rwsem);
>
> This should all go away if the perags have already been removed from
> the tree. In fact, you shouldn't need to call xfs_initialize_perag()
> at all for the shrink case that removes whole AGs....
>
> CHeers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
next prev parent reply other threads:[~2021-04-15 5:22 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-14 19:52 [RFC PATCH 0/4] xfs: support shrinking empty AGs Gao Xiang
2021-04-14 19:52 ` [RFC PATCH 1/4] xfs: support deactivating AGs Gao Xiang
2021-04-15 3:42 ` Dave Chinner
2021-04-15 4:28 ` Gao Xiang
2021-04-15 6:28 ` Dave Chinner
2021-04-15 7:08 ` Gao Xiang
2021-04-15 8:44 ` Dave Chinner
2021-04-14 19:52 ` [RFC PATCH 2/4] xfs: check ag is empty Gao Xiang
2021-04-15 3:52 ` Dave Chinner
2021-04-15 4:34 ` Gao Xiang
2021-04-14 19:52 ` [RFC PATCH 3/4] xfs: introduce max_agcount Gao Xiang
2021-04-15 3:59 ` Dave Chinner
2021-04-14 19:52 ` [RFC PATCH 4/4] xfs: support shrinking empty AGs Gao Xiang
2021-04-15 4:25 ` Dave Chinner
2021-04-15 5:22 ` Gao Xiang [this message]
2021-04-15 8:33 ` Dave Chinner
2021-04-15 17:00 ` Darrick J. Wong
2021-04-15 21:24 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210415052226.GC1864610@xiangao.remote.csb \
--to=hsiangkao@redhat.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox