From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
Date: Thu, 7 Feb 2019 08:33:13 -0800 [thread overview]
Message-ID: <20190207163313.GC7991@magnolia> (raw)
In-Reply-To: <20190207143114.GB2880@bfoster>
On Thu, Feb 07, 2019 at 09:31:15AM -0500, Brian Foster wrote:
> On Wed, Feb 06, 2019 at 08:55:36AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Use a rhashtable to cache the unlinked list incore. This should speed
> > up unlinked processing considerably when there are a lot of inodes on
> > the unlinked list because iunlink_remove no longer has to traverse an
> > entire bucket list to find which inode points to the one being removed.
> >
> > The incore list structure records "X.next_unlinked = Y" relations, with
> > the rhashtable using Y to index the records. This makes finding the
> > inode X that points to a inode Y very quick. If our cache fails to find
> > anything we can always fall back on the old method.
> >
> > FWIW this drastically reduces the amount of time it takes to remove
> > inodes from the unlinked list. I wrote a program to open a lot of
> > O_TMPFILE files and then close them in the same order, which takes
> > a very long time if we have to traverse the unlinked lists. With the
> > ptach, I see:
> >
> ...
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
>
> A few minor comments below. With those addressed, this otherwise looks
> pretty good to me:
>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> > fs/xfs/libxfs/xfs_errortag.h | 4 -
> > fs/xfs/xfs_error.c | 3
> > fs/xfs/xfs_inode.c | 269 +++++++++++++++++++++++++++++++++++++++++-
> > fs/xfs/xfs_inode.h | 3
> > fs/xfs/xfs_mount.c | 5 +
> > fs/xfs/xfs_mount.h | 7 +
> > fs/xfs/xfs_trace.h | 1
> > 7 files changed, 285 insertions(+), 7 deletions(-)
> >
> >
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 53a9c8c26d18..fdfcb3a9bac7 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1906,6 +1906,194 @@ xfs_inactive(
> > xfs_qm_dqdetach(ip);
> > }
> >
> ...
> > +/*
> > + * Remember that @prev_agino.next_unlinked = @this_agino. Fail loudly if there
> > + * already was an entry, but absorb any other runtime errors.
> > + */
> > +int
> > +xfs_iunlink_add_backref(
> > + struct xfs_perag *pag,
> > + xfs_agino_t prev_agino,
> > + xfs_agino_t this_agino)
> > +{
> > + struct xfs_iunlink *iu;
> > + int error;
> > +
> > + if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > + return 0;
> > +
> > + iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > + iu->iu_agino = prev_agino;
> > + iu->iu_next_unlinked = this_agino;
> > +
> > + error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error == 0 || error == -EEXIST)
> > + return error;
> > + return 0;
>
> The return val looks funny at a glance: return -EEXIST or otherwise
> return 0 for success and all other errors (also the error == 0 looks
> superfluous). I see the comment above describes what this does, but it
> doesn't explain why. I'd move the comment to above the if check and
> explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> because fallback blah blah. We care about -EEXIST because ...").
Will do.
> Also I assume we need to free the iu object on insert failure,
> regardless of whether we ultimately return the error.
Oops, yes, good catch!
> > +}
> > +
> > +/*
> > + * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> > + * If @next_unlinked is NULLAGINO, we drop the backref and exit. If there
> > + * wasn't any such entry then we don't bother.
> > + */
> > +int
> > +xfs_iunlink_change_backref(
> > + struct xfs_perag *pag,
> > + xfs_agino_t agino,
> > + xfs_agino_t next_unlinked)
> > +{
> > + struct xfs_iunlink *iu;
> > + int error;
> > +
> > + /* Look up the old entry; if there wasn't one then exit. */
> > + iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> > + xfs_iunlink_hash_params);
> > + if (!iu)
> > + return 0;
> > +
> > + /*
> > + * Remove the entry. This shouldn't ever return an error, but if we
> > + * couldn't remove the old entry we don't want to add it again to the
> > + * hash table, and if the entry disappeared on us then someone's
> > + * violated the locking rules and we need to fail loudly.
> > + */
> > + error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error)
> > + return error;
>
> What's interesting about remove failure is that if we otherwise found an
> iu for this inode, removing the inode from the unlinked list leaves a
> stale/incorrect iu around in the in-core table. So it's one thing for
> the in-core table to be a valid subset of the on-disk structures, but
> another thing entirely to have incoherence between the two. It might be
> worth pointing out that it's critical to return an error here so we
> don't actually remove the inode (whereas the re-add below is less
> strict).
Ok.
> > +
> > + /* If there is no new next entry just free our item and return. */
> > + if (next_unlinked == NULLAGINO) {
> > + kmem_free(iu);
> > + return 0;
> > + }
> > +
> > + /*
> > + * Update the hash table to the new entry, ignoring operational
> > + * errors.
> > + */
> > + iu->iu_next_unlinked = next_unlinked;
> > + error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error == 0 || error == -EEXIST)
> > + return error;
> > + return 0;
>
> Similar error == 0 thing and potential memory leak here.
Refactored into a common helper to capture all the logic and
documentation.
> > +}
> > +
> ...
> > @@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
> >
> > ASSERT(head_agino != target_agino);
> >
> > + /* See if our backref cache can find it faster. */
> > + *agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > + if (*agino != NULLAGINO) {
> > + error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> > + if (error)
> > + return error;
> > +
> > + if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> > + return 0;
> > +
> > + /*
> > + * If we get here the cache contents were corrupt, so drop the
> > + * buffer and fall back to walking the bucket list.
> > + */
> > + xfs_trans_brelse(tp, *bpp);
> > + *bpp = NULL;
>
> I like the logic to ensure we don't screw around with an inode that
> doesn't actually point to our target, but I do wonder whether we should
> do a little more than silently ignore the fact we found an incoherent
> backref. Even just a WARN_ON[_ONCE]() or something here to help us
> detect this case during testing might be sufficient..?
Yeah, that's a good idea. The cache can miss, but it shouldn't ever be
corrupt.
--D
>
> Brian
>
> > + }
> > +
> > + trace_xfs_iunlink_map_prev_fallback(mp, agno);
> > +
> > + /* Otherwise, walk the entire bucket until we find it. */
> > next_agino = head_agino;
> > while (next_agino != target_agino) {
> > xfs_agino_t unlinked_agino;
> > @@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
> > struct xfs_buf *agibp;
> > struct xfs_buf *last_ibp = NULL;
> > struct xfs_dinode *last_dip = NULL;
> > + struct xfs_perag *pag = NULL;
> > xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> > xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> > xfs_agino_t next_agino;
> > @@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
> > if (error)
> > return error;
> >
> > + /*
> > + * If there was a backref pointing from the next inode back to this
> > + * one, remove it because we've removed this inode from the list.
> > + *
> > + * Later, if this inode was in the middle of the list we'll update
> > + * this inode's backref to point from the next inode.
> > + */
> > + if (next_agino != NULLAGINO) {
> > + pag = xfs_perag_get(mp, agno);
> > + error = xfs_iunlink_change_backref(pag, next_agino,
> > + NULLAGINO);
> > + if (error)
> > + goto out;
> > + }
> > +
> > if (head_agino == agino) {
> > /* Point the head of the list to the next unlinked inode. */
> > error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> > next_agino);
> > if (error)
> > - return error;
> > + goto out;
> > } else {
> > struct xfs_imap imap;
> > xfs_agino_t prev_agino;
> >
> > + if (!pag)
> > + pag = xfs_perag_get(mp, agno);
> > +
> > /* We need to search the list for the inode being freed. */
> > error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> > - &prev_agino, &imap, &last_dip, &last_ibp);
> > + &prev_agino, &imap, &last_dip, &last_ibp,
> > + pag);
> > if (error)
> > - return error;
> > + goto out;
> >
> > /* Point the previous inode on the list to the next inode. */
> > xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> > last_dip, &imap, next_agino);
> > +
> > + /*
> > + * Now we deal with the backref for this inode. If this inode
> > + * pointed at a real inode, change the backref that pointed to
> > + * us to point to our old next. If this inode was the end of
> > + * the list, delete the backref that pointed to us. Note that
> > + * change_backref takes care of deleting the backref if
> > + * next_agino is NULLAGINO.
> > + */
> > + error = xfs_iunlink_change_backref(pag, agino, next_agino);
> > + if (error)
> > + goto out;
> > }
> > - return 0;
> > +
> > +out:
> > + if (pag)
> > + xfs_perag_put(pag);
> > + return error;
> > }
> >
> > /*
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index be2014520155..e62074a5257c 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -500,4 +500,7 @@ extern struct kmem_zone *xfs_inode_zone;
> >
> > bool xfs_inode_verify_forks(struct xfs_inode *ip);
> >
> > +int xfs_iunlink_init(struct xfs_perag *pag);
> > +void xfs_iunlink_destroy(struct xfs_perag *pag);
> > +
> > #endif /* __XFS_INODE_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b4d8c318be3c..fd63b0b1307c 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -149,6 +149,7 @@ xfs_free_perag(
> > spin_unlock(&mp->m_perag_lock);
> > ASSERT(pag);
> > ASSERT(atomic_read(&pag->pag_ref) == 0);
> > + xfs_iunlink_destroy(pag);
> > xfs_buf_hash_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > call_rcu(&pag->rcu_head, __xfs_free_perag);
> > @@ -227,6 +228,9 @@ xfs_initialize_perag(
> > /* first new pag is fully initialized */
> > if (first_initialised == NULLAGNUMBER)
> > first_initialised = index;
> > + error = xfs_iunlink_init(pag);
> > + if (error)
> > + goto out_hash_destroy;
> > }
> >
> > index = xfs_set_inode_alloc(mp, agcount);
> > @@ -249,6 +253,7 @@ xfs_initialize_perag(
> > if (!pag)
> > break;
> > xfs_buf_hash_destroy(pag);
> > + xfs_iunlink_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > kmem_free(pag);
> > }
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 7daafe064af8..a33f45077867 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -396,6 +396,13 @@ typedef struct xfs_perag {
> >
> > /* reference count */
> > uint8_t pagf_refcount_level;
> > +
> > + /*
> > + * Unlinked inode information. This incore information reflects
> > + * data stored in the AGI, so callers must hold the AGI buffer lock
> > + * or have some other means to control concurrency.
> > + */
> > + struct rhashtable pagi_unlinked_hash;
> > } xfs_perag_t;
> >
> > static inline struct xfs_ag_resv *
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a6e384a642b1..c83ce022a355 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
> > TP_ARGS(ip))
> > DEFINE_AGINODE_EVENT(xfs_iunlink);
> > DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> > +DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
> >
> > #endif /* _TRACE_XFS_H */
> >
> >
next prev parent reply other threads:[~2019-02-07 16:33 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
2019-02-06 16:54 ` [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
2019-02-06 16:54 ` [PATCH 3/9] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
2019-02-06 16:55 ` [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
2019-02-06 16:55 ` [PATCH 5/9] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
2019-02-07 14:28 ` Brian Foster
2019-02-07 16:19 ` Darrick J. Wong
2019-02-08 7:57 ` Christoph Hellwig
2019-02-06 16:55 ` [PATCH 7/9] xfs: refactor inode update in iunlink_remove Darrick J. Wong
2019-02-06 16:55 ` [PATCH 8/9] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
2019-02-07 14:31 ` Brian Foster
2019-02-07 16:33 ` Darrick J. Wong [this message]
2019-02-08 8:00 ` Christoph Hellwig
2019-02-08 12:06 ` Brian Foster
2019-02-08 16:54 ` Darrick J. Wong
2019-02-11 12:21 ` Brian Foster
2019-02-11 8:08 ` Christoph Hellwig
2019-02-11 12:21 ` Brian Foster
2019-02-07 18:24 ` [PATCH v2 " Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190207163313.GC7991@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox