From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable
Date: Thu, 7 Feb 2019 08:33:13 -0800 [thread overview]
Message-ID: <20190207163313.GC7991@magnolia> (raw)
In-Reply-To: <20190207143114.GB2880@bfoster>
On Thu, Feb 07, 2019 at 09:31:15AM -0500, Brian Foster wrote:
> On Wed, Feb 06, 2019 at 08:55:36AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Use a rhashtable to cache the unlinked list incore. This should speed
> > up unlinked processing considerably when there are a lot of inodes on
> > the unlinked list because iunlink_remove no longer has to traverse an
> > entire bucket list to find which inode points to the one being removed.
> >
> > The incore list structure records "X.next_unlinked = Y" relations, with
> > the rhashtable using Y to index the records. This makes finding the
> > inode X that points to a inode Y very quick. If our cache fails to find
> > anything we can always fall back on the old method.
> >
> > FWIW this drastically reduces the amount of time it takes to remove
> > inodes from the unlinked list. I wrote a program to open a lot of
> > O_TMPFILE files and then close them in the same order, which takes
> > a very long time if we have to traverse the unlinked lists. With the
> > ptach, I see:
> >
> ...
> >
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
>
> A few minor comments below. With those addressed, this otherwise looks
> pretty good to me:
>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
>
> > fs/xfs/libxfs/xfs_errortag.h | 4 -
> > fs/xfs/xfs_error.c | 3
> > fs/xfs/xfs_inode.c | 269 +++++++++++++++++++++++++++++++++++++++++-
> > fs/xfs/xfs_inode.h | 3
> > fs/xfs/xfs_mount.c | 5 +
> > fs/xfs/xfs_mount.h | 7 +
> > fs/xfs/xfs_trace.h | 1
> > 7 files changed, 285 insertions(+), 7 deletions(-)
> >
> >
> ...
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 53a9c8c26d18..fdfcb3a9bac7 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1906,6 +1906,194 @@ xfs_inactive(
> > xfs_qm_dqdetach(ip);
> > }
> >
> ...
> > +/*
> > + * Remember that @prev_agino.next_unlinked = @this_agino. Fail loudly if there
> > + * already was an entry, but absorb any other runtime errors.
> > + */
> > +int
> > +xfs_iunlink_add_backref(
> > + struct xfs_perag *pag,
> > + xfs_agino_t prev_agino,
> > + xfs_agino_t this_agino)
> > +{
> > + struct xfs_iunlink *iu;
> > + int error;
> > +
> > + if (XFS_TEST_ERROR(false, pag->pag_mount, XFS_ERRTAG_IUNLINK_FALLBACK))
> > + return 0;
> > +
> > + iu = kmem_zalloc(sizeof(*iu), KM_SLEEP | KM_NOFS);
> > + iu->iu_agino = prev_agino;
> > + iu->iu_next_unlinked = this_agino;
> > +
> > + error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error == 0 || error == -EEXIST)
> > + return error;
> > + return 0;
>
> The return val looks funny at a glance: return -EEXIST or otherwise
> return 0 for success and all other errors (also the error == 0 looks
> superfluous). I see the comment above describes what this does, but it
> doesn't explain why. I'd move the comment to above the if check and
> explain why it's there (i.e., "Absorb all runtime errors besides -EEXIST
> because fallback blah blah. We care about -EEXIST because ...").
Will do.
> Also I assume we need to free the iu object on insert failure,
> regardless of whether we ultimately return the error.
Oops, yes, good catch!
> > +}
> > +
> > +/*
> > + * Replace X.next_unlinked = @agino with X.next_unlinked = @next_unlinked.
> > + * If @next_unlinked is NULLAGINO, we drop the backref and exit. If there
> > + * wasn't any such entry then we don't bother.
> > + */
> > +int
> > +xfs_iunlink_change_backref(
> > + struct xfs_perag *pag,
> > + xfs_agino_t agino,
> > + xfs_agino_t next_unlinked)
> > +{
> > + struct xfs_iunlink *iu;
> > + int error;
> > +
> > + /* Look up the old entry; if there wasn't one then exit. */
> > + iu = rhashtable_lookup_fast(&pag->pagi_unlinked_hash, &agino,
> > + xfs_iunlink_hash_params);
> > + if (!iu)
> > + return 0;
> > +
> > + /*
> > + * Remove the entry. This shouldn't ever return an error, but if we
> > + * couldn't remove the old entry we don't want to add it again to the
> > + * hash table, and if the entry disappeared on us then someone's
> > + * violated the locking rules and we need to fail loudly.
> > + */
> > + error = rhashtable_remove_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error)
> > + return error;
>
> What's interesting about remove failure is that if we otherwise found an
> iu for this inode, removing the inode from the unlinked list leaves a
> stale/incorrect iu around in the in-core table. So it's one thing for
> the in-core table to be a valid subset of the on-disk structures, but
> another thing entirely to have incoherence between the two. It might be
> worth pointing out that it's critical to return an error here so we
> don't actually remove the inode (whereas the re-add below is less
> strict).
Ok.
> > +
> > + /* If there is no new next entry just free our item and return. */
> > + if (next_unlinked == NULLAGINO) {
> > + kmem_free(iu);
> > + return 0;
> > + }
> > +
> > + /*
> > + * Update the hash table to the new entry, ignoring operational
> > + * errors.
> > + */
> > + iu->iu_next_unlinked = next_unlinked;
> > + error = rhashtable_insert_fast(&pag->pagi_unlinked_hash,
> > + &iu->iu_rhash_head, xfs_iunlink_hash_params);
> > + if (error == 0 || error == -EEXIST)
> > + return error;
> > + return 0;
>
> Similar error == 0 thing and potential memory leak here.
Refactored into a common helper to capture all the logic and
documentation.
> > +}
> > +
> ...
> > @@ -2141,6 +2341,27 @@ xfs_iunlink_map_prev(
> >
> > ASSERT(head_agino != target_agino);
> >
> > + /* See if our backref cache can find it faster. */
> > + *agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > + if (*agino != NULLAGINO) {
> > + error = xfs_iunlink_map_ino(tp, agno, *agino, imap, dipp, bpp);
> > + if (error)
> > + return error;
> > +
> > + if (be32_to_cpu((*dipp)->di_next_unlinked) == target_agino)
> > + return 0;
> > +
> > + /*
> > + * If we get here the cache contents were corrupt, so drop the
> > + * buffer and fall back to walking the bucket list.
> > + */
> > + xfs_trans_brelse(tp, *bpp);
> > + *bpp = NULL;
>
> I like the logic to ensure we don't screw around with an inode that
> doesn't actually point to our target, but I do wonder whether we should
> do a little more than silently ignore the fact we found an incoherent
> backref. Even just a WARN_ON[_ONCE]() or something here to help us
> detect this case during testing might be sufficient..?
Yeah, that's a good idea. The cache can miss, but it shouldn't ever be
corrupt.
--D
>
> Brian
>
> > + }
> > +
> > + trace_xfs_iunlink_map_prev_fallback(mp, agno);
> > +
> > + /* Otherwise, walk the entire bucket until we find it. */
> > next_agino = head_agino;
> > while (next_agino != target_agino) {
> > xfs_agino_t unlinked_agino;
> > @@ -2186,6 +2407,7 @@ xfs_iunlink_remove(
> > struct xfs_buf *agibp;
> > struct xfs_buf *last_ibp = NULL;
> > struct xfs_dinode *last_dip = NULL;
> > + struct xfs_perag *pag = NULL;
> > xfs_agnumber_t agno = XFS_INO_TO_AGNO(mp, ip->i_ino);
> > xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ip->i_ino);
> > xfs_agino_t next_agino;
> > @@ -2221,27 +2443,62 @@ xfs_iunlink_remove(
> > if (error)
> > return error;
> >
> > + /*
> > + * If there was a backref pointing from the next inode back to this
> > + * one, remove it because we've removed this inode from the list.
> > + *
> > + * Later, if this inode was in the middle of the list we'll update
> > + * this inode's backref to point from the next inode.
> > + */
> > + if (next_agino != NULLAGINO) {
> > + pag = xfs_perag_get(mp, agno);
> > + error = xfs_iunlink_change_backref(pag, next_agino,
> > + NULLAGINO);
> > + if (error)
> > + goto out;
> > + }
> > +
> > if (head_agino == agino) {
> > /* Point the head of the list to the next unlinked inode. */
> > error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> > next_agino);
> > if (error)
> > - return error;
> > + goto out;
> > } else {
> > struct xfs_imap imap;
> > xfs_agino_t prev_agino;
> >
> > + if (!pag)
> > + pag = xfs_perag_get(mp, agno);
> > +
> > /* We need to search the list for the inode being freed. */
> > error = xfs_iunlink_map_prev(tp, agno, head_agino, agino,
> > - &prev_agino, &imap, &last_dip, &last_ibp);
> > + &prev_agino, &imap, &last_dip, &last_ibp,
> > + pag);
> > if (error)
> > - return error;
> > + goto out;
> >
> > /* Point the previous inode on the list to the next inode. */
> > xfs_iunlink_update_dinode(tp, agno, prev_agino, last_ibp,
> > last_dip, &imap, next_agino);
> > +
> > + /*
> > + * Now we deal with the backref for this inode. If this inode
> > + * pointed at a real inode, change the backref that pointed to
> > + * us to point to our old next. If this inode was the end of
> > + * the list, delete the backref that pointed to us. Note that
> > + * change_backref takes care of deleting the backref if
> > + * next_agino is NULLAGINO.
> > + */
> > + error = xfs_iunlink_change_backref(pag, agino, next_agino);
> > + if (error)
> > + goto out;
> > }
> > - return 0;
> > +
> > +out:
> > + if (pag)
> > + xfs_perag_put(pag);
> > + return error;
> > }
> >
> > /*
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index be2014520155..e62074a5257c 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -500,4 +500,7 @@ extern struct kmem_zone *xfs_inode_zone;
> >
> > bool xfs_inode_verify_forks(struct xfs_inode *ip);
> >
> > +int xfs_iunlink_init(struct xfs_perag *pag);
> > +void xfs_iunlink_destroy(struct xfs_perag *pag);
> > +
> > #endif /* __XFS_INODE_H__ */
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index b4d8c318be3c..fd63b0b1307c 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -149,6 +149,7 @@ xfs_free_perag(
> > spin_unlock(&mp->m_perag_lock);
> > ASSERT(pag);
> > ASSERT(atomic_read(&pag->pag_ref) == 0);
> > + xfs_iunlink_destroy(pag);
> > xfs_buf_hash_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > call_rcu(&pag->rcu_head, __xfs_free_perag);
> > @@ -227,6 +228,9 @@ xfs_initialize_perag(
> > /* first new pag is fully initialized */
> > if (first_initialised == NULLAGNUMBER)
> > first_initialised = index;
> > + error = xfs_iunlink_init(pag);
> > + if (error)
> > + goto out_hash_destroy;
> > }
> >
> > index = xfs_set_inode_alloc(mp, agcount);
> > @@ -249,6 +253,7 @@ xfs_initialize_perag(
> > if (!pag)
> > break;
> > xfs_buf_hash_destroy(pag);
> > + xfs_iunlink_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > kmem_free(pag);
> > }
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 7daafe064af8..a33f45077867 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -396,6 +396,13 @@ typedef struct xfs_perag {
> >
> > /* reference count */
> > uint8_t pagf_refcount_level;
> > +
> > + /*
> > + * Unlinked inode information. This incore information reflects
> > + * data stored in the AGI, so callers must hold the AGI buffer lock
> > + * or have some other means to control concurrency.
> > + */
> > + struct rhashtable pagi_unlinked_hash;
> > } xfs_perag_t;
> >
> > static inline struct xfs_ag_resv *
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index a6e384a642b1..c83ce022a355 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3447,6 +3447,7 @@ DEFINE_EVENT(xfs_ag_inode_class, name, \
> > TP_ARGS(ip))
> > DEFINE_AGINODE_EVENT(xfs_iunlink);
> > DEFINE_AGINODE_EVENT(xfs_iunlink_remove);
> > +DEFINE_AG_EVENT(xfs_iunlink_map_prev_fallback);
> >
> > #endif /* _TRACE_XFS_H */
> >
> >
next prev parent reply other threads:[~2019-02-07 16:33 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-06 16:54 [PATCH v3 0/9] xfs: incore unlinked list Darrick J. Wong
2019-02-06 16:54 ` [PATCH 1/9] xfs: clean up iunlink functions Darrick J. Wong
2019-02-06 16:54 ` [PATCH 2/9] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
2019-02-06 16:54 ` [PATCH 3/9] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
2019-02-06 16:55 ` [PATCH 4/9] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
2019-02-06 16:55 ` [PATCH 5/9] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
2019-02-06 16:55 ` [PATCH 6/9] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
2019-02-07 14:28 ` Brian Foster
2019-02-07 16:19 ` Darrick J. Wong
2019-02-08 7:57 ` Christoph Hellwig
2019-02-06 16:55 ` [PATCH 7/9] xfs: refactor inode update in iunlink_remove Darrick J. Wong
2019-02-06 16:55 ` [PATCH 8/9] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
2019-02-06 16:55 ` [PATCH 9/9] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
2019-02-07 14:31 ` Brian Foster
2019-02-07 16:33 ` Darrick J. Wong [this message]
2019-02-08 8:00 ` Christoph Hellwig
2019-02-08 12:06 ` Brian Foster
2019-02-08 16:54 ` Darrick J. Wong
2019-02-11 12:21 ` Brian Foster
2019-02-11 8:08 ` Christoph Hellwig
2019-02-11 12:21 ` Brian Foster
2019-02-07 18:24 ` [PATCH v2 " Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190207163313.GC7991@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.