From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/10] xfs: cache unlinked pointers in an rhashtable
Date: Tue, 5 Feb 2019 09:53:09 -0800 [thread overview]
Message-ID: <20190205175309.GJ7991@magnolia> (raw)
In-Reply-To: <20190205142458.GD51421@bfoster>
On Tue, Feb 05, 2019 at 09:24:59AM -0500, Brian Foster wrote:
> On Mon, Feb 04, 2019 at 10:00:05AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Use a rhashtable to cache the unlinked list incore. This should speed
> > up unlinked processing considerably when there are a lot of inodes on
> > the unlinked list because iunlink_remove no longer has to traverse an
> > entire bucket list to find which inode points to the one being removed.
> >
> > The incore list structure records "X.next_unlinked = Y" relations, with
> > the rhashtable using Y to index the records. This makes finding the
> > inode X that points to a inode Y very quick. If our cache fails to find
> > anything we can always fall back on the old method.
> >
> > FWIW this drastically reduces the amount of time it takes to remove
> > inodes from the unlinked list. I wrote a program to open a lot of
> > O_TMPFILE files and then close them in the same order, which takes
> > a very long time if we have to traverse the unlinked lists. With the
> > ptach, I see:
> >
> ...
> > ---
> > fs/xfs/xfs_inode.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++
> > fs/xfs/xfs_inode.h | 9 ++
> > fs/xfs/xfs_log_recover.c | 12 ++-
> > fs/xfs/xfs_mount.c | 5 +
> > fs/xfs/xfs_mount.h | 1
> > 5 files changed, 233 insertions(+), 1 deletion(-)
> >
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index b9696d762c8f..baee8c894447 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1880,6 +1880,167 @@ xfs_inactive(
> > xfs_qm_dqdetach(ip);
> > }
> >
> ...
> > +
> > +static const struct rhashtable_params xfs_iunlink_hash_params = {
> > + .min_size = XFS_AGI_UNLINKED_BUCKETS,
> > + .nelem_hint = 512,
>
> Any reasoning behind the 512 value? It seems rather large to me, at
> least until we get more into deferred inactivation and whatnot. It looks
> like the rhashtable code will round this up to 1024 as well, FWIW.
>
> I'm also wondering whether a kmem_zone might be worthwhile for
> xfs_iunlink structures, but that's probably also more for when we expect
> to drive deeper unlinked lists.
I picked an arbitrary value of 64 buckets * 8 items per list. I /do/
have plans to test various values to see if there's a particular sweet
spot, though I guess this could be much lower on the assumption that
we don't expect /that/ many unlinked inodes(?)
> > + .key_len = sizeof(xfs_agino_t),
> > + .key_offset = offsetof(struct xfs_iunlink, iu_next_unlinked),
> > + .head_offset = offsetof(struct xfs_iunlink, iu_rhash_head),
> > + .automatic_shrinking = true,
> > + .obj_cmpfn = xfs_iunlink_obj_cmpfn,
> > +};
> > +
> ...
> > /*
> > * Point the AGI unlinked bucket at an inode and log the results. The caller
> > * is responsible for validating the old value.
> > @@ -2055,6 +2216,14 @@ xfs_iunlink(
> > if (error)
> > goto out;
> > ASSERT(old_agino == NULLAGINO);
> > +
> > + /*
> > + * agino has been unlinked, add a backref from the next inode
> > + * back to agino.
> > + */
> > + error = xfs_iunlink_add_backref(pag, agino, next_agino);
> > + if (error)
> > + goto out;
>
> At a glance, it looks like -ENOMEM/-EEXIST is possible from
> rhashtable_insert_fast(). Do we really want to translate those into a
> broader operational failure, or perhaps just skip the hashtable update?
> The latter seems more appropriate given that we already account for the
> possibility of a missing hashtable entry on lookup.
Good point, we could be much more resilient to backref cache failures
since we do have the option of doing it the slow way.
(And, I guess by extension, a debugging knob or something to disable the
cache so that we can test the bucket list walker...)
> > }
> >
> > /* Point the head of the list to point to this inode. */
> > @@ -2127,6 +2296,17 @@ xfs_iunlink_map_prev(
> >
> > ASSERT(head_agino != target_agino);
> >
> > + /* See if our backref cache can find it faster. */
> > + next_agino = xfs_iunlink_lookup_backref(pag, target_agino);
> > + if (next_agino != NULLAGINO) {
> > + next_ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> > + error = xfs_iunlink_map_ino(tp, next_ino, imap, &last_dip,
> > + &last_ibp);
> > + if (error)
> > + return error;
>
> Could we assert or somehow or another verify that
> last_dip->di_next_unlinked points at target_agino before we proceed?
Ok.
> > + goto out;
> > + }
> > +
> > next_agino = head_agino;
> > while (next_agino != target_agino) {
> > xfs_agino_t unlinked_agino;
> > @@ -2156,6 +2336,7 @@ xfs_iunlink_map_prev(
> > next_agino = unlinked_agino;
> > }
> >
> > +out:
> > /* Should never happen, but don't return garbage. */
> > if (next_ino == NULLFSINO)
> > return -EFSCORRUPTED;
> > @@ -2218,6 +2399,20 @@ xfs_iunlink_remove(
> > if (error)
> > goto out;
> >
> > + /*
> > + * If there was a backref pointing from the next inode back to this
> > + * one, remove it because we've removed this inode from the list.
> > + *
> > + * Later, if this inode was in the middle of the list we'll update
> > + * this inode's backref to point from the next inode.
> > + */
> > + if (next_agino != NULLAGINO) {
> > + error = xfs_iunlink_change_backref(pag, next_agino,
> > + NULLAGINO);
> > + if (error)
> > + goto out;
> > + }
> > +
> > if (head_agino == agino) {
> > /* Point the head of the list to the next unlinked inode. */
> > error = xfs_iunlink_update_bucket(tp, agno, agibp, bucket_index,
> > @@ -2236,6 +2431,18 @@ xfs_iunlink_remove(
> > /* Point the previous inode on the list to the next inode. */
> > xfs_iunlink_update_dinode(tp, agno, last_ibp, last_dip, &imap,
> > prev_ino, next_agino);
> > +
> > + /*
> > + * Now we deal with the backref for this inode. If this inode
> > + * pointed at a real inode, change the backref that pointed to
> > + * us to point to our old next. If this inode was the end of
> > + * the list, delete the backref that pointed to us. Note that
> > + * change_backref takes care of deleting the backref if
> > + * next_agino is NULLAGINO.
> > + */
>
> Thanks for the comment.
>
> > + error = xfs_iunlink_change_backref(pag, agino, next_agino);
> > + if (error)
> > + goto out;
>
> Ok, but the whole lookup path accounts for the possibility of a missing
> entry and falls back to the old on-disk walk. It doesn't look like we
> handle that properly here. IOW, if the hashtable lookup ever fails and
> we do fallback as such, we're guaranteed to eventually fail here.
>
> It seems to me we need to be extra careful so as to not turn in-core
> hashtable issues (whether it be external things such as -ENOENT or
> internal -ENOMEM or whatever) into fatal filesystem errors.
<nod> I'll fix this for v3 so that backref cache failures don't shut
down the filesystem.
> > }
> > pag->pagi_unlinked_count--;
> > out:
> ...
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index c634fbfea4a8..f5fb8885662f 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -5078,10 +5078,20 @@ xlog_recover_process_one_iunlink(
> > agino = be32_to_cpu(dip->di_next_unlinked);
> > xfs_buf_relse(ibp);
> >
> > - /* Make sure the in-core data knows about this unlinked inode. */
> > + /*
> > + * Make sure the in-core data knows about this unlinked inode. Since
> > + * our iunlinks recovery basically just deletes the head of a bucket
> > + * list until the bucket is empty, we need only to add the backref from
> > + * the current list item to the next one, if this isn't the list tail.
> > + */
> > pag = xfs_perag_get(mp, agno);
> > pag->pagi_unlinked_count++;
> > + if (agino != NULLAGINO)
> > + error = xfs_iunlink_add_backref(pag, XFS_INO_TO_AGINO(mp, ino),
> > + agino);
>
> ISTM that fixing the iunlink_remove code to not rely on hashtable
> entries could also alleviate the need for this hack?
>
> > xfs_perag_put(pag);
> > + if (error)
> > + goto fail_iput;
>
> Similar potential for error amplification here.
Agreed.
--D
> Brian
>
> >
> > /*
> > * Prevent any DMAPI event from being sent when the reference on
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 6ca92a71c233..9a181f7ca1d5 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -151,6 +151,7 @@ xfs_free_perag(
> > ASSERT(atomic_read(&pag->pag_ref) == 0);
> > ASSERT(pag->pagi_unlinked_count == 0 ||
> > XFS_FORCED_SHUTDOWN(mp));
> > + xfs_iunlink_destroy(pag);
> > xfs_buf_hash_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > call_rcu(&pag->rcu_head, __xfs_free_perag);
> > @@ -229,6 +230,9 @@ xfs_initialize_perag(
> > /* first new pag is fully initialized */
> > if (first_initialised == NULLAGNUMBER)
> > first_initialised = index;
> > + error = xfs_iunlink_init(pag);
> > + if (error)
> > + goto out_hash_destroy;
> > }
> >
> > index = xfs_set_inode_alloc(mp, agcount);
> > @@ -251,6 +255,7 @@ xfs_initialize_perag(
> > if (!pag)
> > break;
> > xfs_buf_hash_destroy(pag);
> > + xfs_iunlink_destroy(pag);
> > mutex_destroy(&pag->pag_ici_reclaim_lock);
> > kmem_free(pag);
> > }
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 169069a01f3c..dcc45b441dd6 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -391,6 +391,7 @@ typedef struct xfs_perag {
> >
> > /* unlinked inode info; lock AGI to access */
> > unsigned int pagi_unlinked_count;
> > + struct rhashtable pagi_unlinked_hash;
> > } xfs_perag_t;
> >
> > static inline struct xfs_ag_resv *
> >
next prev parent reply other threads:[~2019-02-05 17:53 UTC|newest]
Thread overview: 42+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-04 17:58 [PATCH v2 00/10] xfs: incore unlinked list Darrick J. Wong
2019-02-04 17:59 ` [PATCH 01/10] xfs: clean up iunlink functions Darrick J. Wong
2019-02-04 19:24 ` Brian Foster
2019-02-04 20:51 ` Christoph Hellwig
2019-02-04 17:59 ` [PATCH 02/10] xfs: track unlinked inode counts in per-ag data Darrick J. Wong
2019-02-04 19:24 ` Brian Foster
2019-02-04 20:53 ` Christoph Hellwig
2019-02-05 0:26 ` Darrick J. Wong
2019-02-05 6:55 ` Christoph Hellwig
2019-02-05 7:07 ` Darrick J. Wong
2019-02-05 0:42 ` Darrick J. Wong
2019-02-04 17:59 ` [PATCH 03/10] xfs: add xfs_verify_agino_or_null helper Darrick J. Wong
2019-02-04 19:24 ` Brian Foster
2019-02-04 20:53 ` Christoph Hellwig
2019-02-04 17:59 ` [PATCH 04/10] xfs: refactor AGI unlinked bucket updates Darrick J. Wong
2019-02-04 19:25 ` Brian Foster
2019-02-04 20:54 ` Christoph Hellwig
2019-02-04 17:59 ` [PATCH 05/10] xfs: strengthen AGI unlinked inode bucket pointer checks Darrick J. Wong
2019-02-04 17:59 ` [PATCH 06/10] xfs: refactor inode unlinked pointer update functions Darrick J. Wong
2019-02-04 20:56 ` Christoph Hellwig
2019-02-04 21:49 ` Darrick J. Wong
2019-02-05 14:23 ` Brian Foster
2019-02-04 17:59 ` [PATCH 07/10] xfs: refactor unlinked list search and mapping to a separate function Darrick J. Wong
2019-02-04 20:58 ` Christoph Hellwig
2019-02-04 22:23 ` Darrick J. Wong
2019-02-04 17:59 ` [PATCH 08/10] xfs: refactor inode update in iunlink_remove Darrick J. Wong
2019-02-04 20:58 ` Christoph Hellwig
2019-02-05 14:23 ` Brian Foster
2019-02-04 17:59 ` [PATCH 09/10] xfs: add tracepoints for high level iunlink operations Darrick J. Wong
2019-02-04 20:59 ` Christoph Hellwig
2019-02-05 14:24 ` Brian Foster
2019-02-04 18:00 ` [PATCH 10/10] xfs: cache unlinked pointers in an rhashtable Darrick J. Wong
2019-02-04 21:08 ` Christoph Hellwig
2019-02-05 1:02 ` Darrick J. Wong
2019-02-05 14:24 ` Brian Foster
2019-02-05 17:53 ` Darrick J. Wong [this message]
2019-02-05 17:57 ` Christoph Hellwig
2019-02-05 18:17 ` Darrick J. Wong
2019-02-05 19:06 ` Brian Foster
2019-02-05 19:20 ` Darrick J. Wong
2019-02-05 20:59 ` Dave Chinner
2019-02-06 18:25 ` Darrick J. Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190205175309.GJ7991@magnolia \
--to=darrick.wong@oracle.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).