public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 01/10] xfs: create simplified inode walk function
Date: Mon, 10 Jun 2019 09:59:09 -0700	[thread overview]
Message-ID: <20190610165909.GI1871505@magnolia> (raw)
In-Reply-To: <20190610135816.GA6473@bfoster>

On Mon, Jun 10, 2019 at 09:58:19AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:49:34PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a new iterator function to simplify walking inodes in an XFS
> > filesystem.  This new iterator will replace the existing open-coded
> > walking that goes on in various places.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/Makefile                  |    1 
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
> >  fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
> >  fs/xfs/xfs_itable.c              |    5 
> >  fs/xfs/xfs_itable.h              |    8 +
> >  fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_iwalk.h               |   18 ++
> >  fs/xfs/xfs_trace.h               |   40 ++++
> >  8 files changed, 502 insertions(+), 4 deletions(-)
> >  create mode 100644 fs/xfs/xfs_iwalk.c
> >  create mode 100644 fs/xfs/xfs_iwalk.h
> > 
> > 
> ...
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index ac4b65da4c2b..cb7eac2f51c0 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -564,6 +564,34 @@ xfs_inobt_max_size(
> >  					XFS_INODES_PER_CHUNK);
> >  }
> >  
> > +/* Read AGI and create inobt cursor. */
> > +int
> > +xfs_inobt_cur(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	struct xfs_btree_cur	**curpp,
> > +	struct xfs_buf		**agi_bpp)
> > +{
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +
> > +	ASSERT(*agi_bpp == NULL);
> > +
> 
> FYI, the xfs_inobt_count_blocks() caller doesn't initialize the pointer
> according to the assert.

AgAGkgjwepth, there's a gcc plugin that makes all uninitialized stack
variables zero now, and so I never see these things anymore. :(

Thanks for catching this.

> 
> > +	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
> > +	if (error)
> > +		return error;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
> > +	if (!cur) {
> > +		xfs_trans_brelse(tp, *agi_bpp);
> > +		*agi_bpp = NULL;
> > +		return -ENOMEM;
> > +	}
> > +	*curpp = cur;
> > +	return 0;
> > +}
> > +
> >  static int
> >  xfs_inobt_count_blocks(
> >  	struct xfs_mount	*mp,
> ...
> > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > new file mode 100644
> > index 000000000000..3e6c06e69c75
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iwalk.c
> > @@ -0,0 +1,400 @@
> ...
> > +/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
> > +STATIC int
> > +xfs_iwalk_ag(
> > +	struct xfs_iwalk_ag		*iwag)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_buf			*agi_bp = NULL;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agino_t			agino;
> > +	int				has_more;
> > +	int				error = 0;
> > +
> > +	/* Set up our cursor at the right place in the inode btree. */
> > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > +
> > +	while (!error && has_more) {
> > +		struct xfs_inobt_rec_incore	*irec;
> > +
> > +		cond_resched();
> > +
> > +		/* Fetch the inobt record. */
> > +		irec = &iwag->recs[iwag->nr_recs];
> > +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> > +		if (error || !has_more)
> > +			break;
> > +
> > +		/* No allocated inodes in this chunk; skip it. */
> > +		if (irec->ir_freecount == irec->ir_count) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			if (error)
> > +				break;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Start readahead for this inode chunk in anticipation of
> > +		 * walking the inodes.
> > +		 */
> > +		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> > +
> > +		/*
> > +		 * If there's space in the buffer for more records, increment
> > +		 * the btree cursor and grab more.
> > +		 */
> > +		if (++iwag->nr_recs < iwag->sz_recs) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			if (error || !has_more)
> > +				break;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Otherwise, we need to save cursor state and run the callback
> > +		 * function on the cached records.  The run_callbacks function
> > +		 * is supposed to return a cursor pointing to the record where
> > +		 * we would be if we had been able to increment like above.
> > +		 */
> > +		error = xfs_iwalk_run_callbacks(iwag, xfs_iwalk_ag_recs, agno,
> > +				&cur, &agi_bp, &has_more);
> > +	}
> > +
> > +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > +
> > +	/* Walk any records left behind in the cache. */
> > +	if (iwag->nr_recs == 0 || error)
> > +		return error;
> > +
> > +	return xfs_iwalk_ag_recs(iwag);
> 
> Hmm, I find the above pattern to process the leftover records a bit
> confusing because of how it is open coded. Could we find a way to reuse
> xfs_iwalk_run_callbacks() in both cases so it looks more obvious? For
> example, pass a flag to indicate whether the callback helper should
> recreate the cursor for continued processing. FWIW, it looks like
> has_more already reflects that state in the current logic above.

Yeah, I think this can be done without making the function too
incohesive.

> > +}
> > +
> > +/*
> > + * Given the number of inodes to prefetch, set the number of inobt records that
> > + * we cache in memory, which controls the number of inodes we try to read
> > + * ahead.
> > + *
> > + * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
> > + * this should be plenty of inodes to read ahead.  This number (256 inobt
> > + * records) was chosen so that the cache is never more than a single memory
> > + * page.
> > + */
> > +static inline void
> > +xfs_iwalk_set_prefetch(
> > +	struct xfs_iwalk_ag	*iwag,
> > +	unsigned int		max_prefetch)
> > +{
> > +	if (max_prefetch)
> > +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> > +					XFS_INODES_PER_CHUNK;
> > +	else
> > +		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
> > +
> 
> Perhaps this should use PAGE_SIZE or a related macro?

It did in the previous revision, but Dave pointed out that sz_recs then
becomes quite large on a system with 64k pages...

65536 bytes / 16 bytes per inobt record = 4096 records
4096 records * 64 inodes per record = 262144 inodes
262144 inodes * 512 bytes per inode = 128MB of inode readahead

I could extend the comment to explain why we don't use PAGE_SIZE...

/*
 * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
 * constrain the amount of inode readahead to 16k inodes regardless of CPU:
 *
 * 4096 bytes / 16 bytes per inobt record = 256 inobt records
 * 256 inobt records * 64 inodes per record = 16384 inodes
 * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
 */

--D

> Brian
> 
> > +	/*
> > +	 * Allocate enough space to prefetch at least two records so that we
> > +	 * can cache both the inobt record where the iwalk started and the next
> > +	 * record.  This simplifies the AG inode walk loop setup code.
> > +	 */
> > +	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
> > +}
> > +
> > +/*
> > + * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
> > + * will be called for each allocated inode, being passed the inode's number and
> > + * @data.  @max_prefetch controls how many inobt records' worth of inodes we
> > + * try to readahead.
> > + */
> > +int
> > +xfs_iwalk(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_ino_t		startino,
> > +	xfs_iwalk_fn		iwalk_fn,
> > +	unsigned int		max_prefetch,
> > +	void			*data)
> > +{
> > +	struct xfs_iwalk_ag	iwag = {
> > +		.mp		= mp,
> > +		.tp		= tp,
> > +		.iwalk_fn	= iwalk_fn,
> > +		.data		= data,
> > +		.startino	= startino,
> > +	};
> > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > +	int			error;
> > +
> > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > +
> > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> > +	error = xfs_iwalk_alloc(&iwag);
> > +	if (error)
> > +		return error;
> > +
> > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_iwalk_ag(&iwag);
> > +		if (error)
> > +			break;
> > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > +	}
> > +
> > +	xfs_iwalk_free(&iwag);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > new file mode 100644
> > index 000000000000..45b1baabcd2d
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iwalk.h
> > @@ -0,0 +1,18 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +/*
> > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + */
> > +#ifndef __XFS_IWALK_H__
> > +#define __XFS_IWALK_H__
> > +
> > +/* Walk all inodes in the filesystem starting from @startino. */
> > +typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > +			    xfs_ino_t ino, void *data);
> > +/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
> > +#define XFS_IWALK_ABORT	(1)
> > +
> > +int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> > +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> > +
> > +#endif /* __XFS_IWALK_H__ */
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 2464ea351f83..f9bb1d50bc0e 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
> >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
> >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
> >  
> > +TRACE_EVENT(xfs_iwalk_ag,
> > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		 xfs_agino_t startino),
> > +	TP_ARGS(mp, agno, startino),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_agnumber_t, agno)
> > +		__field(xfs_agino_t, startino)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->agno = agno;
> > +		__entry->startino = startino;
> > +	),
> > +	TP_printk("dev %d:%d agno %d startino %u",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > +		  __entry->startino)
> > +)
> > +
> > +TRACE_EVENT(xfs_iwalk_ag_rec,
> > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		 struct xfs_inobt_rec_incore *irec),
> > +	TP_ARGS(mp, agno, irec),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_agnumber_t, agno)
> > +		__field(xfs_agino_t, startino)
> > +		__field(uint64_t, freemask)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->agno = agno;
> > +		__entry->startino = irec->ir_startino;
> > +		__entry->freemask = irec->ir_free;
> > +	),
> > +	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > +		  __entry->startino, __entry->freemask)
> > +)
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 

  reply	other threads:[~2019-06-10 16:59 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 16:59     ` Darrick J. Wong [this message]
2019-06-10 17:55       ` Brian Foster
2019-06-10 23:11         ` Darrick J. Wong
2019-06-11 22:33           ` Dave Chinner
2019-06-11 23:05             ` Darrick J. Wong
2019-06-12 12:13               ` Brian Foster
2019-06-12 16:53                 ` Darrick J. Wong
2019-06-12 17:54               ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 17:10     ` Darrick J. Wong
2019-06-11 23:23     ` Dave Chinner
2019-06-12  0:32       ` Darrick J. Wong
2019-06-12 12:55         ` Brian Foster
2019-06-12 23:33           ` Dave Chinner
2019-06-13 18:34             ` Brian Foster
2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
2019-06-10 13:59   ` Brian Foster
2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-10 17:38     ` Darrick J. Wong
2019-06-10 18:29       ` Brian Foster
2019-06-10 23:42         ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
2019-06-10 19:40   ` Brian Foster
2019-06-11  1:10     ` Darrick J. Wong
2019-06-11 13:13       ` Brian Foster
2019-06-11 15:29         ` Darrick J. Wong
2019-06-11 17:00           ` Brian Foster
2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
2019-06-11 15:07   ` Brian Foster
2019-06-11 16:06     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster
2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
2019-06-11 15:08   ` Brian Foster
2019-06-11 16:21     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190610165909.GI1871505@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=bfoster@redhat.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox