Re: [PATCH 1/5] xfs: replace xfs_btree_has_record with a general keyspace scanner

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/5] xfs: replace xfs_btree_has_record with a general keyspace scanner
Date: Wed, 2 Nov 2022 18:56:12 -0700	[thread overview]
Message-ID: <Y2MfvER9CbKgbbyd@magnolia> (raw)
In-Reply-To: <20221102014745.GT3600936@dread.disaster.area>

On Wed, Nov 02, 2022 at 12:47:45PM +1100, Dave Chinner wrote:
> On Sun, Oct 02, 2022 at 11:20:16AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The current implementation of xfs_btree_has_record returns true if it
> > finds /any/ record within the given range.  Unfortunately, that's not
> > sufficient for scrub.  We want to be able to tell if a range of keyspace
> > for a btree is devoid of records, is totally mapped to records, or is
> > somewhere in between.  By forcing this to be a boolean, we were
> > generally missing the "in between" case and returning incorrect results.
> > Fix the API so that we can tell the caller which of those three is the
> > current state.
> 
> My first reaction is that this "keyfill" API name is .... awful.

/me smacks head, realizes that "fill" could be interpreted as an action
verb, instead of a noun.  "fullness" might have been better.

This function scans part of a btree's keyspace to determine the fullness
factor (empty, totally full, sparse).

xfs_rmapbt_keyspace_fullness ?
                   _sparseness?
		   _contiguity?

That's the best thing I can think of, though my brain is a little tired
right now.  I could even leave it as _has_record just to avoid the
rename costs, though "has records" is a little vague.

OTOH "keyspace" is one of those jargon terms that comes from database
theory land.

> From an API perspective, all we are doing is changing the
> "has_record()" API that returns a boolean to return a tri-state - we
> add a "partial" return to the "all" and "none" states we currently
> return. The whole API doesn't need renaming - it's impossible to
> work out what "scan_keyfill" iis actually intended to do, whereas
> "has_record"  is very much self documenting....

Ok.

> Hence I think that the implementation of xfs_btree_has_record()
> needs to change, I don't think the entire API needs to be renamed.
> All that needs to happen to the higher level API is this conversion:
> 
> > -	bool			*exists)
> > +	enum xfs_btree_keyfill	*exists)
> 
> Even then, the enum could be named for what it means -
> xfs_btree_rec_overlap - with values for none, full and partial. At
> this point, nothing above xfs_btree_has record needs to know
> anything about whatever a "key fill" operation might actually be.

/me wishes he'd thought of "keyspace contiguity" earlier.

Though that's a lot of long words.  I'll take your suggestion to leave
the name as _has_records.  However, we're not actually measuring the
amount of *overlap* between records -- what we're really looking for is
the btree record equivalent of file holes.

enum xfs_btree_rec_contiguity?

> >  static const struct xfs_btree_ops xfs_cntbt_ops = {
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > index cfa052d40105..d1225b957649 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > @@ -518,6 +518,18 @@ xfs_bmbt_recs_inorder(
> >  		xfs_bmbt_disk_get_startoff(&r2->bmbt);
> >  }
> >  
> > +STATIC bool
> > +xfs_bmbt_has_key_gap(
> > +	struct xfs_btree_cur		*cur,
> > +	const union xfs_btree_key	*key1,
> > +	const union xfs_btree_key	*key2)
> > +{
> > +	xfs_fileoff_t			next;
> > +
> > +	next = be64_to_cpu(key1->bmbt.br_startoff) + 1;
> > +	return next != be64_to_cpu(key2->bmbt.br_startoff);
> > +}
> 
> IDGI - this is an extent tree - there is no gap if the extent at
> key2 starts at the end of key1. IOWs, this only returns "no gap"
> if the extent at key 1 is a single block in length. I'll come back
> to this...
> 
> Oh, does this assume that the caller has already created a key to a
> nonexistent record in the BMBT that points to the end of the first
> extent?

Yes.

> i.e. that this method is being called with key1 being a high
> key for the bmbt record (i.e. an end pointer) and key2 being a low
> key for the bmbt record (i.e. a start pointer)?

Generically, the _has_key_gap functions take two record keys A and C and
decide if is possible for there to be a third record key B satisfying
this relationship: A < B < C.  For the operation to make any sense, it's
very strongly implied that A is the high key of a record R and B is the
low key of a record S.  Technically, however, there's no reason why you
can't pass any two keys.

> If so, this API needs some documentation on how it is expected to be
> used - at least name the two keys something more descriptive like
> "high key" and "next low key"....

Now that I know that scrub is the only user of the key gap functions,
I'm confident that the function signatures can s/key1/high_key/ and
s/key2/low_key/.

Clearly I also need to improve the documentation for this function.

"Given two btree keys @high_key and @low_key, decide if it is possible
for there to be a third btree key K satisfying the relationship
@high_key < K < @low_key.  To determine the sparseness of the keyspace
for a pair of btree records, @high_key should be the high key of a
record and @low_key should be the low key of the next record in the
record set."

Not sure if that's any better though.

> It occurs to me that what I'm actually doing here is reverse
> engineering the design of this mechanism from the code because
> there's no documentation in the patch or the commit description of
> the algorithm being used to find overlapping records....

That's the (severe!) downside of talking to a database guy -- these
kinds of things are obvious to me, but that's not everyone's background.

> >  static const struct xfs_btree_ops xfs_bmbt_ops = {
> >  	.rec_len		= sizeof(xfs_bmbt_rec_t),
> >  	.key_len		= sizeof(xfs_bmbt_key_t),
> > @@ -538,6 +550,7 @@ static const struct xfs_btree_ops xfs_bmbt_ops = {
> >  	.buf_ops		= &xfs_bmbt_buf_ops,
> >  	.keys_inorder		= xfs_bmbt_keys_inorder,
> >  	.recs_inorder		= xfs_bmbt_recs_inorder,
> > +	.has_key_gap		= xfs_bmbt_has_key_gap,
> >  };
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 4c16c8c31fcb..5710d3ee582a 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -5008,34 +5008,100 @@ xfs_btree_diff_two_ptrs(
> >  	return (int64_t)be32_to_cpu(a->s) - be32_to_cpu(b->s);
> >  }
> >  
> > -/* If there's an extent, we're done. */
> > +struct xfs_btree_scan_keyfill {
> > +	/* Keys for the start and end of the range we want to know about. */
> > +	union xfs_btree_key	start_key;
> > +	union xfs_btree_key	end_key;
> > +
> > +	/* Highest record key we've seen so far. */
> > +	union xfs_btree_key	high_key;
> > +
> > +	enum xfs_btree_keyfill	outcome;
> > +};
> 
> This "keyfill" scan operation is completely private to
> xfs_btree_has_record(), which further indicates the higher level API
> should not be renamed "keyfill"....

struct xfbt_has_records?

> > +
> >  STATIC int
> > -xfs_btree_has_record_helper(
> > +xfs_btree_scan_keyfill_helper(
> >  	struct xfs_btree_cur		*cur,
> >  	const union xfs_btree_rec	*rec,
> >  	void				*priv)
> >  {
> > -	return -ECANCELED;
> > +	union xfs_btree_key		rec_key;
> > +	union xfs_btree_key		rec_high_key;
> > +	struct xfs_btree_scan_keyfill	*info = priv;
> > +	int64_t				res;
> > +
> > +	cur->bc_ops->init_key_from_rec(&rec_key, rec);
> > +
> > +	if (info->outcome == XFS_BTREE_KEYFILL_EMPTY) {
> > +		info->outcome = XFS_BTREE_KEYFILL_SPARSE;
> > +
> > +		/* Bail if the first record starts after the start key. */
> > +		res = cur->bc_ops->diff_two_keys(cur, &info->start_key,
> > +				&rec_key);
> > +		if (res < 0)
> > +			return -ECANCELED;
> 
> Better comment needed.
> 
> 		/*
> 		 * If the first record we find does not overlap the
> 		 * start key, then there is a hole at the start of
> 		 * the search range before the overlap was found.
> 		 * Hence we can classify this as a sparse overlap
> 		 * and we don't need to search any further.
> 		 */

Added.

> > +	} else {
> > +		/* Bail if there's a gap with the previous record. */
> > +		if (cur->bc_ops->has_key_gap(cur, &info->high_key, &rec_key))
> > +			return -ECANCELED;
> 
> Ah, yeah, there's the high key -> low key implementation assumption.

Yes.

> > +	}
> > +
> > +	/* If the current record is higher than what we've seen, remember it. */
> > +	cur->bc_ops->init_high_key_from_rec(&rec_high_key, rec);
> > +	res = cur->bc_ops->diff_two_keys(cur, &rec_high_key, &info->high_key);
> > +	if (res > 0)
> > +		info->high_key = rec_high_key; /* struct copy */
> > +
> > +	return 0;
> >  }
> >  
> > -/* Is there a record covering a given range of keys? */
> > +/*
> > + * Scan part of the keyspace of a btree and tell us if the area has no records,
> > + * is fully mapped by records, or is partially filled.
> > + */
> >  int
> > -xfs_btree_has_record(
> > +xfs_btree_scan_keyfill(
> >  	struct xfs_btree_cur		*cur,
> >  	const union xfs_btree_irec	*low,
> >  	const union xfs_btree_irec	*high,
> > -	bool				*exists)
> > +	enum xfs_btree_keyfill		*outcome)
> >  {
> > +	struct xfs_btree_scan_keyfill	info = {
> > +		.outcome		= XFS_BTREE_KEYFILL_EMPTY,
> > +	};
> > +	union xfs_btree_rec		rec;
> > +	int64_t				res;
> >  	int				error;
> >  
> > +	if (!cur->bc_ops->has_key_gap)
> > +		return -EOPNOTSUPP;
> > +
> > +	cur->bc_rec = *low;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&info.start_key, &rec);
> > +
> > +	cur->bc_rec = *high;
> > +	cur->bc_ops->init_rec_from_cur(cur, &rec);
> > +	cur->bc_ops->init_key_from_rec(&info.end_key, &rec);
> 
> Didn't a previous patch I just create helpers for these?
> 
> Oh.... patches in the series were threaded in the wrong order...

Yeah.  I'll rearrange these.

> 
> > +
> >  	error = xfs_btree_query_range(cur, low, high,
> > -			&xfs_btree_has_record_helper, NULL);
> > -	if (error == -ECANCELED) {
> > -		*exists = true;
> > -		return 0;
> > -	}
> > -	*exists = false;
> > -	return error;
> > +			xfs_btree_scan_keyfill_helper, &info);
> > +	if (error == -ECANCELED)
> > +		goto out;
> > +	if (error)
> > +		return error;
> > +
> > +	if (info.outcome == XFS_BTREE_KEYFILL_EMPTY)
> > +		goto out;
> > +
> > +	/* Did the record set go at least as far as the end? */
> > +	res = cur->bc_ops->diff_two_keys(cur, &info.high_key, &info.end_key);
> > +	if (res >= 0)
> > +		info.outcome = XFS_BTREE_KEYFILL_FULL;
> 
> Hmmm. I'm wondering if we should have helpers for these sorts of
> key comparisons.
> 
> static bool
> xfs_btree_keycmp_lt(
> 	struct xfs_btree_cur	*cur,
> 	struct xfs_btree_key	*key1,
> 	struct xfs_btree_key	*key2)
> {
> 	return cur->bc_ops->diff_two_keys(cur, key1, key2) < 0;
> }
> 
> static bool
> xfs_btree_keycmp_gt(
> 	struct xfs_btree_cur	*cur,
> 	struct xfs_btree_key	*key1,
> 	struct xfs_btree_key	*key2)
> {
> 	return cur->bc_ops->diff_two_keys(cur, key1, key2) > 0;
> }
> 
> static bool
> xfs_btree_keycmp_ge(
> 	struct xfs_btree_cur	*cur,
> 	struct xfs_btree_key	*key1,
> 	struct xfs_btree_key	*key2)
> {
> 	return cur->bc_ops->diff_two_keys(cur, key1, key2) >= 0;
> }
> 
> Which then makes the code read a whole lot nicer:
> 
> 	/* Did the record set go at least as far as the end? */
> 	if (xfs_btree_keycmp_ge(cur, &info.high_key, &info.end_key))
> 		info.outcome = XFS_BTREE_KEYFILL_FULL;
> ...
> 
> Not necessary for this patch, but I note there are a few places
> where these sorts of key range/ordering checks are open coded...

Yeah.  Every time I squint at "< 0" "> 0" and have to remember what all
that means.  I'll clean that one up too.

> > +
> > +out:
> > +	*outcome = info.outcome;
> > +	return 0;
> >  }
> >  
> >  /* Are there more records in this btree? */
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index eef27858a013..58a05f0d1f1b 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -157,6 +157,11 @@ struct xfs_btree_ops {
> >  	int	(*recs_inorder)(struct xfs_btree_cur *cur,
> >  				const union xfs_btree_rec *r1,
> >  				const union xfs_btree_rec *r2);
> > +
> > +	/* decide if there's a gap in the keyspace between two keys */
> > +	bool	(*has_key_gap)(struct xfs_btree_cur *cur,
> > +			       const union xfs_btree_key *key1,
> > +			       const union xfs_btree_key *key2);
> 
> Having read through it this far, this looks like it is checking for
> two discrete keys form a contiguous range? Perhaps that's a better
> name than "gap", because "contiguous" means different things for
> different keys. e.g. extents vs inode records.

bc_ops->keys_contiguous()?  Sounds good to me.

> 
> 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 8c83e265770c..fd48b95b4f4e 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -380,6 +380,18 @@ xfs_inobt_recs_inorder(
> >  		be32_to_cpu(r2->inobt.ir_startino);
> >  }
> >  
> > +STATIC bool
> > +xfs_inobt_has_key_gap(
> > +	struct xfs_btree_cur		*cur,
> > +	const union xfs_btree_key	*key1,
> > +	const union xfs_btree_key	*key2)
> > +{
> > +	xfs_agino_t			next;
> > +
> > +	next = be32_to_cpu(key1->inobt.ir_startino) + XFS_INODES_PER_CHUNK;
> > +	return next != be32_to_cpu(key2->inobt.ir_startino);
> > +}
> 
> Huh. Is that correct? The high key for an inode chunk is:
> 
> STATIC void                                                                      
> xfs_inobt_init_high_key_from_rec(                                                
>         union xfs_btree_key             *key,                                    
>         const union xfs_btree_rec       *rec)                                    
> {                                                                                
>         __u32                           x;                                       
>                                                                                  
>         x = be32_to_cpu(rec->inobt.ir_startino);                                 
>         x += XFS_INODES_PER_CHUNK - 1;                                           
>         key->inobt.ir_startino = cpu_to_be32(x);                                 
> }                                                                                
> 
> Which means highkey->ir_startino (end range pointer) points to
> low_key->ir_startino + 63 (start range pointer + inodes in chunk)
> 
> Hence if this "gap" API is supposed to be passed {high_key,
> low_key}, then xfs_inobt_has_key_gap() is checking
> (low_key->ir_startino + 127) against next_low_key->ir_startino...

Oops, I committed the correct code into the wrong patch.  Some times I
really hate stgit.  This has gotten better recently now that I figured
out how to dump the branch and patch name into PS1.

> > +STATIC bool
> > +xfs_refcountbt_has_key_gap(
> > +	struct xfs_btree_cur		*cur,
> > +	const union xfs_btree_key	*key1,
> > +	const union xfs_btree_key	*key2)
> > +{
> > +	xfs_agblock_t			next;
> > +
> > +	next = be32_to_cpu(key1->refc.rc_startblock) + 1;
> > +	return next != be32_to_cpu(key2->refc.rc_startblock);
> > +}
> 
> ... and this matches the BMBT code (as does the rmapbt code), which seems to
> assume a high key (end extent pointer) is being passed as key1, and key2 is
> a low key (start extent pointer).
> 
> Am I completely misunderstanding what the key gap API uses for
> low_key and high_key? I am completely confused now...

You've understood btree keyspace sparseness scanning correctly.
My apologies for making it harder than it had to be, and thanks for
wading through all this.

--D

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

     prev parent reply	other threads:[~2022-11-03  1:56 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-02 18:20 [PATCHSET v23.1 0/5] xfs: detect incorrect gaps in refcount btree Darrick J. Wong
2022-10-02 18:20 ` [PATCH 5/5] xfs: ensure that all metadata and data blocks are not cow staging extents Darrick J. Wong
2022-11-02  0:29   ` Dave Chinner
2022-10-02 18:20 ` [PATCH 2/5] xfs: refactor converting btree irec to btree key Darrick J. Wong
2022-11-02  0:31   ` Dave Chinner
2022-10-02 18:20 ` [PATCH 3/5] xfs: mask key comparisons for keyspace fill scans Darrick J. Wong
2022-11-02  2:16   ` Dave Chinner
2022-11-03  1:08     ` Darrick J. Wong
2022-11-03 21:12       ` Darrick J. Wong
2022-10-02 18:20 ` [PATCH 4/5] xfs: check the reference counts of gaps in the refcount btree Darrick J. Wong
2022-11-02  2:20   ` Dave Chinner
2022-10-02 18:20 ` [PATCH 1/5] xfs: replace xfs_btree_has_record with a general keyspace scanner Darrick J. Wong
2022-11-02  1:47   ` Dave Chinner
2022-11-03  1:56     ` Darrick J. Wong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y2MfvER9CbKgbbyd@magnolia \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.