linux-xfs.vger.kernel.org archive mirror
* [PATCH] xfs: speed up directory bestfree block scanning
@ 2018-01-03  6:27 Dave Chinner
  2018-01-03 11:59 ` Dave Chinner
  2018-01-03 13:28 ` Brian Foster
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2018-01-03  6:27 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When running a "create millions inodes in a directory" test
recently, I noticed we were spending a huge amount of time
converting freespace block headers from disk format to in-memory
format:

 31.47%  [kernel]  [k] xfs_dir2_node_addname
 17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
  3.55%  [kernel]  [k] xfs_dir3_free_bests_p

We shouldn't be hitting the best free block scanning code so hard
when doing sequential directory creates, and it turns out there's
a highly suboptimal loop searching the best free array in
the freespace block - it decodes the block header before checking
each entry inside a loop, instead of decoding the header once before
running the entry search loop.

This makes a massive difference to create rates. Profile now looks
like this:

  13.15%  [kernel]  [k] xfs_dir2_node_addname
   3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
   3.11%  [kernel]  [k] xfs_log_commit_cil

And the wall time/average file create rate differences are
just as stark:

		create time(sec) / rate (files/s)
File count	     vanilla		    patched
  10k		   0.54 / 18.5k		   0.53 / 18.9k
  20k		   1.10	/ 18.1k		   1.05 / 19.0k
 100k		   4.21	/ 23.8k		   3.91 / 25.6k
 200k		   9.66	/ 20.7k		   7.37 / 27.1k
   1M		  86.61	/ 11.5k		  48.26 / 20.7k
   2M		 206.13	/  9.7k		 129.71 / 15.4k

The larger the directory, the bigger the performance improvement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_dir2_node.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index 682e2bf370c7..bcf0d43cd6a8 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -1829,24 +1829,24 @@ xfs_dir2_node_addname_int(
 		 */
 		bests = dp->d_ops->free_bests_p(free);
 		dp->d_ops->free_hdr_from_disk(&freehdr, free);
-		if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
-		    be16_to_cpu(bests[findex]) >= length)
-			dbno = freehdr.firstdb + findex;
-		else {
-			/*
-			 * Are we done with the freeblock?
-			 */
-			if (++findex == freehdr.nvalid) {
-				/*
-				 * Drop the block.
-				 */
-				xfs_trans_brelse(tp, fbp);
-				fbp = NULL;
-				if (fblk && fblk->bp)
-					fblk->bp = NULL;
+		do {
+
+			if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
+			    be16_to_cpu(bests[findex]) >= length) {
+				dbno = freehdr.firstdb + findex;
+				break;
 			}
+		} while (++findex < freehdr.nvalid);
+
+		/* Drop the block if we done with the freeblock */
+		if (findex == freehdr.nvalid) {
+			xfs_trans_brelse(tp, fbp);
+			fbp = NULL;
+			if (fblk)
+				fblk->bp = NULL;
 		}
 	}
+
 	/*
 	 * If we don't have a data block, we need to allocate one and make
 	 * the freespace entries refer to it.
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03  6:27 [PATCH] xfs: speed up directory bestfree block scanning Dave Chinner
@ 2018-01-03 11:59 ` Dave Chinner
  2018-01-03 13:41   ` Brian Foster
  2018-01-03 13:28 ` Brian Foster
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2018-01-03 11:59 UTC (permalink / raw)
  To: linux-xfs

On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When running a "create millions inodes in a directory" test
> recently, I noticed we were spending a huge amount of time
> converting freespace block headers from disk format to in-memory
> format:
> 
>  31.47%  [kernel]  [k] xfs_dir2_node_addname
>  17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
>   3.55%  [kernel]  [k] xfs_dir3_free_bests_p
> 
> We shouldn't be hitting the best free block scanning code so hard
> when doing sequential directory creates, and it turns out there's
> a highly suboptimal loop searching the best free array in
> the freespace block - it decodes the block header before checking
> each entry inside a loop, instead of decoding the header once before
> running the entry search loop.
> 
> This makes a massive difference to create rates. Profile now looks
> like this:
> 
>   13.15%  [kernel]  [k] xfs_dir2_node_addname
>    3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
>    3.11%  [kernel]  [k] xfs_log_commit_cil
> 
> And the wall time/average file create rate differences are
> just as stark:
> 
> 		create time(sec) / rate (files/s)
> File count	     vanilla		    patched
>   10k		   0.54 / 18.5k		   0.53 / 18.9k
>   20k		   1.10	/ 18.1k		   1.05 / 19.0k
>  100k		   4.21	/ 23.8k		   3.91 / 25.6k
>  200k		   9.66	/ 20.7k		   7.37 / 27.1k
>    1M		  86.61	/ 11.5k		  48.26 / 20.7k
>    2M		 206.13	/  9.7k		 129.71 / 15.4k
> 
> The larger the directory, the bigger the performance improvement.

FWIW, while this improves the free block scanning code, it still
doesn't really fix the overall problem with the free space index
blocks. The problem there is that the free space index is
essentially a linear array, with one index per allocated data
block in the directory. This array is scanned linearly on every
insert to find the first block with enough free space to hold the
new entry. That, unfortunately, is problematic when new blocks are
always added at the tail of the list. Hence we have to scan
the entire free space index before we find the free space the last
insert added.

That's nasty - the result is that by the time we have 5 million
entries in the directory we have something like 60,000 4k blocks in
the filesystem and only the last one has free space. Hence we're
searching 60,000 entries just to find where to insert the next
entry. The result is this:

		    create time(sec)
File count	 vanilla        patched
 200k		   9.66		   7.37
   1M		  86.61		  48.26
   2M		 206.13		 129.71
  10M		2843.57		1817.39

It goes highly non-linear above 1M entries in the directory.

In writing this, I think I can see a quick and simple change that
will fix this case and improve most other directory grow workloads
without affecting normal random directory insert/remove performance.
That is, do a reverse order search starting at the last block rather
than increasing order search starting at the first block.....
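
In outline (the real loop is messier - full patch attached below),
the search changes from something like:

	/* forward: oldest blocks first, so a growing dir scans everything */
	for (fbno = firstfbno; fbno < lastfbno; fbno++)
		/* scan the free entries in fbno */

to:

	/* reverse: newest blocks first, where a growing dir just added space */
	for (fbno = lastfbno; fbno >= firstfbno; fbno--)
		/* scan the free entries in fbno */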

Ok, now we're talking - performance and scalability improvements!

		create time(sec) / rate (files/s)
 File count     vanilla		    loop-fix		+reverse
   10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
   20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
  100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
  200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
    1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
    2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
   10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k

There's still some non-linearity as we approach the 10M number, but
it's still 5x faster to 10M inodes than the existing code....

Reverse search patch - working but not finished, smoke tested only -
is attached below. Discuss!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: reverse search directory freespace indexes

From: Dave Chinner <dchinner@redhat.com>

When a directory is growing rapidly, new blocks tend to get added at
the end of the directory. These end up at the end of the freespace
index, and when the directory gets large, finding these new
freespaces gets expensive. The code does a linear search across the
freespace index from the first block in the directory to the last,
meaning the newly added space is the last index searched.

Instead, do a reverse order index search, starting from the last
block and index in the freespace index. This makes most lookups for
free space on rapidly growing directories O(1) instead of O(N).
The result is a major improvement in large directory grow rates:

                create time(sec) / rate (files/s)
 File count     vanilla             patched
   10k        0.54 / 18.5k         0.52 / 19.3k
   20k        1.10 / 18.1k         1.00 / 20.0k
  100k        4.21 / 23.8k         3.58 / 27.9k
  200k        9.66 / 20.7k         7.08 / 28.3k
    1M       86.61 / 11.5k        38.33 / 26.1k
    2M      206.13 /  9.7k        82.20 / 24.3k
   10M     2843.57 /  3.5k       591.78 / 16.9k

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_dir2_node.c | 91 ++++++++++++++++++-------------------------
 1 file changed, 38 insertions(+), 53 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
index bcf0d43cd6a8..d95229a9d700 100644
--- a/fs/xfs/libxfs/xfs_dir2_node.c
+++ b/fs/xfs/libxfs/xfs_dir2_node.c
@@ -1701,10 +1701,11 @@ xfs_dir2_node_addname_int(
 	int			error;		/* error return value */
 	xfs_dir2_db_t		fbno;		/* freespace block number */
 	struct xfs_buf		*fbp;		/* freespace buffer */
-	int			findex;		/* freespace entry index */
-	xfs_dir2_free_t		*free=NULL;	/* freespace block structure */
+	int			findex = -1;	/* freespace entry index */
+	xfs_dir2_free_t		*free = NULL;	/* freespace block structure */
 	xfs_dir2_db_t		ifbno;		/* initial freespace block no */
-	xfs_dir2_db_t		lastfbno=0;	/* highest freespace block no */
+	xfs_dir2_db_t		lastfbno = 0;	/* highest freespace block no */
+	xfs_dir2_db_t		firstfbno = 0;	/* lowest freespace block no */
 	int			length;		/* length of the new entry */
 	int			logfree;	/* need to log free entry */
 	xfs_mount_t		*mp;		/* filesystem mount point */
@@ -1715,11 +1716,17 @@ xfs_dir2_node_addname_int(
 	__be16			*bests;
 	struct xfs_dir3_icfree_hdr freehdr;
 	struct xfs_dir2_data_free *bf;
+	xfs_fileoff_t		fo;		/* freespace block number */
 
 	dp = args->dp;
 	mp = dp->i_mount;
 	tp = args->trans;
 	length = dp->d_ops->data_entsize(args->namelen);
+
+	ifbno = -1;
+	dbno = -1;
+	fbp = NULL;
+
 	/*
 	 * If we came in with a freespace block that means that lookup
 	 * found an entry with our hash value.  This is the freespace
@@ -1746,62 +1753,38 @@ xfs_dir2_node_addname_int(
 			ASSERT(be16_to_cpu(bests[findex]) != NULLDATAOFF);
 			ASSERT(be16_to_cpu(bests[findex]) >= length);
 			dbno = freehdr.firstdb + findex;
-		} else {
-			/*
-			 * The data block looked at didn't have enough room.
-			 * We'll start at the beginning of the freespace entries.
-			 */
-			dbno = -1;
-			findex = 0;
+			goto block_found;
 		}
-	} else {
-		/*
-		 * Didn't come in with a freespace block, so no data block.
-		 */
-		ifbno = dbno = -1;
-		fbp = NULL;
-		findex = 0;
 	}
 
 	/*
-	 * If we don't have a data block yet, we're going to scan the
+	 * If we don't have a usable data block yet, we're going to scan the
 	 * freespace blocks looking for one.  Figure out what the
-	 * highest freespace block number is.
+	 * highest freespace block number is as we'll start scanning there.
 	 */
-	if (dbno == -1) {
-		xfs_fileoff_t	fo;		/* freespace block number */
+	error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK);
+	if (error)
+		return error;
+	lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo);
+	firstfbno = xfs_dir2_byte_to_db(args->geo, XFS_DIR2_FREE_OFFSET);
 
-		if ((error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK)))
-			return error;
-		lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo);
-		fbno = ifbno;
-	}
 	/*
 	 * While we haven't identified a data block, search the freeblock
-	 * data for a good data block.  If we find a null freeblock entry,
-	 * indicating a hole in the data blocks, remember that.
+	 * data for a good data block. Do a reverse order search, as growing
+	 * directories will put new blocks with free space at the end of the
+	 * free space index.
 	 */
-	while (dbno == -1) {
-		/*
-		 * If we don't have a freeblock in hand, get the next one.
-		 */
-		if (fbp == NULL) {
-			/*
-			 * Happens the first time through unless lookup gave
-			 * us a freespace block to start with.
-			 */
-			if (++fbno == 0)
-				fbno = xfs_dir2_byte_to_db(args->geo,
-							XFS_DIR2_FREE_OFFSET);
-			/*
-			 * If it's ifbno we already looked at it.
-			 */
+	for (fbno = lastfbno; fbno >= firstfbno; fbno--) {
+
+		/* If we don't have a freeblock in hand, get the previous one. */
+		if (!fbp) {
+			/* If it's ifbno we already looked at it. */
 			if (fbno == ifbno)
-				fbno++;
+				fbno--;
 			/*
 			 * If it's off the end we're done.
 			 */
-			if (fbno >= lastfbno)
+			if (fbno < firstfbno)
 				break;
 			/*
 			 * Read the block.  There can be holes in the
@@ -1817,8 +1800,8 @@ xfs_dir2_node_addname_int(
 			if (!fbp)
 				continue;
 			free = fbp->b_addr;
-			findex = 0;
 		}
+
 		/*
 		 * Look at the current free entry.  Is it good enough?
 		 *
@@ -1829,17 +1812,16 @@ xfs_dir2_node_addname_int(
 		 */
 		bests = dp->d_ops->free_bests_p(free);
 		dp->d_ops->free_hdr_from_disk(&freehdr, free);
-		do {
-
+		for (findex = freehdr.nvalid - 1; findex >= 0; findex--) {
 			if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
 			    be16_to_cpu(bests[findex]) >= length) {
 				dbno = freehdr.firstdb + findex;
-				break;
+				goto block_found;
 			}
-		} while (++findex < freehdr.nvalid);
+		}
 
 		/* Drop the block if we done with the freeblock */
-		if (findex == freehdr.nvalid) {
+		if (findex < 0) {
 			xfs_trans_brelse(tp, fbp);
 			fbp = NULL;
 			if (fblk)
@@ -1847,11 +1829,13 @@ xfs_dir2_node_addname_int(
 		}
 	}
 
+	ASSERT(dbno == -1);
+
 	/*
-	 * If we don't have a data block, we need to allocate one and make
+	 * We don't have a data block so we need to allocate one and make
 	 * the freespace entries refer to it.
 	 */
-	if (unlikely(dbno == -1)) {
+	if (dbno == -1) {
 		/*
 		 * Not allowed to allocate, return failure.
 		 */
@@ -1981,6 +1965,7 @@ xfs_dir2_node_addname_int(
 	 * We had a data block so we don't have to make a new one.
 	 */
 	else {
+block_found:
 		/*
 		 * If just checking, we succeeded.
 		 */

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03  6:27 [PATCH] xfs: speed up directory bestfree block scanning Dave Chinner
  2018-01-03 11:59 ` Dave Chinner
@ 2018-01-03 13:28 ` Brian Foster
  2018-01-03 20:38   ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Brian Foster @ 2018-01-03 13:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When running a "create millions inodes in a directory" test
> recently, I noticed we were spending a huge amount of time
> converting freespace block headers from disk format to in-memory
> format:
> 
>  31.47%  [kernel]  [k] xfs_dir2_node_addname
>  17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
>   3.55%  [kernel]  [k] xfs_dir3_free_bests_p
> 
> We shouldn't be hitting the best free block scanning code so hard
> when doing sequential directory creates, and it turns out there's
> a highly suboptimal loop searching the best free array in
> the freespace block - it decodes the block header before checking
> each entry inside a loop, instead of decoding the header once before
> running the entry search loop.
> 
> This makes a massive difference to create rates. Profile now looks
> like this:
> 
>   13.15%  [kernel]  [k] xfs_dir2_node_addname
>    3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
>    3.11%  [kernel]  [k] xfs_log_commit_cil
> 
> And the wall time/average file create rate differences are
> just as stark:
> 
> 		create time(sec) / rate (files/s)
> File count	     vanilla		    patched
>   10k		   0.54 / 18.5k		   0.53 / 18.9k
>   20k		   1.10	/ 18.1k		   1.05 / 19.0k
>  100k		   4.21	/ 23.8k		   3.91 / 25.6k
>  200k		   9.66	/ 20.7k		   7.37 / 27.1k
>    1M		  86.61	/ 11.5k		  48.26 / 20.7k
>    2M		 206.13	/  9.7k		 129.71 / 15.4k
> 
> The larger the directory, the bigger the performance improvement.
> 

Interesting..

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_dir2_node.c | 30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
> index 682e2bf370c7..bcf0d43cd6a8 100644
> --- a/fs/xfs/libxfs/xfs_dir2_node.c
> +++ b/fs/xfs/libxfs/xfs_dir2_node.c
> @@ -1829,24 +1829,24 @@ xfs_dir2_node_addname_int(
>  		 */
>  		bests = dp->d_ops->free_bests_p(free);
>  		dp->d_ops->free_hdr_from_disk(&freehdr, free);
> -		if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
> -		    be16_to_cpu(bests[findex]) >= length)
> -			dbno = freehdr.firstdb + findex;
> -		else {
> -			/*
> -			 * Are we done with the freeblock?
> -			 */
> -			if (++findex == freehdr.nvalid) {
> -				/*
> -				 * Drop the block.
> -				 */
> -				xfs_trans_brelse(tp, fbp);
> -				fbp = NULL;
> -				if (fblk && fblk->bp)
> -					fblk->bp = NULL;

Ok, so we're adding a dir entry to a node dir and walking the free space
blocks to see if we have space somewhere to insert the entry without
growing the dir. The current code reads the free block, converts the
header, checks bests[findex], then bumps findex or invalidates the free
block if we're done with it.

The updated code reads the free block, converts the header, iterates the
free index range then invalidates the block when complete (assuming we
don't find suitable free space). The end result is that we don't convert
the block header over and over for each index in the individual block.
Seems reasonable to me, just a couple nits...

> +		do {
> +

Extra space above.

> +			if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
> +			    be16_to_cpu(bests[findex]) >= length) {
> +				dbno = freehdr.firstdb + findex;
> +				break;
>  			}
> +		} while (++findex < freehdr.nvalid);
> +
> +		/* Drop the block if we done with the freeblock */

"... if we're done ..."

Also FWIW, according to the comment it looks like the only reason the
freehdr conversion is elevated to this scope is to accommodate gcc
foolishness. If so, I'm wondering if a simple NULL init of bests at the
top of the function would avoid that problem and allow us to move the
code to where it was apparently intended to be in the first place. Hm?

Brian

> +		if (findex == freehdr.nvalid) {
> +			xfs_trans_brelse(tp, fbp);
> +			fbp = NULL;
> +			if (fblk)
> +				fblk->bp = NULL;
>  		}
>  	}
> +
>  	/*
>  	 * If we don't have a data block, we need to allocate one and make
>  	 * the freespace entries refer to it.
> -- 
> 2.15.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03 11:59 ` Dave Chinner
@ 2018-01-03 13:41   ` Brian Foster
  2018-01-03 21:00     ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2018-01-03 13:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When running a "create millions inodes in a directory" test
> > recently, I noticed we were spending a huge amount of time
> > converting freespace block headers from disk format to in-memory
> > format:
> > 
> >  31.47%  [kernel]  [k] xfs_dir2_node_addname
> >  17.86%  [kernel]  [k] xfs_dir3_free_hdr_from_disk
> >   3.55%  [kernel]  [k] xfs_dir3_free_bests_p
> > 
> > We shouldn't be hitting the best free block scanning code so hard
> > when doing sequential directory creates, and it turns out there's
> > a highly suboptimal loop searching the best free array in
> > the freespace block - it decodes the block header before checking
> > each entry inside a loop, instead of decoding the header once before
> > running the entry search loop.
> > 
> > This makes a massive difference to create rates. Profile now looks
> > like this:
> > 
> >   13.15%  [kernel]  [k] xfs_dir2_node_addname
> >    3.52%  [kernel]  [k] xfs_dir3_leaf_check_int
> >    3.11%  [kernel]  [k] xfs_log_commit_cil
> > 
> > And the wall time/average file create rate differences are
> > just as stark:
> > 
> > 		create time(sec) / rate (files/s)
> > File count	     vanilla		    patched
> >   10k		   0.54 / 18.5k		   0.53 / 18.9k
> >   20k		   1.10	/ 18.1k		   1.05 / 19.0k
> >  100k		   4.21	/ 23.8k		   3.91 / 25.6k
> >  200k		   9.66	/ 20.7k		   7.37 / 27.1k
> >    1M		  86.61	/ 11.5k		  48.26 / 20.7k
> >    2M		 206.13	/  9.7k		 129.71 / 15.4k
> > 
> > The larger the directory, the bigger the performance improvement.
> 
> FWIW, while this improves the free block scanning code, it still
> doesn't really fix the overall problem with the free space index
> blocks. The problem there is that the free space index is
> essentially a linear array, with one index per allocated data
> block in the directory. This array is scanned linearly on every
> insert to find the first block with enough free space to hold the
> new entry. That, unfortunately, is problematic when new blocks are
> always added at the tail of the list. Hence we have to scan
> the entire free space index before we find the free space the last
> insert added.
> 
> That's nasty - the result is that by the time we have 5 million
> entries in the directory we have something like 60,000 4k blocks in
> the filesystem and only the last one has free space. Hence we're
> searching 60,000 entries just to find where to insert the next
> entry. The result is this:
> 
> 		    create time(sec)
> File count	 vanilla        patched
>  200k		   9.66		   7.37
>    1M		  86.61		  48.26
>    2M		 206.13		 129.71
>   10M		2843.57		1817.39
> 
> It goes highly non-linear above 1M entries in the directory.
> 
> In writing this, I think I can see a quick and simple change that
> will fix this case and improve most other directory grow workloads
> without affecting normal random directory insert/remove performance.
> That is, do a reverse order search starting at the last block rather
> than increasing order search starting at the first block.....
> 
> Ok, now we're talking - performance and scalability improvements!
> 
> 		create time(sec) / rate (files/s)
>  File count     vanilla		    loop-fix		+reverse
>    10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
>    20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
>   100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
>   200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
>     1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
>     2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
>    10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k
> 
> There's still some non-linearity as we approach the 10M number, but
> it's still 5x faster to 10M inodes than the existing code....
> 

Nice improvement.. I still need to look at the code, but a quick first
thought is that I wonder if there's somewhere we could stash a 'most
recent freeblock' once we have to grow the directory, even if just as an
in-core hint. Then we could jump straight to the latest block regardless
of the workload.

Hmm, thinking a little more about it, that may not be worth the
complication since part of the concept of "search failure" in this case
is tied to the size of the entry we want to add. Then again, I suppose
such is the case when searching forward/backward as well (i.e., one
large insert fails, grows inode, subsequent small insert may very well
have succeeded with the first freeblock, though now we'd always start at
the recently allocated block at the end).

Anyways, just thinking out loud (and recovering from several weeks
vacation). :P

Brian

> Reverse search patch - working but not finished, smoke tested only -
> is attached below. Discuss!
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: reverse search directory freespace indexes
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a directory is growing rapidly, new blocks tend to get added at
> the end of the directory. These end up at the end of the freespace
> index, and when the directory gets large, finding these new
> freespaces gets expensive. The code does a linear search across the
> freespace index from the first block in the directory to the last,
> meaning the newly added space is the last index searched.
> 
> Instead, do a reverse order index search, starting from the last
> block and index in the freespace index. This makes most lookups for
> free space on rapidly growing directories O(1) instead of O(N).
> The result is a major improvement in large directory grow rates:
> 
>                 create time(sec) / rate (files/s)
>  File count     vanilla             patched
>    10k        0.54 / 18.5k         0.52 / 19.3k
>    20k        1.10 / 18.1k         1.00 / 20.0k
>   100k        4.21 / 23.8k         3.58 / 27.9k
>   200k        9.66 / 20.7k         7.08 / 28.3k
>     1M       86.61 / 11.5k        38.33 / 26.1k
>     2M      206.13 /  9.7k        82.20 / 24.3k
>    10M     2843.57 /  3.5k       591.78 / 16.9k
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_dir2_node.c | 91 ++++++++++++++++++-------------------------
>  1 file changed, 38 insertions(+), 53 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c
> index bcf0d43cd6a8..d95229a9d700 100644
> --- a/fs/xfs/libxfs/xfs_dir2_node.c
> +++ b/fs/xfs/libxfs/xfs_dir2_node.c
> @@ -1701,10 +1701,11 @@ xfs_dir2_node_addname_int(
>  	int			error;		/* error return value */
>  	xfs_dir2_db_t		fbno;		/* freespace block number */
>  	struct xfs_buf		*fbp;		/* freespace buffer */
> -	int			findex;		/* freespace entry index */
> -	xfs_dir2_free_t		*free=NULL;	/* freespace block structure */
> +	int			findex = -1;	/* freespace entry index */
> +	xfs_dir2_free_t		*free = NULL;	/* freespace block structure */
>  	xfs_dir2_db_t		ifbno;		/* initial freespace block no */
> -	xfs_dir2_db_t		lastfbno=0;	/* highest freespace block no */
> +	xfs_dir2_db_t		lastfbno = 0;	/* highest freespace block no */
> +	xfs_dir2_db_t		firstfbno = 0;	/* lowest freespace block no */
>  	int			length;		/* length of the new entry */
>  	int			logfree;	/* need to log free entry */
>  	xfs_mount_t		*mp;		/* filesystem mount point */
> @@ -1715,11 +1716,17 @@ xfs_dir2_node_addname_int(
>  	__be16			*bests;
>  	struct xfs_dir3_icfree_hdr freehdr;
>  	struct xfs_dir2_data_free *bf;
> +	xfs_fileoff_t		fo;		/* freespace block number */
>  
>  	dp = args->dp;
>  	mp = dp->i_mount;
>  	tp = args->trans;
>  	length = dp->d_ops->data_entsize(args->namelen);
> +
> +	ifbno = -1;
> +	dbno = -1;
> +	fbp = NULL;
> +
>  	/*
>  	 * If we came in with a freespace block that means that lookup
>  	 * found an entry with our hash value.  This is the freespace
> @@ -1746,62 +1753,38 @@ xfs_dir2_node_addname_int(
>  			ASSERT(be16_to_cpu(bests[findex]) != NULLDATAOFF);
>  			ASSERT(be16_to_cpu(bests[findex]) >= length);
>  			dbno = freehdr.firstdb + findex;
> -		} else {
> -			/*
> -			 * The data block looked at didn't have enough room.
> -			 * We'll start at the beginning of the freespace entries.
> -			 */
> -			dbno = -1;
> -			findex = 0;
> +			goto block_found;
>  		}
> -	} else {
> -		/*
> -		 * Didn't come in with a freespace block, so no data block.
> -		 */
> -		ifbno = dbno = -1;
> -		fbp = NULL;
> -		findex = 0;
>  	}
>  
>  	/*
> -	 * If we don't have a data block yet, we're going to scan the
> +	 * If we don't have a usable data block yet, we're going to scan the
>  	 * freespace blocks looking for one.  Figure out what the
> -	 * highest freespace block number is.
> +	 * highest freespace block number is as we'll start scanning there.
>  	 */
> -	if (dbno == -1) {
> -		xfs_fileoff_t	fo;		/* freespace block number */
> +	error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK);
> +	if (error)
> +		return error;
> +	lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo);
> +	firstfbno = xfs_dir2_byte_to_db(args->geo, XFS_DIR2_FREE_OFFSET);
>  
> -		if ((error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK)))
> -			return error;
> -		lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo);
> -		fbno = ifbno;
> -	}
>  	/*
>  	 * While we haven't identified a data block, search the freeblock
> -	 * data for a good data block.  If we find a null freeblock entry,
> -	 * indicating a hole in the data blocks, remember that.
> +	 * data for a good data block. Do a reverse order search, as growing
> +	 * directories will put new blocks with free space at the end of the
> +	 * free space index.
>  	 */
> -	while (dbno == -1) {
> -		/*
> -		 * If we don't have a freeblock in hand, get the next one.
> -		 */
> -		if (fbp == NULL) {
> -			/*
> -			 * Happens the first time through unless lookup gave
> -			 * us a freespace block to start with.
> -			 */
> -			if (++fbno == 0)
> -				fbno = xfs_dir2_byte_to_db(args->geo,
> -							XFS_DIR2_FREE_OFFSET);
> -			/*
> -			 * If it's ifbno we already looked at it.
> -			 */
> +	for (fbno = lastfbno; fbno >= firstfbno; fbno--) {
> +
> +		/* If we don't have a freeblock in hand, get the previous one. */
> +		if (!fbp) {
> +			/* If it's ifbno we already looked at it. */
>  			if (fbno == ifbno)
> -				fbno++;
> +				fbno--;
>  			/*
>  			 * If it's off the end we're done.
>  			 */
> -			if (fbno >= lastfbno)
> +			if (fbno < firstfbno)
>  				break;
>  			/*
>  			 * Read the block.  There can be holes in the
> @@ -1817,8 +1800,8 @@ xfs_dir2_node_addname_int(
>  			if (!fbp)
>  				continue;
>  			free = fbp->b_addr;
> -			findex = 0;
>  		}
> +
>  		/*
>  		 * Look at the current free entry.  Is it good enough?
>  		 *
> @@ -1829,17 +1812,16 @@ xfs_dir2_node_addname_int(
>  		 */
>  		bests = dp->d_ops->free_bests_p(free);
>  		dp->d_ops->free_hdr_from_disk(&freehdr, free);
> -		do {
> -
> +		for (findex = freehdr.nvalid - 1; findex >= 0; findex--) {
>  			if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
>  			    be16_to_cpu(bests[findex]) >= length) {
>  				dbno = freehdr.firstdb + findex;
> -				break;
> +				goto block_found;
>  			}
> -		} while (++findex < freehdr.nvalid);
> +		}
>  
>  		/* Drop the block if we done with the freeblock */
> -		if (findex == freehdr.nvalid) {
> +		if (findex < 0) {
>  			xfs_trans_brelse(tp, fbp);
>  			fbp = NULL;
>  			if (fblk)
> @@ -1847,11 +1829,13 @@ xfs_dir2_node_addname_int(
>  		}
>  	}
>  
> +	ASSERT(dbno == -1);
> +
>  	/*
> -	 * If we don't have a data block, we need to allocate one and make
> +	 * We don't have a data block so we need to allocate one and make
>  	 * the freespace entries refer to it.
>  	 */
> -	if (unlikely(dbno == -1)) {
> +	if (dbno == -1) {
>  		/*
>  		 * Not allowed to allocate, return failure.
>  		 */
> @@ -1981,6 +1965,7 @@ xfs_dir2_node_addname_int(
>  	 * We had a data block so we don't have to make a new one.
>  	 */
>  	else {
> +block_found:
>  		/*
>  		 * If just checking, we succeeded.
>  		 */

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03 13:28 ` Brian Foster
@ 2018-01-03 20:38   ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2018-01-03 20:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jan 03, 2018 at 08:28:03AM -0500, Brian Foster wrote:
> On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote:
> > +			if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
> > +			    be16_to_cpu(bests[findex]) >= length) {
> > +				dbno = freehdr.firstdb + findex;
> > +				break;
> >  			}
> > +		} while (++findex < freehdr.nvalid);
> > +
> > +		/* Drop the block if we done with the freeblock */
> 
> "... if we're done ..."
> 
> Also FWIW, according to the comment it looks like the only reason the
> freehdr conversion is elevated to this scope is to accommodate gcc
> foolishness. If so, I'm wondering if a simple NULL init of bests at the
> top of the function would avoid that problem and allow us to move the
> code to where it was apparently intended to be in the first place. Hm?

Yeah, looking at the follow-on patch, there's a gigantic amount of
cleanup needed in this function. There's a bunch of "gcc is so
stupid" hacks amongst the code because the function is too long
for gcc to correctly determine variable usage.

I might sit down and factor it properly because that will make it a
whole lot simpler and easier to understand...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03 13:41   ` Brian Foster
@ 2018-01-03 21:00     ` Dave Chinner
  2018-01-04 14:04       ` Brian Foster
  0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2018-01-03 21:00 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote:
> On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> > In writing this, I think I can see a quick and simple change that
> > will fix this case and improve most other directory grow workloads
> > without affecting normal random directory insert/remove performance.
> > That is, do a reverse order search starting at the last block rather
> > than increasing order search starting at the first block.....
> > 
> > Ok, now we're talking - performance and scalability improvements!
> > 
> > 		create time(sec) / rate (files/s)
> >  File count     vanilla		    loop-fix		+reverse
> >    10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
> >    20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
> >   100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
> >   200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
> >     1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
> >     2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
> >    10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k
> > 
> > There's still some non-linearity as we approach the 10M number, but
> > it's still 5x faster to 10M inodes than the existing code....
> > 
> 
> Nice improvement.. I still need to look at the code, but a quick first
> thought is that I wonder if there's somewhere we could stash a 'most
> recent freeblock' once we have to grow the directory, even if just as an
> in-core hint. Then we could jump straight to the latest block regardless
> of the workload.

I thought about that, and then wondered where to stash it, then
wondered whether it would miss smaller, better fitting blocks, and
then finally realised we didn't need to have cross-operation state
to solve the common case of growing directories.

> Hmm, thinking a little more about it, that may not be worth the
> complication since part of the concept of "search failure" in this case
> is tied to the size of the entry we want to add. Then again, I suppose
> such is the case when searching forward/backward as well (i.e., one
> large insert fails, grows inode, subsequent small insert may very well
> have succeeded with the first freeblock, though now we'd always start at
> the recently allocated block at the end).

Right. Random hole filling in the directory shouldn't be greatly
affected by forward or reverse search order - the eventual search
distances are all the same. It does, OTOH, matter greatly for
sequential inserts...

After sleeping on it, I suspect that there's a simple on-disk mod to
the dir3 header that will improve the search function for all
workloads. The dir3 free block header:

struct xfs_dir3_free_hdr {
        struct xfs_dir3_blk_hdr hdr;
        __be32                  firstdb;        /* db of first entry */
        __be32                  nvalid;         /* count of valid entries */
        __be32                  nused;          /* count of used entries */
        __be32                  pad;            /* 64 bit alignment */
};

has 32 bits of padding in it, and the most entries a free block can
have is just under 2^15. Hence we can turn that into a "bestfree"
entry that tracks the largest freespace indexed by the block.
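
Something like this, say (hypothetical layout - the field name and
the way the pad is split are illustrative only):

struct xfs_dir3_free_hdr {
        struct xfs_dir3_blk_hdr hdr;
        __be32                  firstdb;        /* db of first entry */
        __be32                  nvalid;         /* count of valid entries */
        __be32                  nused;          /* count of used entries */
        __be16                  bestfree;       /* largest freespace indexed
                                                   by this block */
        __be16                  pad;            /* 64 bit alignment */
};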

Then the free block scan can start by checking the required length
against the largest freespace in the bestfree entry and skip the
block search altogether if there isn't an indexed block with enough
free space inside the freespace index block we are searching.
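
As an untested sketch, assuming the hypothetical bestfree field above
is also carried in the in-core struct xfs_dir3_icfree_hdr, the scan
loop could do:

		free = fbp->b_addr;
		dp->d_ops->free_hdr_from_disk(&freehdr, free);

		/* nothing indexed by this block can fit the new entry */
		if (freehdr.bestfree < length) {
			xfs_trans_brelse(tp, fbp);
			fbp = NULL;
			continue;
		}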

That's a lot more work than just reversing the search order, but I
think it's a mod we should (eventually) make because it is an
improvement for all insert workloads, not just growing.

The other (far more complex) option is to turn the freespace index
into a btree, like we do with the hash indexes. Not sure we need to
spend that much effort on this right now, though.

> Anyways, just thinking out loud (and recovering from several weeks
> vacation). :P

Welcome back :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-03 21:00     ` Dave Chinner
@ 2018-01-04 14:04       ` Brian Foster
  2018-01-04 21:52         ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Brian Foster @ 2018-01-04 14:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote:
> On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote:
> > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> > > In writing this, I think I can see a quick and simple change that
> > > will fix this case and improve most other directory grow workloads
> > > without affecting normal random directory insert/remove performance.
> > > That is, do a reverse order search starting at the last block rather
> > > than increasing order search starting at the first block.....
> > > 
> > > Ok, now we're talking - performance and scalability improvements!
> > > 
> > > 		create time(sec) / rate (files/s)
> > >  File count     vanilla		    loop-fix		+reverse
> > >    10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
> > >    20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
> > >   100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
> > >   200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
> > >     1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
> > >     2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
> > >    10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k
> > > 
> > > There's still some non-linearity as we approach the 10M number, but
> > > it's still 5x faster to 10M inodes than the existing code....
> > > 
> > 
> > Nice improvement.. I still need to look at the code, but a quick first
> > thought is that I wonder if there's somewhere we could stash a 'most
> > recent freeblock' once we have to grow the directory, even if just as an
> > in-core hint. Then we could jump straight to the latest block regardless
> > of the workload.
> 
> I thought about that, and then wondered where to stash it, then
> wondered whether it would miss smaller, better fitting blocks, and
> then finally realised we didn't need to have cross-operation state
> to solve the common case of growing directories.
> 

Agreed with regard to the growing directories scenario. The thought was
more around first: avoiding any potential negative effects for other
workloads, and second: perhaps constructing a more generally applicable
optimization (cases beyond purely sequential directory growth).

> > Hmm, thinking a little more about it, that may not be worth the
> > complication since part of the concept of "search failure" in this case
> > is tied to the size of the entry we want to add. Then again, I suppose
> > such is the case when searching forward/backward as well (i.e., one
> > large insert fails, grows inode, subsequent small insert may very well
> > have succeeded with the first freeblock, though now we'd always start at
> > the recently allocated block at the end).
> 
> Right. Random hole filling in the directory shouldn't be greatly
> affected by forward or reverse search order - the eventual search
> distances are all the same. It does, OTOH, matter greatly for
> sequential inserts...
> 

Yeah, I think that's a reasonable argument for simply swapping the
search order (i.e., addresses the first concern above).

> After sleeping on it, I suspect that there's a simple on-disk mod to
> the dir3 header that will improve the search function for all
> workloads. The dir3 free block header:
> 
> struct xfs_dir3_free_hdr {
>         struct xfs_dir3_blk_hdr hdr;
>         __be32                  firstdb;        /* db of first entry */
>         __be32                  nvalid;         /* count of valid entries */
>         __be32                  nused;          /* count of used entries */
>         __be32                  pad;            /* 64 bit alignment */
> };
> 
> has 32 bits of padding in it, and the most entries a free block can
> have is just under 2^15. Hence we can turn that into a "bestfree"
> entry that tracks the largest freespace indexed by the block.
> 
> Then the free block scan can start by checking the required length
> against the largest freespace in the bestfree entry and skip the
> block search altogether if there isn't an indexed block with enough
> free space inside the freespace index block we are searching.
> 

So with this we'd still need to walk the range of free blocks, right?
I.e., we'd just be able to shortcut individual free block processing
(bests[] iteration) for those that obviously don't satisfy the request.

Assuming I follow that correctly, that sounds reasonable from the
perspective that it certainly eliminates work. I suppose how worthwhile
it is probably depends on how much of an effect it has on the higher
level directory workload. IOW, is performance measurably improved by
skipping individual free block processing or is that cost mostly
amortized by having to read/walk all free blocks in the first place?

It may very well be a worthwhile optimization. I think the larger point
is that more isolated testing is probably required to confirm. The
improvement from this patch alone doesn't necessarily translate to an
answer one way or another because this patch creates conditions that
allow us to skip most of the scan altogether (by jumping right to the
most recently added block).

> That's a lot more work than just reversing the search order, but I
> think it's a mod we should (eventually) make because it is an
> improvement for all insert workloads, not just growing.
> 
> The other (far more complex) option is to turn the freespace index
> into a btree, like we do with the hash indexes. Not sure we need to
> spend that much effort on this right now, though.
> 

*nod*

Brian

> > Anyways, just thinking out loud (and recovering from several weeks
> > vacation). :P
> 
> Welcome back :)
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] xfs: speed up directory bestfree block scanning
  2018-01-04 14:04       ` Brian Foster
@ 2018-01-04 21:52         ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2018-01-04 21:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Jan 04, 2018 at 09:04:53AM -0500, Brian Foster wrote:
> On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote:
> > On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote:
> > > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote:
> > > > In writing this, I think I can see a quick and simple change that
> > > > will fix this case and improve most other directory grow workloads
> > > > without affecting normal random directory insert/remove performance.
> > > > That is, do a reverse order search starting at the last block rather
> > > > than increasing order search starting at the first block.....
> > > > 
> > > > Ok, now we're talking - performance and scalability improvements!
> > > > 
> > > > 		create time(sec) / rate (files/s)
> > > >  File count     vanilla		    loop-fix		+reverse
> > > >    10k	      0.54 / 18.5k	   0.53 / 18.9k	       0.52 / 19.3k
> > > >    20k	      1.10 / 18.1k	   1.05 / 19.0k	       1.00 / 20.0k
> > > >   100k	      4.21 / 23.8k	   3.91 / 25.6k	       3.58 / 27.9k
> > > >   200k	      9.66 / 20.7k	   7.37 / 27.1k	       7.08 / 28.3k
> > > >     1M	     86.61 / 11.5k	  48.26 / 20.7k	      38.33 / 26.1k
> > > >     2M	    206.13 /  9.7k	 129.71 / 15.4k	      82.20 / 24.3k
> > > >    10M	   2843.57 /  3.5k	1817.39 /  5.5k      591.78 / 16.9k
> > > > 
> > > > There's still some non-linearity as we approach the 10M number, but
> > > > it's still 5x faster to 10M inodes than the existing code....
> > > > 
> > > 
> > > Nice improvement.. I still need to look at the code, but a quick first
> > > thought is that I wonder if there's somewhere we could stash a 'most
> > > recent freeblock' once we have to grow the directory, even if just as an
> > > in-core hint. Then we could jump straight to the latest block regardless
> > > of the workload.

.....

> > After sleeping on it, I suspect that there's a simple on-disk mod to
> > the dir3 header that will improve the search function for all
> > workloads. The dir3 free block header:
> > 
> > struct xfs_dir3_free_hdr {
> >         struct xfs_dir3_blk_hdr hdr;
> >         __be32                  firstdb;        /* db of first entry */
> >         __be32                  nvalid;         /* count of valid entries */
> >         __be32                  nused;          /* count of used entries */
> >         __be32                  pad;            /* 64 bit alignment */
> > };
> > 
> > has 32 bits of padding in it, and the most entries a free block can
> > have is just under 2^15. Hence we can turn that into a "bestfree"
> > entry that tracks the largest freespace indexed by the block.
> > 
> > Then the free block scan can start by checking the required length
> > against the largest freespace in the bestfree entry and skip the
> > block search altogether if there isn't an indexed block with enough
> > free space inside the freespace index block we are searching.
> 
> So with this we'd still need to walk the range of free blocks, right?
> I.e., we'd just be able to shortcut individual free block processing
> (bests[] iteration) for those that obviously don't satisfy the request.

Right. In the case of the 5M entry directory I was looking at, there
were a total of 33 free index blocks. All the overhead was in
scanning the 2016 entries in each block, not in iterating over the
blocks. So if we can skip the entry scanning in a given block, then
we have a general search improvement, i.e. it makes the free block
index structure more like a skip list than a linear array.

Also, if we keep in the header the index of the largest free space
within the free index block, we can jump straight to it without
needing to scan. Then when we modify the free index during the
insert, we can do the "best free" scan at that point, similar to how
we keep other bestfree indexes up to date.
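
Keeping it up to date might look something like this (sketch only -
bestfree and bestfreeidx are the hypothetical new header fields, the
latter taking the other 16 bits of the old pad):

	int	i;

	/* after bests[findex] has been modified on insert/remove */
	if (be16_to_cpu(bests[findex]) != NULLDATAOFF &&
	    be16_to_cpu(bests[findex]) > freehdr.bestfree) {
		/* entry grew past the old best - just remember it */
		freehdr.bestfree = be16_to_cpu(bests[findex]);
		freehdr.bestfreeidx = findex;
	} else if (findex == freehdr.bestfreeidx) {
		/* the old best shrank or went away - rescan for a new one */
		freehdr.bestfree = 0;
		for (i = 0; i < freehdr.nvalid; i++) {
			if (be16_to_cpu(bests[i]) == NULLDATAOFF)
				continue;
			if (be16_to_cpu(bests[i]) > freehdr.bestfree) {
				freehdr.bestfree = be16_to_cpu(bests[i]);
				freehdr.bestfreeidx = i;
			}
		}
	}
	dp->d_ops->free_hdr_to_disk(free, &freehdr);

The rescan in the shrink case is the same O(nvalid) walk we're trying
to avoid on lookup, but it only runs when the largest freespace gets
consumed, not on every insert.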

> Assuming I follow that correctly, that sounds reasonable from the
> perspective that it certainly eliminates work. I suppose how worthwhile
> it is probably depends on how much of an effect it has on the higher
> level directory workload. IOW, is performance measurably improved by
> skipping individual free block processing or is that cost mostly
> amortized by having to read/walk all free blocks in the first place?

According to the profile, the CPU cost of the read/walk of the free
index blocks was within the noise floor. It didn't stand out as a
major contributor, but I'd need to re-measure the overhead on a
random insert/delete workload. However, my gut feel is that if we
combined reverse order searching (for sequential insert) with block
skipping (for random insert) we'll get substantial improvements
across all insert operations on large directories...

> It may very well be a worthwhile optimization. I think the larger point
> is that more isolated testing is probably required to confirm. The
> improvement from this patch alone doesn't necessarily translate to an
> answer one way or another because this patch creates conditions that
> allow us to skip most of the scan altogether (by jumping right to the
> most recently added block).

Yeah, lots of validation work, but given the overhead for a
non-trivial search distance I was measuring (>40% of CPU time @10M
inodes) I think it's worth pursuing.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-01-04 21:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-03  6:27 [PATCH] xfs: speed up directory bestfree block scanning Dave Chinner
2018-01-03 11:59 ` Dave Chinner
2018-01-03 13:41   ` Brian Foster
2018-01-03 21:00     ` Dave Chinner
2018-01-04 14:04       ` Brian Foster
2018-01-04 21:52         ` Dave Chinner
2018-01-03 13:28 ` Brian Foster
2018-01-03 20:38   ` Dave Chinner
