* [PATCH] xfs: speed up directory bestfree block scanning @ 2018-01-03 6:27 Dave Chinner 2018-01-03 11:59 ` Dave Chinner 2018-01-03 13:28 ` Brian Foster 0 siblings, 2 replies; 8+ messages in thread From: Dave Chinner @ 2018-01-03 6:27 UTC (permalink / raw) To: linux-xfs From: Dave Chinner <dchinner@redhat.com> When running a "create millions inodes in a directory" test recently, I noticed we were spending a huge amount of time converting freespace block headers from disk format to in-memory format: 31.47% [kernel] [k] xfs_dir2_node_addname 17.86% [kernel] [k] xfs_dir3_free_hdr_from_disk 3.55% [kernel] [k] xfs_dir3_free_bests_p We shouldn't be hitting the best free block scanning code so hard when doing sequential directory creates, and it turns out there's a highly suboptimal loop searching the best free array in the freespace block - it decodes the block header before checking each entry inside a loop, instead of decoding the header once before running the entry search loop. Fixing this makes a massive difference to create rates. Profile now looks like this: 13.15% [kernel] [k] xfs_dir2_node_addname 3.52% [kernel] [k] xfs_dir3_leaf_check_int 3.11% [kernel] [k] xfs_log_commit_cil And the wall time/average file create rate differences are just as stark: create time(sec) / rate (files/s) File count vanilla patched 10k 0.54 / 18.5k 0.53 / 18.9k 20k 1.10 / 18.1k 1.05 / 19.0k 100k 4.21 / 23.8k 3.91 / 25.6k 200k 9.66 / 20.7k 7.37 / 27.1k 1M 86.61 / 11.5k 48.26 / 20.7k 2M 206.13 / 9.7k 129.71 / 15.4k The larger the directory, the bigger the performance improvement. Signed-Off-By: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_dir2_node.c | 30 +++++++++++++++--------------- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index 682e2bf370c7..bcf0d43cd6a8 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -1829,24 +1829,24 @@ xfs_dir2_node_addname_int( */ bests = dp->d_ops->free_bests_p(free); dp->d_ops->free_hdr_from_disk(&freehdr, free); - if (be16_to_cpu(bests[findex]) != NULLDATAOFF && - be16_to_cpu(bests[findex]) >= length) - dbno = freehdr.firstdb + findex; - else { - /* - * Are we done with the freeblock? - */ - if (++findex == freehdr.nvalid) { - /* - * Drop the block. - */ - xfs_trans_brelse(tp, fbp); - fbp = NULL; - if (fblk && fblk->bp) - fblk->bp = NULL; + do { + + if (be16_to_cpu(bests[findex]) != NULLDATAOFF && + be16_to_cpu(bests[findex]) >= length) { + dbno = freehdr.firstdb + findex; + break; } + } while (++findex < freehdr.nvalid); + + /* Drop the block if we done with the freeblock */ + if (findex == freehdr.nvalid) { + xfs_trans_brelse(tp, fbp); + fbp = NULL; + if (fblk) + fblk->bp = NULL; } } + /* * If we don't have a data block, we need to allocate one and make * the freespace entries refer to it. -- 2.15.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 6:27 [PATCH] xfs: speed up directory bestfree block scanning Dave Chinner @ 2018-01-03 11:59 ` Dave Chinner 2018-01-03 13:41 ` Brian Foster 2018-01-03 13:28 ` Brian Foster 1 sibling, 1 reply; 8+ messages in thread From: Dave Chinner @ 2018-01-03 11:59 UTC (permalink / raw) To: linux-xfs On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When running a "create millions inodes in a directory" test > recently, I noticed we were spending a huge amount of time > converting freespace block headers from disk format to in-memory > format: > > 31.47% [kernel] [k] xfs_dir2_node_addname > 17.86% [kernel] [k] xfs_dir3_free_hdr_from_disk > 3.55% [kernel] [k] xfs_dir3_free_bests_p > > We shouldn't be hitting the best free block scanning code so hard > when doing sequential directory creates, and it turns out there's > a highly suboptimal loop searching the the best free array in > the freespace block - it decodes the block header before checking > each entry inside a loop, instead of decoding the header once before > running the entry search loop. > > This makes a massive difference to create rates. Profile now looks > like this: > > 13.15% [kernel] [k] xfs_dir2_node_addname > 3.52% [kernel] [k] xfs_dir3_leaf_check_int > 3.11% [kernel] [k] xfs_log_commit_cil > > And the wall time/average file create rate differences are > just as stark: > > create time(sec) / rate (files/s) > File count vanilla patched > 10k 0.54 / 18.5k 0.53 / 18.9k > 20k 1.10 / 18.1k 1.05 / 19.0k > 100k 4.21 / 23.8k 3.91 / 25.6k > 200k 9.66 / 20,7k 7.37 / 27.1k > 1M 86.61 / 11.5k 48.26 / 20.7k > 2M 206.13 / 9.7k 129.71 / 15.4k > > The larger the directory, the bigger the performance improvement. FWIW, while this improves the free block scanning code, it still doesn't really fix the overall problem with the free space index blocks. The problem there is that the free space index is essentially a linear array, with one index per allocated data block in the directory. This array is scanned linearly on every insert to find the first block with enough free space to hold the new entry. That, unfortunately, is problematic when new blocks are always added at the tail of the list. Hence we have to scan the entire free space index before we find the free space the last insert added. That's nasty - the result is that by the time we have 5 million entries in the directory we have something like 60,000 4k blocks in the filesystem and only the last one has free space. Hence we're searching 60,000 entries just to find where to insert the next entry. The result is this: create time(sec) File count vanilla patched 200k 9.66 7.37 1M 86.61 48.26 2M 206.13 129.71 10M 2843.57 1817.39 It goes highly non-linear above 1M entries in the directory. In writing this, I think I can see a quick and simple change that will fix this case and improve most other directory grow workloads without affecting normal random directory insert/remove performance. That is, do a reverse order search starting at the last block rather than increasing order search starting at the first block..... Ok, now we're talking - performance and scalability improvements! 
create time(sec) / rate (files/s) File count vanilla loop-fix +reverse 10k 0.54 / 18.5k 0.53 / 18.9k 0.52 / 19.3k 20k 1.10 / 18.1k 1.05 / 19.0k 1.00 / 20.0k 100k 4.21 / 23.8k 3.91 / 25.6k 3.58 / 27.9k 200k 9.66 / 20.7k 7.37 / 27.1k 7.08 / 28.3k 1M 86.61 / 11.5k 48.26 / 20.7k 38.33 / 26.1k 2M 206.13 / 9.7k 129.71 / 15.4k 82.20 / 24.3k 10M 2843.57 / 3.5k 1817.39 / 5.5k 591.78 / 16.9k There's still some non-linearity as we approach the 10M number, but it's still 5x faster to 10M inodes than the existing code.... Reverse search patch - working but not finished, smoke tested only - is attached below. Discuss! Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: reverse search directory freespace indexes From: Dave Chinner <dchinner@redhat.com> When a directory is growing rapidly, new blocks tend to get added at the end of the directory. These end up at the end of the freespace index, and when the directory gets large finding these new freespaces gets expensive. The code does a linear search across the freespace index from the first block in the directory to the last, hence meaning the newly added space is the last index searched. Instead, do a reverse order index search, starting from the last block and index in the freespace index. This makes most lookups for free space on rapidly growing directories O(1) instead of O(N). The result is a major improvement in large directory grow rates: create time(sec) / rate (files/s) File count vanilla patched 10k 0.54 / 18.5k 0.52 / 19.3k 20k 1.10 / 18.1k 1.00 / 20.0k 100k 4.21 / 23.8k 3.58 / 27.9k 200k 9.66 / 20.7k 7.08 / 28.3k 1M 86.61 / 11.5k 38.33 / 26.1k 2M 206.13 / 9.7k 82.20 / 24.3k 10M 2843.57 / 3.5k 591.78 / 16.9k Signed-Off-By: Dave Chinner <dchinner@redhat.com> --- fs/xfs/libxfs/xfs_dir2_node.c | 91 ++++++++++++++++++------------------------- 1 file changed, 38 insertions(+), 53 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index bcf0d43cd6a8..d95229a9d700 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -1701,10 +1701,11 @@ xfs_dir2_node_addname_int( int error; /* error return value */ xfs_dir2_db_t fbno; /* freespace block number */ struct xfs_buf *fbp; /* freespace buffer */ - int findex; /* freespace entry index */ - xfs_dir2_free_t *free=NULL; /* freespace block structure */ + int findex = -1; /* freespace entry index */ + xfs_dir2_free_t *free = NULL; /* freespace block structure */ xfs_dir2_db_t ifbno; /* initial freespace block no */ - xfs_dir2_db_t lastfbno=0; /* highest freespace block no */ + xfs_dir2_db_t lastfbno = 0; /* highest freespace block no */ + xfs_dir2_db_t firstfbno = 0; /* lowest freespace block no */ int length; /* length of the new entry */ int logfree; /* need to log free entry */ xfs_mount_t *mp; /* filesystem mount point */ @@ -1715,11 +1716,17 @@ xfs_dir2_node_addname_int( __be16 *bests; struct xfs_dir3_icfree_hdr freehdr; struct xfs_dir2_data_free *bf; + xfs_fileoff_t fo; /* freespace block number */ dp = args->dp; mp = dp->i_mount; tp = args->trans; length = dp->d_ops->data_entsize(args->namelen); + + ifbno = -1; + dbno = -1; + fbp = NULL; + /* * If we came in with a freespace block that means that lookup * found an entry with our hash value. This is the freespace @@ -1746,62 +1753,38 @@ xfs_dir2_node_addname_int( ASSERT(be16_to_cpu(bests[findex]) != NULLDATAOFF); ASSERT(be16_to_cpu(bests[findex]) >= length); dbno = freehdr.firstdb + findex; - } else { - /* - * The data block looked at didn't have enough room. 
- * We'll start at the beginning of the freespace entries. - */ - dbno = -1; - findex = 0; + goto block_found; } - } else { - /* - * Didn't come in with a freespace block, so no data block. - */ - ifbno = dbno = -1; - fbp = NULL; - findex = 0; } /* - * If we don't have a data block yet, we're going to scan the + * If we don't have a usable data block yet, we're going to scan the * freespace blocks looking for one. Figure out what the - * highest freespace block number is. + * highest freespace block number is as we'll start scanning there. */ - if (dbno == -1) { - xfs_fileoff_t fo; /* freespace block number */ + error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK); + if (error) + return error; + lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo); + firstfbno = xfs_dir2_byte_to_db(args->geo, XFS_DIR2_FREE_OFFSET); - if ((error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK))) - return error; - lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo); - fbno = ifbno; - } /* * While we haven't identified a data block, search the freeblock - * data for a good data block. If we find a null freeblock entry, - * indicating a hole in the data blocks, remember that. + * data for a good data block. Do a reverse order search, as growing + * directories will put new blocks with free space at the end of the + * free space index. */ - while (dbno == -1) { - /* - * If we don't have a freeblock in hand, get the next one. - */ - if (fbp == NULL) { - /* - * Happens the first time through unless lookup gave - * us a freespace block to start with. - */ - if (++fbno == 0) - fbno = xfs_dir2_byte_to_db(args->geo, - XFS_DIR2_FREE_OFFSET); - /* - * If it's ifbno we already looked at it. - */ + for (fbno = lastfbno; fbno >= firstfbno; fbno--) { + + /* If we don't have a freeblock in hand, get the previous one. */ + if (!fbp) { + /* If it's ifbno we already looked at it. */ if (fbno == ifbno) - fbno++; + fbno--; /* * If it's off the end we're done. */ - if (fbno >= lastfbno) + if (fbno < firstfbno) break; /* * Read the block. There can be holes in the @@ -1817,8 +1800,8 @@ xfs_dir2_node_addname_int( if (!fbp) continue; free = fbp->b_addr; - findex = 0; } + /* * Look at the current free entry. Is it good enough? * @@ -1829,17 +1812,16 @@ xfs_dir2_node_addname_int( */ bests = dp->d_ops->free_bests_p(free); dp->d_ops->free_hdr_from_disk(&freehdr, free); - do { - + for (findex = freehdr.nvalid - 1; findex >= 0; findex--){ if (be16_to_cpu(bests[findex]) != NULLDATAOFF && be16_to_cpu(bests[findex]) >= length) { dbno = freehdr.firstdb + findex; - break; + goto block_found; } - } while (++findex < freehdr.nvalid); + } /* Drop the block if we done with the freeblock */ - if (findex == freehdr.nvalid) { + if (findex < 0) { xfs_trans_brelse(tp, fbp); fbp = NULL; if (fblk) @@ -1847,11 +1829,13 @@ xfs_dir2_node_addname_int( } } + ASSERT(dbno == -1); + /* - * If we don't have a data block, we need to allocate one and make + * We don't have a data block so we need to allocate one and make * the freespace entries refer to it. */ - if (unlikely(dbno == -1)) { + if (dbno == -1) { /* * Not allowed to allocate, return failure. */ @@ -1981,6 +1965,7 @@ xfs_dir2_node_addname_int( * We had a data block so we don't have to make a new one. */ else { +block_found: /* * If just checking, we succeeded. */ ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 11:59 ` Dave Chinner @ 2018-01-03 13:41 ` Brian Foster 2018-01-03 21:00 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Brian Foster @ 2018-01-03 13:41 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote: > On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > When running a "create millions inodes in a directory" test > > recently, I noticed we were spending a huge amount of time > > converting freespace block headers from disk format to in-memory > > format: > > > > 31.47% [kernel] [k] xfs_dir2_node_addname > > 17.86% [kernel] [k] xfs_dir3_free_hdr_from_disk > > 3.55% [kernel] [k] xfs_dir3_free_bests_p > > > > We shouldn't be hitting the best free block scanning code so hard > > when doing sequential directory creates, and it turns out there's > > a highly suboptimal loop searching the the best free array in > > the freespace block - it decodes the block header before checking > > each entry inside a loop, instead of decoding the header once before > > running the entry search loop. > > > > This makes a massive difference to create rates. Profile now looks > > like this: > > > > 13.15% [kernel] [k] xfs_dir2_node_addname > > 3.52% [kernel] [k] xfs_dir3_leaf_check_int > > 3.11% [kernel] [k] xfs_log_commit_cil > > > > And the wall time/average file create rate differences are > > just as stark: > > > > create time(sec) / rate (files/s) > > File count vanilla patched > > 10k 0.54 / 18.5k 0.53 / 18.9k > > 20k 1.10 / 18.1k 1.05 / 19.0k > > 100k 4.21 / 23.8k 3.91 / 25.6k > > 200k 9.66 / 20,7k 7.37 / 27.1k > > 1M 86.61 / 11.5k 48.26 / 20.7k > > 2M 206.13 / 9.7k 129.71 / 15.4k > > > > The larger the directory, the bigger the performance improvement. > > FWIW, while this improves the free block scanning code, it still > doesn't really fix the overall problem with the free space index > blocks. The problem there is that the free space index is > essentially a linear array, with one index per allocated data > block in the directory. This array is scanned linearly on every > insert to find the first block with enough free space to hold the > new entry. That, unfortunately, is problematic when new blocks are > always added at the tail of the list. hence we have to scan the > the entire free space index before we find the free space the last > insert added. > > That's nasty - the result is that by the time we have 5 million > entries in the directory we have something like 60,000 4k blocks in > the filesystem and only the last one has free space. Hence we're > searching 60,000 entries just to find where to insert the next > entry. The result is this: > > create time(sec) > File count vanilla patched > 200k 9.66 7.37 > 1M 86.61 48.26 > 2M 206.13 129.71 > 10M 2843.57 1817.39 > > it goes highly non-linear at above 1M entries in the directory. > > In writing this, I think I can see a quick and simple change that > will fix this case and improve most other directory grow workloads > without affecting normal random directory insert/remove performance. > That is, do a reverse order search starting at the last block rather > than increasing order search starting at the first block..... > > Ok, now were are talking - performance and scalability improvements! 
> > create time(sec) / rate (files/s) > File count vanilla loop-fix +reverse > 10k 0.54 / 18.5k 0.53 / 18.9k 0.52 / 19.3k > 20k 1.10 / 18.1k 1.05 / 19.0k 1.00 / 20.0k > 100k 4.21 / 23.8k 3.91 / 25.6k 3.58 / 27.9k > 200k 9.66 / 20,7k 7.37 / 27.1k 7.08 / 28.3k > 1M 86.61 / 11.5k 48.26 / 20.7k 38.33 / 26.1k > 2M 206.13 / 9.7k 129.71 / 15.4k 82.20 / 24.3k > 10M 2843.57 / 3.5k 1817.39 / 5.5k 591.78 / 16.9k > > Theres still some non-linearity as we approach the 10M number, but > it's still 5x faster to 10M inodes than the existing code.... > Nice improvement.. I still need to look at the code, but a quick first thought is that I wonder if there's somewhere we could stash a 'most recent freeblock' once we have to grow the directory, even if just as an in-core hint. Then we could jump straight to the latest block regardless of the workload. Hmm, thinking a little more about it, that may not be worth the complication since part of the concept of "search failure" in this case is tied to the size of the entry we want to add. Then again, I suppose such is the case when searching forward/backward as well (i.e., one large insert fails, grows inode, subsequent small insert may very well have succeeded with the first freeblock, though now we'd always start at the recently allocated block at the end). Anyways, just thinking out loud (and recovering from several weeks vacation). :P Brian > Reverse search patch - working but not finished, smoke tested only - > is attached below. Discuss! > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > xfs: reverse search directory freespace indexes > > From: Dave Chinner <dchinner@redhat.com> > > When a directory is growing rapidly, new blocks tend to get added at > the end of teh directory. These end up at teh end of teh freespace > index, and when teh directory gets large finding these new > freespaces gets expensive. The code does a linear search across the > frespace index from the first block in teh directory to teh last, > hence meaning the newly added space is the last index searched. > > Instead, do a reverse order index search, starting from the last > block and index in the freespace index. This makes most lookups for > free space on rapidly growing directories O(1) instead of O(N). 
> The result is a major improvement in large directory grow rates: > > create time(sec) / rate (files/s) > File count vanilla patched > 10k 0.54 / 18.5k 0.52 / 19.3k > 20k 1.10 / 18.1k 1.00 / 20.0k > 100k 4.21 / 23.8k 3.58 / 27.9k > 200k 9.66 / 20,7k 7.08 / 28.3k > 1M 86.61 / 11.5k 38.33 / 26.1k > 2M 206.13 / 9.7k 82.20 / 24.3k > 10M 2843.57 / 3.5k 591.78 / 16.9k > > Signed-Off-By: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/libxfs/xfs_dir2_node.c | 91 ++++++++++++++++++------------------------- > 1 file changed, 38 insertions(+), 53 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c > index bcf0d43cd6a8..d95229a9d700 100644 > --- a/fs/xfs/libxfs/xfs_dir2_node.c > +++ b/fs/xfs/libxfs/xfs_dir2_node.c > @@ -1701,10 +1701,11 @@ xfs_dir2_node_addname_int( > int error; /* error return value */ > xfs_dir2_db_t fbno; /* freespace block number */ > struct xfs_buf *fbp; /* freespace buffer */ > - int findex; /* freespace entry index */ > - xfs_dir2_free_t *free=NULL; /* freespace block structure */ > + int findex = -1; /* freespace entry index */ > + xfs_dir2_free_t *free = NULL; /* freespace block structure */ > xfs_dir2_db_t ifbno; /* initial freespace block no */ > - xfs_dir2_db_t lastfbno=0; /* highest freespace block no */ > + xfs_dir2_db_t lastfbno = 0; /* highest freespace block no */ > + xfs_dir2_db_t firstfbno = 0; /* lowest freespace block no */ > int length; /* length of the new entry */ > int logfree; /* need to log free entry */ > xfs_mount_t *mp; /* filesystem mount point */ > @@ -1715,11 +1716,17 @@ xfs_dir2_node_addname_int( > __be16 *bests; > struct xfs_dir3_icfree_hdr freehdr; > struct xfs_dir2_data_free *bf; > + xfs_fileoff_t fo; /* freespace block number */ > > dp = args->dp; > mp = dp->i_mount; > tp = args->trans; > length = dp->d_ops->data_entsize(args->namelen); > + > + ifbno = -1; > + dbno = -1; > + fbp = NULL; > + > /* > * If we came in with a freespace block that means that lookup > * found an entry with our hash value. This is the freespace > @@ -1746,62 +1753,38 @@ xfs_dir2_node_addname_int( > ASSERT(be16_to_cpu(bests[findex]) != NULLDATAOFF); > ASSERT(be16_to_cpu(bests[findex]) >= length); > dbno = freehdr.firstdb + findex; > - } else { > - /* > - * The data block looked at didn't have enough room. > - * We'll start at the beginning of the freespace entries. > - */ > - dbno = -1; > - findex = 0; > + goto block_found; > } > - } else { > - /* > - * Didn't come in with a freespace block, so no data block. > - */ > - ifbno = dbno = -1; > - fbp = NULL; > - findex = 0; > } > > /* > - * If we don't have a data block yet, we're going to scan the > + * If we don't have a usable data block yet, we're going to scan the > * freespace blocks looking for one. Figure out what the > - * highest freespace block number is. > + * highest freespace block number is as we'll start scanning there. > */ > - if (dbno == -1) { > - xfs_fileoff_t fo; /* freespace block number */ > + error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK); > + if (error) > + return error; > + lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo); > + firstfbno = xfs_dir2_byte_to_db(args->geo, XFS_DIR2_FREE_OFFSET); > > - if ((error = xfs_bmap_last_offset(dp, &fo, XFS_DATA_FORK))) > - return error; > - lastfbno = xfs_dir2_da_to_db(args->geo, (xfs_dablk_t)fo); > - fbno = ifbno; > - } > /* > * While we haven't identified a data block, search the freeblock > - * data for a good data block. 
If we find a null freeblock entry, > - * indicating a hole in the data blocks, remember that. > + * data for a good data block. Do a reverse order search, as growing > + * directories will put new blocks with free space at the end of the > + * free space index. > */ > - while (dbno == -1) { > - /* > - * If we don't have a freeblock in hand, get the next one. > - */ > - if (fbp == NULL) { > - /* > - * Happens the first time through unless lookup gave > - * us a freespace block to start with. > - */ > - if (++fbno == 0) > - fbno = xfs_dir2_byte_to_db(args->geo, > - XFS_DIR2_FREE_OFFSET); > - /* > - * If it's ifbno we already looked at it. > - */ > + for (fbno = lastfbno; fbno >= firstfbno; fbno--) { > + > + /* If we don't have a freeblock in hand, get the previous one. */ > + if (!fbp) { > + /* If it's ifbno we already looked at it. */ > if (fbno == ifbno) > - fbno++; > + fbno--; > /* > * If it's off the end we're done. > */ > - if (fbno >= lastfbno) > + if (fbno < firstfbno) > break; > /* > * Read the block. There can be holes in the > @@ -1817,8 +1800,8 @@ xfs_dir2_node_addname_int( > if (!fbp) > continue; > free = fbp->b_addr; > - findex = 0; > } > + > /* > * Look at the current free entry. Is it good enough? > * > @@ -1829,17 +1812,16 @@ xfs_dir2_node_addname_int( > */ > bests = dp->d_ops->free_bests_p(free); > dp->d_ops->free_hdr_from_disk(&freehdr, free); > - do { > - > + for (findex = freehdr.nvalid - 1; findex >= 0; findex--){ > if (be16_to_cpu(bests[findex]) != NULLDATAOFF && > be16_to_cpu(bests[findex]) >= length) { > dbno = freehdr.firstdb + findex; > - break; > + goto block_found; > } > - } while (++findex < freehdr.nvalid); > + } > > /* Drop the block if we done with the freeblock */ > - if (findex == freehdr.nvalid) { > + if (findex < 0) { > xfs_trans_brelse(tp, fbp); > fbp = NULL; > if (fblk) > @@ -1847,11 +1829,13 @@ xfs_dir2_node_addname_int( > } > } > > + ASSERT(dbno == -1); > + > /* > - * If we don't have a data block, we need to allocate one and make > + * We don't have a data block so we need to allocate one and make > * the freespace entries refer to it. > */ > - if (unlikely(dbno == -1)) { > + if (dbno == -1) { > /* > * Not allowed to allocate, return failure. > */ > @@ -1981,6 +1965,7 @@ xfs_dir2_node_addname_int( > * We had a data block so we don't have to make a new one. > */ > else { > +block_found: > /* > * If just checking, we succeeded. > */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 13:41 ` Brian Foster @ 2018-01-03 21:00 ` Dave Chinner 2018-01-04 14:04 ` Brian Foster 0 siblings, 1 reply; 8+ messages in thread From: Dave Chinner @ 2018-01-03 21:00 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote: > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote: > > In writing this, I think I can see a quick and simple change that > > will fix this case and improve most other directory grow workloads > > without affecting normal random directory insert/remove performance. > > That is, do a reverse order search starting at the last block rather > > than increasing order search starting at the first block..... > > > > Ok, now were are talking - performance and scalability improvements! > > > > create time(sec) / rate (files/s) > > File count vanilla loop-fix +reverse > > 10k 0.54 / 18.5k 0.53 / 18.9k 0.52 / 19.3k > > 20k 1.10 / 18.1k 1.05 / 19.0k 1.00 / 20.0k > > 100k 4.21 / 23.8k 3.91 / 25.6k 3.58 / 27.9k > > 200k 9.66 / 20,7k 7.37 / 27.1k 7.08 / 28.3k > > 1M 86.61 / 11.5k 48.26 / 20.7k 38.33 / 26.1k > > 2M 206.13 / 9.7k 129.71 / 15.4k 82.20 / 24.3k > > 10M 2843.57 / 3.5k 1817.39 / 5.5k 591.78 / 16.9k > > > > Theres still some non-linearity as we approach the 10M number, but > > it's still 5x faster to 10M inodes than the existing code.... > > > > Nice improvement.. I still need to look at the code, but a quick first > thought is that I wonder if there's somewhere we could stash a 'most > recent freeblock' once we have to grow the directory, even if just as an > in-core hint. Then we could jump straight to the latest block regardless > of the workload. I thought about that, and then wondered where to stash it, then wondered whether it would miss smaller, better fitting blocks, and then finally realised we didn't need to have cross-operation state to solve the common case of growing directories. > Hmm, thinking a little more about it, that may not be worth the > complication since part of the concept of "search failure" in this case > is tied to the size of the entry we want to add. Then again, I suppose > such is the case when searching forward/backward as well (i.e., one > large insert fails, grows inode, subsequent small insert may very well > have succeeded with the first freeblock, though now we'd always start at > the recently allocated block at the end). Right. Random hole filling in the directory shouldn't be greatly affected by forward or reverse search order - the eventual search distances are all the same. It does, OTOH, matter greatly for sequential inserts... After sleeping on it, I suspect that there's a simple on-disk mod to the dir3 header that will improve the search function for all workloads. The dir3 free block header: struct xfs_dir3_free_hdr { struct xfs_dir3_blk_hdr hdr; __be32 firstdb; /* db of first entry */ __be32 nvalid; /* count of valid entries */ __be32 nused; /* count of used entries */ __be32 pad; /* 64 bit alignment */ }; has 32 bits of padding in it, and the most entries a free block can have is just under 2^15. Hence we can turn that into a "bestfree" entry that tracks the largest freespace indexed by the block. Then the free block scan can start by checking the required length against the largest freespace in the bestfree entry and skip the block search altogether if there isn't an indexed block with enough free space inside the freespace index block we are searching. 
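Something like this, perhaps - a completely untested sketch, where the "bestfree" field name and the in-core xfs_dir3_icfree_hdr plumbing are purely illustrative rather than a worked out on-disk format change:

	struct xfs_dir3_free_hdr {
		struct xfs_dir3_blk_hdr	hdr;
		__be32			firstdb;	/* db of first entry */
		__be32			nvalid;		/* count of valid entries */
		__be32			nused;		/* count of used entries */
		__be16			bestfree;	/* largest bests[] entry here */
		__be16			pad;		/* 64 bit alignment */
	};

The addname scan could then reject an entire free index block as soon as its header has been decoded:

		dp->d_ops->free_hdr_from_disk(&freehdr, free);

		/* nothing indexed by this block can hold the new entry */
		if (freehdr.bestfree < length) {
			xfs_trans_brelse(tp, fbp);
			fbp = NULL;
			continue;
		}

The bestfree value would need to be kept up to date whenever an insert or removal changes the largest real (i.e. non-NULLDATAOFF) bests[] entry in the block, much like we already maintain the per data block bestfree array.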
That's a lot more work than just reversing the search order, but I think it's a mod we should (eventually) make because it is an improvement for all insert workloads, not just growing. The other (far more complex) option is to turn the freespace index into a btree, like we do with the hash indexes. Not sure we need to spend that much effort on this right now, though. > Anyways, just thinking out loud (and recovering from several weeks > vacation). :P Welcome back :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 21:00 ` Dave Chinner @ 2018-01-04 14:04 ` Brian Foster 2018-01-04 21:52 ` Dave Chinner 0 siblings, 1 reply; 8+ messages in thread From: Brian Foster @ 2018-01-04 14:04 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote: > On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote: > > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote: > > > In writing this, I think I can see a quick and simple change that > > > will fix this case and improve most other directory grow workloads > > > without affecting normal random directory insert/remove performance. > > > That is, do a reverse order search starting at the last block rather > > > than increasing order search starting at the first block..... > > > > > > Ok, now were are talking - performance and scalability improvements! > > > > > > create time(sec) / rate (files/s) > > > File count vanilla loop-fix +reverse > > > 10k 0.54 / 18.5k 0.53 / 18.9k 0.52 / 19.3k > > > 20k 1.10 / 18.1k 1.05 / 19.0k 1.00 / 20.0k > > > 100k 4.21 / 23.8k 3.91 / 25.6k 3.58 / 27.9k > > > 200k 9.66 / 20,7k 7.37 / 27.1k 7.08 / 28.3k > > > 1M 86.61 / 11.5k 48.26 / 20.7k 38.33 / 26.1k > > > 2M 206.13 / 9.7k 129.71 / 15.4k 82.20 / 24.3k > > > 10M 2843.57 / 3.5k 1817.39 / 5.5k 591.78 / 16.9k > > > > > > Theres still some non-linearity as we approach the 10M number, but > > > it's still 5x faster to 10M inodes than the existing code.... > > > > > > > Nice improvement.. I still need to look at the code, but a quick first > > thought is that I wonder if there's somewhere we could stash a 'most > > recent freeblock' once we have to grow the directory, even if just as an > > in-core hint. Then we could jump straight to the latest block regardless > > of the workload. > > I thought about that, and then wondered where to stash it, then > wondered whether it would miss smaller, better fitting blocks, and > then finally realised we didn't need to have cross-operation state > to solve the common case of growing directories. > Agreed with regard to the growing directories scenario. The thought was more around first: avoiding any potential negative effects for other workloads, and second: perhaps constructing a more generally applicable optimization (cases beyond purely seqential directory grow). > > Hmm, thinking a little more about it, that may not be worth the > > complication since part of the concept of "search failure" in this case > > is tied to the size of the entry we want to add. Then again, I suppose > > such is the case when searching forward/backward as well (i.e., one > > large insert fails, grows inode, subsequent small insert may very well > > have succeeded with the first freeblock, though now we'd always start at > > the recently allocated block at the end). > > Right. random hole filling in the directory shouldn't be greatly > affected by forward or reverse search order - the eventual search > distances are all the same. It does, OTOH, matter greatly for > sequntial inserts... > Yeah, I think that's a reasonable argument for simply swapping the search order (i.e., addresses the first concern above). > After sleeping on it, I suspect that there's a simple on-disk mod to > the dir3 header that will improve the search function for all > workloads. 
The dir3 free block header: > > struct xfs_dir3_free_hdr { > struct xfs_dir3_blk_hdr hdr; > __be32 firstdb; /* db of first entry */ > __be32 nvalid; /* count of valid entries */ > __be32 nused; /* count of used entries */ > __be32 pad; /* 64 bit alignment */ > }; > > has 32 bits of padding in it, and the most entries a free block can > have is just under 2^15. Hence we can turn that into a "bestfree" > entry that tracks the largest freespace indexed by the block. > > Then the free block scan can start by checking the required length > against the largest freespace in the bestfree entry and skip the > block search altogether if there isn't an indexed block with enough > free space inside the freespace index block we are searching. > So with this we'd still need to walk the range of free blocks, right? I.e., we'd just be able to shortcut individual free block processing (bests[] iteration) for those that obviously don't satisfy the request. Assuming I follow that correctly, that sounds reasonable from the perspective that it certainly eliminates work. I suppose how worthwhile it is probably depends on how much of an effect it has on the higher level directory workload. IOW, is performance measurably improved by skipping individual free block processing or is that cost mostly amortized by having to read/walk all free blocks in the first place? It may very well be a worthwhile optimization. I think the larger point is that more isolated testing is probably required to confirm. The improvement from this patch alone doesn't necessarily translate to an answer one way or another because this patch creates conditions that allow us to skip most of the scan altogether (by jumping right to the most recently added block). > That's a lot more work than just reversing the search order, but I > think it's a mod we should (eventually) make because it is an > improvement for all insert workloads, not just growing. > > The other (far more complex) option is to turn the freespace index > into a btree, like we do with the hash indexes. Not sure we need to > spend that much effort on this right now, though. > *nod* Brian > > Anyways, just thinking out loud (and recovering from several weeks > > vacation). :P > > Welcome back :) > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-04 14:04 ` Brian Foster @ 2018-01-04 21:52 ` Dave Chinner 0 siblings, 0 replies; 8+ messages in thread From: Dave Chinner @ 2018-01-04 21:52 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Thu, Jan 04, 2018 at 09:04:53AM -0500, Brian Foster wrote: > On Thu, Jan 04, 2018 at 08:00:15AM +1100, Dave Chinner wrote: > > On Wed, Jan 03, 2018 at 08:41:37AM -0500, Brian Foster wrote: > > > On Wed, Jan 03, 2018 at 10:59:10PM +1100, Dave Chinner wrote: > > > > In writing this, I think I can see a quick and simple change that > > > > will fix this case and improve most other directory grow workloads > > > > without affecting normal random directory insert/remove performance. > > > > That is, do a reverse order search starting at the last block rather > > > > than increasing order search starting at the first block..... > > > > > > > > Ok, now were are talking - performance and scalability improvements! > > > > > > > > create time(sec) / rate (files/s) > > > > File count vanilla loop-fix +reverse > > > > 10k 0.54 / 18.5k 0.53 / 18.9k 0.52 / 19.3k > > > > 20k 1.10 / 18.1k 1.05 / 19.0k 1.00 / 20.0k > > > > 100k 4.21 / 23.8k 3.91 / 25.6k 3.58 / 27.9k > > > > 200k 9.66 / 20,7k 7.37 / 27.1k 7.08 / 28.3k > > > > 1M 86.61 / 11.5k 48.26 / 20.7k 38.33 / 26.1k > > > > 2M 206.13 / 9.7k 129.71 / 15.4k 82.20 / 24.3k > > > > 10M 2843.57 / 3.5k 1817.39 / 5.5k 591.78 / 16.9k > > > > > > > > Theres still some non-linearity as we approach the 10M number, but > > > > it's still 5x faster to 10M inodes than the existing code.... > > > > > > > > > > Nice improvement.. I still need to look at the code, but a quick first > > > thought is that I wonder if there's somewhere we could stash a 'most > > > recent freeblock' once we have to grow the directory, even if just as an > > > in-core hint. Then we could jump straight to the latest block regardless > > > of the workload. ..... > > After sleeping on it, I suspect that there's a simple on-disk mod to > > the dir3 header that will improve the search function for all > > workloads. The dir3 free block header: > > > > struct xfs_dir3_free_hdr { > > struct xfs_dir3_blk_hdr hdr; > > __be32 firstdb; /* db of first entry */ > > __be32 nvalid; /* count of valid entries */ > > __be32 nused; /* count of used entries */ > > __be32 pad; /* 64 bit alignment */ > > }; > > > > has 32 bits of padding in it, and the most entries a free block can > > have is just under 2^15. Hence we can turn that into a "bestfree" > > entry that tracks the largest freespace indexed by the block. > > > > Then the free block scan can start by checking the required length > > against the largest freespace in the bestfree entry and skip the > > block search altogether if there isn't an indexed block with enough > > free space inside the freespace index block we are searching. > > So with this we'd still need to walk the range of free blocks, right? > I.e., we'd just be able to shortcut individual free block processing > (bests[] iteration) for those that obviously don't satisfy the request. Right. In the case of the 5M entry directory I was looking at, there were a total of 33 free index blocks. All the overhead was in scanning the 2016 entries in each block, not in iterating over the blocks. So if we can skip the entry scanning in a given block, then we have a general search improvement. i.e. it makes the free block index structure more like a skip list than a linear array. 
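Back of the envelope, assuming 4k directory blocks (~2016 bests entries per free index block): the worst case for that 5M entry directory today is roughly 33 x 2016 = ~66,500 bests comparisons for a single insert, while a per-block "largest free space" check in the header would cut that to ~33 header comparisons plus at most one block's worth of bests scanning - call it ~2000 comparisons, or roughly a 30x reduction whenever most of the index blocks are full.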
Also, if we keep the index into the free index block of the largest free space we have in the header, we can jump straight to it without needing to scan. Then when we modify the free index during the insert, we can do the "best free" scan at that point, similar to how we keep other bestfree indexes up to date. > Assuming I follow that correctly, that sounds reasonable from the > perspective that it certainly eliminates work. I suppose how worthwhile > it is probably depends on how much of an effect it has on the higher > level directory workload. IOW, is performance measurably improved by > skipping individual free block processing or is that cost mostly > amortized by having to read/walk all free blocks in the first place? According to the profile, the CPU cost of the read/walk of the free index blocks was within the noise floor. It didn't stand out as a major contributor, but I'd need to re-measure the overhead on a random insert/delete workload. However, my gut feel is that if we combine reverse order searching (for sequential insert) with block skipping (for random insert) we'll get substantial improvements across all insert operations on large directories... > It may very well be a worthwhile optimization. I think the larger point > is that more isolated testing is probably required to confirm. The > improvement from this patch alone doesn't necessarily translate to an > answer one way or another because this patch creates conditions that > allow us to skip most of the scan altogether (by jumping right to the > most recently added block). Yeah, lots of validation work, but given the overhead for a non-trivial search distance I was measuring (>40% of CPU time @10M inodes) I think it's worth pursuing. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 6:27 [PATCH] xfs: speed up directory bestfree block scanning Dave Chinner 2018-01-03 11:59 ` Dave Chinner @ 2018-01-03 13:28 ` Brian Foster 2018-01-03 20:38 ` Dave Chinner 1 sibling, 1 reply; 8+ messages in thread From: Brian Foster @ 2018-01-03 13:28 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > When running a "create millions inodes in a directory" test > recently, I noticed we were spending a huge amount of time > converting freespace block headers from disk format to in-memory > format: > > 31.47% [kernel] [k] xfs_dir2_node_addname > 17.86% [kernel] [k] xfs_dir3_free_hdr_from_disk > 3.55% [kernel] [k] xfs_dir3_free_bests_p > > We shouldn't be hitting the best free block scanning code so hard > when doing sequential directory creates, and it turns out there's > a highly suboptimal loop searching the the best free array in > the freespace block - it decodes the block header before checking > each entry inside a loop, instead of decoding the header once before > running the entry search loop. > > This makes a massive difference to create rates. Profile now looks > like this: > > 13.15% [kernel] [k] xfs_dir2_node_addname > 3.52% [kernel] [k] xfs_dir3_leaf_check_int > 3.11% [kernel] [k] xfs_log_commit_cil > > And the wall time/average file create rate differences are > just as stark: > > create time(sec) / rate (files/s) > File count vanilla patched > 10k 0.54 / 18.5k 0.53 / 18.9k > 20k 1.10 / 18.1k 1.05 / 19.0k > 100k 4.21 / 23.8k 3.91 / 25.6k > 200k 9.66 / 20,7k 7.37 / 27.1k > 1M 86.61 / 11.5k 48.26 / 20.7k > 2M 206.13 / 9.7k 129.71 / 15.4k > > The larger the directory, the bigger the performance improvement. > Interesting.. > Signed-Off-By: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/libxfs/xfs_dir2_node.c | 30 +++++++++++++++--------------- > 1 file changed, 15 insertions(+), 15 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c > index 682e2bf370c7..bcf0d43cd6a8 100644 > --- a/fs/xfs/libxfs/xfs_dir2_node.c > +++ b/fs/xfs/libxfs/xfs_dir2_node.c > @@ -1829,24 +1829,24 @@ xfs_dir2_node_addname_int( > */ > bests = dp->d_ops->free_bests_p(free); > dp->d_ops->free_hdr_from_disk(&freehdr, free); > - if (be16_to_cpu(bests[findex]) != NULLDATAOFF && > - be16_to_cpu(bests[findex]) >= length) > - dbno = freehdr.firstdb + findex; > - else { > - /* > - * Are we done with the freeblock? > - */ > - if (++findex == freehdr.nvalid) { > - /* > - * Drop the block. > - */ > - xfs_trans_brelse(tp, fbp); > - fbp = NULL; > - if (fblk && fblk->bp) > - fblk->bp = NULL; Ok, so we're adding a dir entry to a node dir and walking the free space blocks to see if we have space somewhere to insert the entry without growing the dir. The current code reads the free block, converts the header, checks bests[findex], then bumps findex or invalidates the free block if we're done with it. The updated code reads the free block, converts the header, iterates the free index range then invalidates the block when complete (assuming we don't find suitable free space). The end result is that we don't convert the block header over and over for each index in the individual block. Seems reasonable to me, just a couple nits... > + do { > + Extra space above. 
> + if (be16_to_cpu(bests[findex]) != NULLDATAOFF && > + be16_to_cpu(bests[findex]) >= length) { > + dbno = freehdr.firstdb + findex; > + break; > } > + } while (++findex < freehdr.nvalid); > + > + /* Drop the block if we done with the freeblock */ "... if we're done ..." Also FWIW, according to the comment it looks like the only reason the freehdr conversion is elevated to this scope is to accommodate gcc foolishness. If so, I'm wondering if a simple NULL init of bests at the top of the function would avoid that problem and allow us to move the code to where it was apparently intended to be in the first place. Hm? Brian > + if (findex == freehdr.nvalid) { > + xfs_trans_brelse(tp, fbp); > + fbp = NULL; > + if (fblk) > + fblk->bp = NULL; > } > } > + > /* > * If we don't have a data block, we need to allocate one and make > * the freespace entries refer to it. > -- > 2.15.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] xfs: speed up directory bestfree block scanning 2018-01-03 13:28 ` Brian Foster @ 2018-01-03 20:38 ` Dave Chinner 0 siblings, 0 replies; 8+ messages in thread From: Dave Chinner @ 2018-01-03 20:38 UTC (permalink / raw) To: Brian Foster; +Cc: linux-xfs On Wed, Jan 03, 2018 at 08:28:03AM -0500, Brian Foster wrote: > On Wed, Jan 03, 2018 at 05:27:48PM +1100, Dave Chinner wrote: > > + if (be16_to_cpu(bests[findex]) != NULLDATAOFF && > > + be16_to_cpu(bests[findex]) >= length) { > > + dbno = freehdr.firstdb + findex; > > + break; > > } > > + } while (++findex < freehdr.nvalid); > > + > > + /* Drop the block if we done with the freeblock */ > > "... if we're done ..." > > Also FWIW, according to the comment it looks like the only reason the > freehdr conversion is elevated to this scope is to accommodate gcc > foolishness. If so, I'm wondering if a simple NULL init of bests at the > top of the function would avoid that problem and allow us to move the > code to where it was apparently intended to be in the first place. Hm? Yeah, looking at the follow-on patch, there's a gigantic amount of cleanup needed in this function. There's a bunch of "gcc is so stupid" hacks amongst the code because the function is too long for gcc to correctly determine variable usage. I might sit down and factor it properly because that will make it a whole lot simpler and easier to understand... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 8+ messages in thread