* [PATCH 0/2] repair: fixes for 3.2.0-rc2
@ 2014-05-09 1:17 Dave Chinner
2014-05-09 1:17 ` [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers Dave Chinner
2014-05-09 1:17 ` [PATCH 2/2] repair: don't grind CPUs with large extent lists Dave Chinner
0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2014-05-09 1:17 UTC (permalink / raw)
To: xfs
Hi folks,
A couple of fixes identified for xfs_repair follow. The first is a
regression introduced by the discontiguous buffer support in the
prefetch code, which Eric Sandeen found. I've written a slightly
different fix for it than Eric did, but the concept for the fix is
all Eric's work.
The second patch does not fix a regression, but the filesystem
that triggered the above regression can't actually be repaired in a
sane amount of time because of an array insertion algorithm with
quadratic cost. It's trivial to fix, and it means xfs_repair can
actually run to completion quickly rather than burn CPU for hours
getting nothing done.
These fixes mean I'll probably release a 3.2.0-rc3 once they've been
reviewed and go into the tree....
Cheers,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers
2014-05-09 1:17 [PATCH 0/2] repair: fixes for 3.2.0-rc2 Dave Chinner
@ 2014-05-09 1:17 ` Dave Chinner
2014-05-09 2:14 ` Eric Sandeen
2014-05-09 1:17 ` [PATCH 2/2] repair: don't grind CPUs with large extent lists Dave Chinner
1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-05-09 1:17 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
The way discontiguous buffers are currently handled in prefetch is
by unlocking the prefetch tree and reading them one at a time in
pf_read_discontig(), inside the normal loop of searching for buffers
to read in a more optimized fashion.
But by unlocking the tree, we allow other threads to come in and
find buffers which we've already stashed locally on our bplist[].
If two threads think they own the same set of buffers, they may both
try to delete them from the prefetch btree; the second one to
arrive will not find them, resulting in:
fatal error -- prefetch corruption
To fix this, simply abort the buffer gathering loop when we come
across a discontiguous buffer, process the gathered list as normal,
and then, after running the large optimised read, check whether
the last buffer on the list is a discontiguous buffer.
If it is, issue the discontiguous buffer read while the locks are
not held. We only ever have one discontiguous buffer per read loop,
so it is safe to check just the last buffer in the list.
The fix is loosely based on a patch provided by Eric Sandeen, who
did all the hard work of finding the bug and demonstrating how to
fix it.
Reported-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/prefetch.c | 53 +++++++++++++++++++++++++++--------------------------
1 file changed, 27 insertions(+), 26 deletions(-)
diff --git a/repair/prefetch.c b/repair/prefetch.c
index 65fedf5..e269f1f 100644
--- a/repair/prefetch.c
+++ b/repair/prefetch.c
@@ -444,27 +444,6 @@ pf_read_inode_dirs(
}
/*
- * Discontiguous buffers require multiple IOs to fill, so we can't use any
- * linearising, hole filling algorithms on them to avoid seeks. Just remove them
- * for the prefetch queue and read them straight into the cache and release
- * them.
- */
-static void
-pf_read_discontig(
- struct prefetch_args *args,
- struct xfs_buf *bp)
-{
- if (!btree_delete(args->io_queue, XFS_DADDR_TO_FSB(mp, bp->b_bn)))
- do_error(_("prefetch corruption\n"));
-
- pthread_mutex_unlock(&args->lock);
- libxfs_readbufr_map(mp->m_ddev_targp, bp, 0);
- bp->b_flags |= LIBXFS_B_UNCHECKED;
- libxfs_putbuf(bp);
- pthread_mutex_lock(&args->lock);
-}
-
-/*
* pf_batch_read must be called with the lock locked.
*/
static void
@@ -496,13 +475,19 @@ pf_batch_read(
}
while (bplist[num] && num < MAX_BUFS && fsbno < max_fsbno) {
/*
- * Handle discontiguous buffers outside the seek
- * optimised IO loop below.
+ * Discontiguous buffers need special handling, so stop
+ * gathering new buffers and process the list and this
+ * discontigous buffer immediately. This avoids the
+ * complexity of keeping a separate discontigous buffer
+ * list and seeking back over ranges we've already done
+ * optimised reads for.
*/
if ((bplist[num]->b_flags & LIBXFS_B_DISCONTIG)) {
- pf_read_discontig(args, bplist[num]);
- bplist[num] = NULL;
- } else if (which != PF_META_ONLY ||
+ num++;
+ break;
+ }
+
+ if (which != PF_META_ONLY ||
!B_IS_INODE(XFS_BUF_PRIORITY(bplist[num])))
num++;
if (num == MAX_BUFS)
@@ -570,6 +555,22 @@ pf_batch_read(
* now read the data and put into the xfs_but_t's
*/
len = pread64(mp_fd, buf, (int)(last_off - first_off), first_off);
+
+ /*
+ * Check the last buffer on the list to see if we need to
+ * process a discontiguous buffer. The gather loop guarantees
+ * we only ever have a single discontig buffer on the list,
+ * and that it is the last buffer, so it is safe to do this
+ * check and read here like this while we aren't holding any
+ * locks.
+ */
+ if ((bplist[num - 1]->b_flags & LIBXFS_B_DISCONTIG)) {
+ libxfs_readbufr_map(mp->m_ddev_targp, bplist[num - 1], 0);
+ bplist[num - 1]->b_flags |= LIBXFS_B_UNCHECKED;
+ libxfs_putbuf(bplist[num - 1]);
+ num--;
+ }
+
if (len > 0) {
/*
* go through the xfs_buf_t list copying from the
--
1.9.0
* [PATCH 2/2] repair: don't grind CPUs with large extent lists
2014-05-09 1:17 [PATCH 0/2] repair: fixes for 3.2.0-rc2 Dave Chinner
2014-05-09 1:17 ` [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers Dave Chinner
@ 2014-05-09 1:17 ` Dave Chinner
2014-05-09 3:01 ` Eric Sandeen
1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-05-09 1:17 UTC (permalink / raw)
To: xfs
From: Dave Chinner <dchinner@redhat.com>
When repairing a large filesystem with fragmented files, xfs_repair
can grind to a halt burning multiple CPUs in blkmap_set_ext().
blkmap_set_ext() inserts extents into the blockmap for the inode
fork and keeps them in order, even if the inserts are not done in
order.
The ordered insert is highly inefficient - it starts at the first
extent and simply walks the array to find the insertion point, i.e.
it is an O(N) operation. When we have a fragmented file with a
large number of extents, the entire mapping operation becomes
very costly.
The thing is, we are doing the insertion from an *ordered btree
scan* which inserts the extents in ascending offset order.
IOWs, we are always inserting the new extent at the end of the array
after searching the entire array, i.e. the overall mapping operation
cost is O(N^2).
Fix this simply by reversing the order of the insert slot search.
Starting at the end of the blockmap array, where almost all
insertions occur, brings the overhead of each insertion down to
O(1). This, in turn, reduces the overall map building operation to
O(N), so performance degrades linearly with increasing extent
counts rather than quadratically.
The result is that the test filesystem (27TB, 30M inodes, at ENOSPC)
takes 5m10s to *fully repair* on my test system, rather than getting
15 (of 60) AGs into phase three and sitting there burning 3-4 CPUs
making no progress for over half an hour.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
repair/bmap.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/repair/bmap.c b/repair/bmap.c
index 14161cb..1e913c5 100644
--- a/repair/bmap.c
+++ b/repair/bmap.c
@@ -260,7 +260,15 @@ blkmap_grow(
{
pthread_key_t key = dblkmap_key;
blkmap_t *new_blkmap;
- int new_naexts = blkmap->naexts + 4;
+ int new_naexts;
+
+ /* reduce the number of reallocations for large files */
+ if (blkmap->naexts < 1000)
+ new_naexts = blkmap->naexts + 4;
+ else if (blkmap->naexts < 10000)
+ new_naexts = blkmap->naexts + 100;
+ else
+ new_naexts = blkmap->naexts + 1000;
if (pthread_getspecific(key) != blkmap) {
key = ablkmap_key;
@@ -308,7 +316,7 @@ blkmap_set_ext(
xfs_dfilblks_t c)
{
blkmap_t *blkmap = *blkmapp;
- xfs_extnum_t i;
+ xfs_extnum_t i = 0;
if (blkmap->nexts == blkmap->naexts) {
blkmap = blkmap_grow(blkmap);
@@ -318,15 +326,27 @@ blkmap_set_ext(
}
ASSERT(blkmap->nexts < blkmap->naexts);
- for (i = 0; i < blkmap->nexts; i++) {
- if (blkmap->exts[i].startoff > o) {
- memmove(blkmap->exts + i + 1,
- blkmap->exts + i,
- sizeof(bmap_ext_t) * (blkmap->nexts - i));
+
+ if (blkmap->nexts == 0)
+ goto insert;
+
+ /*
+ * Do a reverse order search as the most common insertion during
+ * a bmapbt scan is at the end of the map. Use "plus 1" indexing
+ * for the loop counter so when we break out we are at the correct index
+ * for insertion.
+ */
+ for (i = blkmap->nexts; i > 0; i--) {
+ if (blkmap->exts[i - 1].startoff < o)
break;
- }
}
+ /* make space for the new extent */
+ memmove(blkmap->exts + i + 1,
+ blkmap->exts + i,
+ sizeof(bmap_ext_t) * (blkmap->nexts - i));
+
+insert:
blkmap->exts[i].startoff = o;
blkmap->exts[i].startblock = b;
blkmap->exts[i].blockcount = c;
--
1.9.0
* Re: [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers
2014-05-09 1:17 ` [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers Dave Chinner
@ 2014-05-09 2:14 ` Eric Sandeen
2014-05-09 3:53 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2014-05-09 2:14 UTC (permalink / raw)
To: Dave Chinner, xfs
On 5/8/14, 8:17 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> The way discontiguous buffers are currently handled in prefetch is
> by unlocking the prefetch tree and reading them one at a time in
> pf_read_discontig(), inside the normal loop of searching for buffers
> to read in a more optimized fashion.
>
> But by unlocking the tree, we allow other threads to come in and
> find buffers which we've already stashed locally on our bplist[].
> If 2 threads think they own the same set of buffers, they may both
> try to delete them from the prefetch btree, and the second one to
> arrive will not find it, resulting in:
>
> fatal error -- prefetch corruption
>
> To fix this, simply abort the buffer gathering loop when we come
> across a discontiguous buffer, process the gathered list as per
> normal, and then after running the large optimised read, check to
> see if the last buffer on the list is a discontiguous buffer.
> If is is discontiguous, then issue the discontiguous buffer read
> while the locks are not held. We only ever have one discontiguous
> buffer per read loop, so it is safe just to check the last buffer in
> the list.
>
> The fix is loosely based on a a patch provided by Eric Sandeen, who
> did all the hard work of finding the bug and demonstrating how to
> fix it.
Ok, this makes sense to me. The comment above the discontig read
seems a bit confusing; you say it's safe to read while unlocked,
but I wouldn't have expected it not to be - the lock is just for
btree manipulation, and that's not being done. So I think the
comment adds a little confusion rather than clarification.
But the code looks fine.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
> Reported-by:Eric Sandeen <sandeen@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/prefetch.c | 53 +++++++++++++++++++++++++++--------------------------
> 1 file changed, 27 insertions(+), 26 deletions(-)
>
> diff --git a/repair/prefetch.c b/repair/prefetch.c
> index 65fedf5..e269f1f 100644
> --- a/repair/prefetch.c
> +++ b/repair/prefetch.c
> @@ -444,27 +444,6 @@ pf_read_inode_dirs(
> }
>
> /*
> - * Discontiguous buffers require multiple IOs to fill, so we can't use any
> - * linearising, hole filling algorithms on them to avoid seeks. Just remove them
> - * for the prefetch queue and read them straight into the cache and release
> - * them.
> - */
> -static void
> -pf_read_discontig(
> - struct prefetch_args *args,
> - struct xfs_buf *bp)
> -{
> - if (!btree_delete(args->io_queue, XFS_DADDR_TO_FSB(mp, bp->b_bn)))
> - do_error(_("prefetch corruption\n"));
> -
> - pthread_mutex_unlock(&args->lock);
> - libxfs_readbufr_map(mp->m_ddev_targp, bp, 0);
> - bp->b_flags |= LIBXFS_B_UNCHECKED;
> - libxfs_putbuf(bp);
> - pthread_mutex_lock(&args->lock);
> -}
> -
> -/*
> * pf_batch_read must be called with the lock locked.
> */
> static void
> @@ -496,13 +475,19 @@ pf_batch_read(
> }
> while (bplist[num] && num < MAX_BUFS && fsbno < max_fsbno) {
> /*
> - * Handle discontiguous buffers outside the seek
> - * optimised IO loop below.
> + * Discontiguous buffers need special handling, so stop
> + * gathering new buffers and process the list and this
> + * discontigous buffer immediately. This avoids the
> + * complexity of keeping a separate discontigous buffer
> + * list and seeking back over ranges we've already done
> + * optimised reads for.
> */
> if ((bplist[num]->b_flags & LIBXFS_B_DISCONTIG)) {
> - pf_read_discontig(args, bplist[num]);
> - bplist[num] = NULL;
> - } else if (which != PF_META_ONLY ||
> + num++;
> + break;
> + }
> +
> + if (which != PF_META_ONLY ||
> !B_IS_INODE(XFS_BUF_PRIORITY(bplist[num])))
> num++;
> if (num == MAX_BUFS)
> @@ -570,6 +555,22 @@ pf_batch_read(
> * now read the data and put into the xfs_but_t's
> */
> len = pread64(mp_fd, buf, (int)(last_off - first_off), first_off);
> +
> + /*
> + * Check the last buffer on the list to see if we need to
> + * process a discontiguous buffer. The gather loop guarantees
> + * we only ever have a single discontig buffer on the list,
> + * and that it is the last buffer, so it is safe to do this
> + * check and read here like this while we aren't holding any
> + * locks.
> + */
> + if ((bplist[num - 1]->b_flags & LIBXFS_B_DISCONTIG)) {
> + libxfs_readbufr_map(mp->m_ddev_targp, bplist[num - 1], 0);
> + bplist[num - 1]->b_flags |= LIBXFS_B_UNCHECKED;
> + libxfs_putbuf(bplist[num - 1]);
> + num--;
> + }
> +
> if (len > 0) {
> /*
> * go through the xfs_buf_t list copying from the
>
* Re: [PATCH 2/2] repair: don't grind CPUs with large extent lists
2014-05-09 1:17 ` [PATCH 2/2] repair: don't grind CPUs with large extent lists Dave Chinner
@ 2014-05-09 3:01 ` Eric Sandeen
2014-05-09 3:56 ` Dave Chinner
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sandeen @ 2014-05-09 3:01 UTC (permalink / raw)
To: Dave Chinner, xfs
On 5/8/14, 8:17 PM, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> When repairing a large filesystem with fragemented files, xfs_repair
> can grind to a halt burning multiple CPUs in blkmap_set_ext().
> blkmap_set_ext() inserts extents into the blockmap for the inode
> fork and it keeps them in order, even if the inserts are not done in
> order.
>
> The ordered insert is highly inefficient - it starts at the first
> extent, and simple walks the array to find the insertion point. i.e.
> it is an O(n) operation. When we have a fragemented file with a
> large number of extents, the cost of the entire mapping operation
> is rather costly.
>
> The thing is, we are doing the insertion from an *ordered btree
> scan* which is inserting the extents in ascending offset order.
> IOWs, we are always inserting the extent at the end of the array
> after searching the entire array. i.e. the mapping operation cost is
> O(N^2).
>
> Fix this simply by reversing the order of the insert slot search.
> Start at the end of the blockmap array when we do almost all
> insertions, which brings the overhead of each insertion down to O(1)
> complexity. This, in turn, results in the overall map building
> operation being reduced to an O(N) operation, and so performance
> degrades linearly with increasing extent counts rather than
> exponentially.
>
> The result is that the test filesystem (27TB, 30M inodes, at ENOSPC)
> takes 5m10s to *fully repair* on my test system, rather that getting
> 15 (of 60) AGs into phase three and sitting there burning 3-4 CPUs
> making no progress for over half an hour.
Did the blkmap_grow() changes sneak in here accidentally?
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> repair/bmap.c | 36 ++++++++++++++++++++++++++++--------
> 1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/repair/bmap.c b/repair/bmap.c
> index 14161cb..1e913c5 100644
> --- a/repair/bmap.c
> +++ b/repair/bmap.c
> @@ -260,7 +260,15 @@ blkmap_grow(
> {
> pthread_key_t key = dblkmap_key;
> blkmap_t *new_blkmap;
> - int new_naexts = blkmap->naexts + 4;
> + int new_naexts;
> +
> + /* reduce the number of reallocations for large files */
> + if (blkmap->naexts < 1000)
> + new_naexts = blkmap->naexts + 4;
> + else if (blkmap->naexts < 10000)
> + new_naexts = blkmap->naexts + 100;
> + else
> + new_naexts = blkmap->naexts + 1000;
>
> if (pthread_getspecific(key) != blkmap) {
> key = ablkmap_key;
> @@ -308,7 +316,7 @@ blkmap_set_ext(
> xfs_dfilblks_t c)
> {
> blkmap_t *blkmap = *blkmapp;
> - xfs_extnum_t i;
> + xfs_extnum_t i = 0;
I wonder if rather than initializing i here...
>
> if (blkmap->nexts == blkmap->naexts) {
> blkmap = blkmap_grow(blkmap);
> @@ -318,15 +326,27 @@ blkmap_set_ext(
> }
>
> ASSERT(blkmap->nexts < blkmap->naexts);
> - for (i = 0; i < blkmap->nexts; i++) {
> - if (blkmap->exts[i].startoff > o) {
> - memmove(blkmap->exts + i + 1,
> - blkmap->exts + i,
> - sizeof(bmap_ext_t) * (blkmap->nexts - i));
> +
> + if (blkmap->nexts == 0)
... setting i = 0 here would be a bit more explicit & clear?
> + goto insert;
> +
> + /*
> + * Do a reverse order search as the most common insertion during
> + * a bmapbt scan is at the end of the map. Use "plus 1" indexing
> + * for the loop counter so when we break out we are at the correct index
> + * for insertion.
> + */
perhaps say why it's most common at the end?
> + for (i = blkmap->nexts; i > 0; i--) {
> + if (blkmap->exts[i - 1].startoff < o)
> break;
> - }
> }
>
> + /* make space for the new extent */
> + memmove(blkmap->exts + i + 1,
> + blkmap->exts + i,
> + sizeof(bmap_ext_t) * (blkmap->nexts - i));
> +
> +insert:
> blkmap->exts[i].startoff = o;
> blkmap->exts[i].startblock = b;
> blkmap->exts[i].blockcount = c;
>
* Re: [PATCH 1/2] repair: don't unlock prefetch tree to read discontig buffers
2014-05-09 2:14 ` Eric Sandeen
@ 2014-05-09 3:53 ` Dave Chinner
0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2014-05-09 3:53 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs
On Thu, May 08, 2014 at 09:14:31PM -0500, Eric Sandeen wrote:
> On 5/8/14, 8:17 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The way discontiguous buffers are currently handled in prefetch is
> > by unlocking the prefetch tree and reading them one at a time in
> > pf_read_discontig(), inside the normal loop of searching for buffers
> > to read in a more optimized fashion.
> >
> > But by unlocking the tree, we allow other threads to come in and
> > find buffers which we've already stashed locally on our bplist[].
> > If 2 threads think they own the same set of buffers, they may both
> > try to delete them from the prefetch btree, and the second one to
> > arrive will not find it, resulting in:
> >
> > fatal error -- prefetch corruption
> >
> > To fix this, simply abort the buffer gathering loop when we come
> > across a discontiguous buffer, process the gathered list as per
> > normal, and then after running the large optimised read, check to
> > see if the last buffer on the list is a discontiguous buffer.
> > If is is discontiguous, then issue the discontiguous buffer read
> > while the locks are not held. We only ever have one discontiguous
> > buffer per read loop, so it is safe just to check the last buffer in
> > the list.
> >
> > The fix is loosely based on a a patch provided by Eric Sandeen, who
> > did all the hard work of finding the bug and demonstrating how to
> > fix it.
>
> Ok, this makes sense to me. The comment above the discontig read
> seems a bit confusing; you say it's safe to read while unlocked,
> but I wouldn't have expected it not to be - the lock is just for
> btree manipulation, and that's not being done. So I think the
> comment adds a little confusion rather than clarification.
Ok, I'll just drop the bit about it being safe to read - the bit
about being the last buffer on the list is the important bit...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/2] repair: don't grind CPUs with large extent lists
2014-05-09 3:01 ` Eric Sandeen
@ 2014-05-09 3:56 ` Dave Chinner
2014-05-09 4:18 ` Eric Sandeen
0 siblings, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2014-05-09 3:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: xfs
On Thu, May 08, 2014 at 10:01:37PM -0500, Eric Sandeen wrote:
> On 5/8/14, 8:17 PM, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > When repairing a large filesystem with fragemented files, xfs_repair
> > can grind to a halt burning multiple CPUs in blkmap_set_ext().
> > blkmap_set_ext() inserts extents into the blockmap for the inode
> > fork and it keeps them in order, even if the inserts are not done in
> > order.
> >
> > The ordered insert is highly inefficient - it starts at the first
> > extent, and simple walks the array to find the insertion point. i.e.
> > it is an O(n) operation. When we have a fragemented file with a
> > large number of extents, the cost of the entire mapping operation
> > is rather costly.
> >
> > The thing is, we are doing the insertion from an *ordered btree
> > scan* which is inserting the extents in ascending offset order.
> > IOWs, we are always inserting the extent at the end of the array
> > after searching the entire array. i.e. the mapping operation cost is
> > O(N^2).
> >
> > Fix this simply by reversing the order of the insert slot search.
> > Start at the end of the blockmap array when we do almost all
> > insertions, which brings the overhead of each insertion down to O(1)
> > complexity. This, in turn, results in the overall map building
> > operation being reduced to an O(N) operation, and so performance
> > degrades linearly with increasing extent counts rather than
> > exponentially.
> >
> > The result is that the test filesystem (27TB, 30M inodes, at ENOSPC)
> > takes 5m10s to *fully repair* on my test system, rather that getting
> > 15 (of 60) AGs into phase three and sitting there burning 3-4 CPUs
> > making no progress for over half an hour.
>
> Did the blkmap_grow() changes sneak in here accidentally?
Ah, forgot to mention them. Doing a realloc every 4 entries when we
have hundreds of thousands of extents is just silly ;)
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > repair/bmap.c | 36 ++++++++++++++++++++++++++++--------
> > 1 file changed, 28 insertions(+), 8 deletions(-)
> >
> > diff --git a/repair/bmap.c b/repair/bmap.c
> > index 14161cb..1e913c5 100644
> > --- a/repair/bmap.c
> > +++ b/repair/bmap.c
> > @@ -260,7 +260,15 @@ blkmap_grow(
> > {
> > pthread_key_t key = dblkmap_key;
> > blkmap_t *new_blkmap;
> > - int new_naexts = blkmap->naexts + 4;
> > + int new_naexts;
> > +
> > + /* reduce the number of reallocations for large files */
> > + if (blkmap->naexts < 1000)
> > + new_naexts = blkmap->naexts + 4;
> > + else if (blkmap->naexts < 10000)
> > + new_naexts = blkmap->naexts + 100;
> > + else
> > + new_naexts = blkmap->naexts + 1000;
> >
> > if (pthread_getspecific(key) != blkmap) {
> > key = ablkmap_key;
> > @@ -308,7 +316,7 @@ blkmap_set_ext(
> > xfs_dfilblks_t c)
> > {
> > blkmap_t *blkmap = *blkmapp;
> > - xfs_extnum_t i;
> > + xfs_extnum_t i = 0;
>
> I wonder if rather than initializing i here...
>
> >
> > if (blkmap->nexts == blkmap->naexts) {
> > blkmap = blkmap_grow(blkmap);
> > @@ -318,15 +326,27 @@ blkmap_set_ext(
> > }
> >
> > ASSERT(blkmap->nexts < blkmap->naexts);
> > - for (i = 0; i < blkmap->nexts; i++) {
> > - if (blkmap->exts[i].startoff > o) {
> > - memmove(blkmap->exts + i + 1,
> > - blkmap->exts + i,
> > - sizeof(bmap_ext_t) * (blkmap->nexts - i));
> > +
> > + if (blkmap->nexts == 0)
>
> ... setting i = 0 here would be a bit more explicit & clear?
*nod*. Will do.
> > + goto insert;
> > +
> > + /*
> > + * Do a reverse order search as the most common insertion during
> > + * a bmapbt scan is at the end of the map. Use "plus 1" indexing
> > + * for the loop counter so when we break out we are at the correct index
> > + * for insertion.
> > + */
>
> perhaps say why it's most common at the end?
OK, I'll add "because we are doing an ascending offset order scan of
the bmapbt" or words to that effect.
Thanks!
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/2] repair: don't grind CPUs with large extent lists
2014-05-09 3:56 ` Dave Chinner
@ 2014-05-09 4:18 ` Eric Sandeen
0 siblings, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2014-05-09 4:18 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On 5/8/14, 10:56 PM, Dave Chinner wrote:
> On Thu, May 08, 2014 at 10:01:37PM -0500, Eric Sandeen wrote:
>> On 5/8/14, 8:17 PM, Dave Chinner wrote:
>>> From: Dave Chinner <dchinner@redhat.com>
>>>
>>> When repairing a large filesystem with fragemented files, xfs_repair
>>> can grind to a halt burning multiple CPUs in blkmap_set_ext().
>>> blkmap_set_ext() inserts extents into the blockmap for the inode
>>> fork and it keeps them in order, even if the inserts are not done in
>>> order.
>>>
>>> The ordered insert is highly inefficient - it starts at the first
>>> extent, and simple walks the array to find the insertion point. i.e.
>>> it is an O(n) operation. When we have a fragemented file with a
>>> large number of extents, the cost of the entire mapping operation
>>> is rather costly.
>>>
>>> The thing is, we are doing the insertion from an *ordered btree
>>> scan* which is inserting the extents in ascending offset order.
>>> IOWs, we are always inserting the extent at the end of the array
>>> after searching the entire array. i.e. the mapping operation cost is
>>> O(N^2).
>>>
>>> Fix this simply by reversing the order of the insert slot search.
>>> Start at the end of the blockmap array when we do almost all
>>> insertions, which brings the overhead of each insertion down to O(1)
>>> complexity. This, in turn, results in the overall map building
>>> operation being reduced to an O(N) operation, and so performance
>>> degrades linearly with increasing extent counts rather than
>>> exponentially.
>>>
>>> The result is that the test filesystem (27TB, 30M inodes, at ENOSPC)
>>> takes 5m10s to *fully repair* on my test system, rather that getting
>>> 15 (of 60) AGs into phase three and sitting there burning 3-4 CPUs
>>> making no progress for over half an hour.
>>
>> Did the blkmap_grow() changes sneak in here accidentally?
>
> Ah, forgot to mention them. Do a realloc every 4 entries when we
> have hundreds of thousands of extents is just silly ;)
<snip>
> OK, I'll add "because we are doing an ascending offset order scan of
> the bmapbt" or words to that effect.
Ok, with those agreed-upon changes feel free to add a
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
- thanks for getting these done.
-Eric
> Thanks!
>
> Cheers,
>
> Dave.
>