public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: fix bmbt block allocation failures
@ 2011-04-11  2:34 Dave Chinner
  2011-04-11 23:41 ` Christoph Hellwig
  2011-04-12  0:55 ` Lachlan McIlroy
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Chinner @ 2011-04-11  2:34 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When a multilevel bmbt split occurs, we can be asked to allocate
blocks from an AG that has no space left available. In the case of
an extent just being allocated, the first bmbt block allocation sees
the firstblock parameter is set and does not set a minleft parameter
for the allocation. The allocation also does not set the total
number of blocks required by the allocation, either.

If the extent allocation used all the free space in the AG, the lack
of a total allocation size results in a block being reserved for the
AG freespace btree manipulations being used incorrectly.

Secondly, the allocation is only allowed to be allocated in the
current AG (NEAR_BNO) because the low space algorithm has not been
activated. As a result of this, the second block allocation in the
split fails to find sufficient space in the current AG and fails,
resulting in a dirty transaction cancellation and a filesytem
shutdown.

There are two problems here - both pointed out by Lachlan McIlroy
when we were first discussing a proposed fix for the problem (See
http://oss.sgi.com/archives/xfs/2010-09/msg00438.html).  Firstly,
the missing args->total was causing the initial block to be
allocated from the freespace reserve. Secondly that the NEAR_BNO
allocation should probably be a START_BNO allocation to allow blocks
to be taken from a higher numbered AG.

With these changes, the allocation failure goes away, but we trigger
asserts in xfs_bmapi() that try to ensure that bmbt block
allocations only occur in the same AG as the extent allocated,
except if the low space algorithm is active. In this case, we don't
need to activate the low space algorithm - if a START_BNO allocation
fails with minleft = 0, then there is no space we can allocate at
all, and hence the low space algorithm is no help. Therefore the
asserts need modification reflect the new state of affairs.

This commit fixes the problem demonstrated by the test case in
xfstests 250.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c       |    8 ++------
 fs/xfs/xfs_bmap_btree.c |    7 ++-----
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index fa00788..50eed97 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4917,13 +4917,9 @@ error0:
 	if (cur) {
 		if (!error) {
 			ASSERT(*firstblock == NULLFSBLOCK ||
-			       XFS_FSB_TO_AGNO(mp, *firstblock) ==
+			       XFS_FSB_TO_AGNO(mp, *firstblock) <=
 			       XFS_FSB_TO_AGNO(mp,
-				       cur->bc_private.b.firstblock) ||
-			       (flist->xbf_low &&
-				XFS_FSB_TO_AGNO(mp, *firstblock) <
-				XFS_FSB_TO_AGNO(mp,
-					cur->bc_private.b.firstblock)));
+				       cur->bc_private.b.firstblock));
 			*firstblock = cur->bc_private.b.firstblock;
 		}
 		xfs_btree_del_cursor(cur,
diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
index 87d3c10..b2cdbfe 100644
--- a/fs/xfs/xfs_bmap_btree.c
+++ b/fs/xfs/xfs_bmap_btree.c
@@ -518,10 +518,11 @@ xfs_bmbt_alloc_block(
 	args.mp = cur->bc_mp;
 	args.fsbno = cur->bc_private.b.firstblock;
 	args.firstblock = args.fsbno;
+	args.total = 1;
+	args.type = XFS_ALLOCTYPE_START_BNO;
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);
-		args.type = XFS_ALLOCTYPE_START_BNO;
 		/*
 		 * Make sure there is sufficient room left in the AG to
 		 * complete a full tree split for an extent insert.  If
@@ -534,10 +535,6 @@ xfs_bmbt_alloc_block(
 		 * block allocation here and corrupt the filesystem.
 		 */
 		args.minleft = xfs_trans_get_block_res(args.tp);
-	} else if (cur->bc_private.b.flist->xbf_low) {
-		args.type = XFS_ALLOCTYPE_START_BNO;
-	} else {
-		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 	}
 
 	args.minlen = args.maxlen = args.prod = 1;
-- 
1.7.2.3

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-11  2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner
@ 2011-04-11 23:41 ` Christoph Hellwig
  2011-04-11 23:56   ` Dave Chinner
  2011-04-12  0:55 ` Lachlan McIlroy
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2011-04-11 23:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Mon, Apr 11, 2011 at 12:34:11PM +1000, Dave Chinner wrote:
> http://oss.sgi.com/archives/xfs/2010-09/msg00438.html).  Firstly,
> the missing args->total was causing the initial block to be
> allocated from the freespace reserve. Secondly that the NEAR_BNO
> allocation should probably be a START_BNO allocation to allow blocks
> to be taken from a higher numbered AG.

Looking over the use of ->total, isn't the ialloc btree code also
incorrect by setting ->total to 0?

The actual change looks good to me,

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-11 23:41 ` Christoph Hellwig
@ 2011-04-11 23:56   ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2011-04-11 23:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Apr 11, 2011 at 07:41:58PM -0400, Christoph Hellwig wrote:
> On Mon, Apr 11, 2011 at 12:34:11PM +1000, Dave Chinner wrote:
> > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html).  Firstly,
> > the missing args->total was causing the initial block to be
> > allocated from the freespace reserve. Secondly that the NEAR_BNO
> > allocation should probably be a START_BNO allocation to allow blocks
> > to be taken from a higher numbered AG.
> 
> Looking over the use of ->total, isn't the ialloc btree code also
> incorrect by setting ->total to 0?

It probably is - I hadn't got that far yet. I wanted to get this
tested and reviewed so test 250 would no longer assert fail
everyone's debug build....

> The actual change looks good to me,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-11  2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner
  2011-04-11 23:41 ` Christoph Hellwig
@ 2011-04-12  0:55 ` Lachlan McIlroy
  2011-04-12  6:53   ` Dave Chinner
  1 sibling, 1 reply; 7+ messages in thread
From: Lachlan McIlroy @ 2011-04-12  0:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

----- Original Message -----
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a multilevel bmbt split occurs, we can be asked to allocate
> blocks from an AG that has no space left available. In the case of
> an extent just being allocated, the first bmbt block allocation sees
> the firstblock parameter is set and does not set a minleft parameter
> for the allocation. The allocation also does not set the total
> number of blocks required by the allocation, either.
> 
> If the extent allocation used all the free space in the AG, the lack
> of a total allocation size results in a block being reserved for the
> AG freespace btree manipulations being used incorrectly.
> 
> Secondly, the allocation is only allowed to be allocated in the
> current AG (NEAR_BNO) because the low space algorithm has not been
> activated. As a result of this, the second block allocation in the
> split fails to find sufficient space in the current AG and fails,
> resulting in a dirty transaction cancellation and a filesytem
> shutdown.
> 
> There are two problems here - both pointed out by Lachlan McIlroy
> when we were first discussing a proposed fix for the problem (See
> http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly,
> the missing args->total was causing the initial block to be
> allocated from the freespace reserve. Secondly that the NEAR_BNO
> allocation should probably be a START_BNO allocation to allow blocks
> to be taken from a higher numbered AG.
> 
> With these changes, the allocation failure goes away, but we trigger
> asserts in xfs_bmapi() that try to ensure that bmbt block
> allocations only occur in the same AG as the extent allocated,
> except if the low space algorithm is active. In this case, we don't
> need to activate the low space algorithm - if a START_BNO allocation
> fails with minleft = 0, then there is no space we can allocate at
> all, and hence the low space algorithm is no help. Therefore the
> asserts need modification reflect the new state of affairs.
> 
> This commit fixes the problem demonstrated by the test case in
> xfstests 250.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_bmap.c | 8 ++------
> fs/xfs/xfs_bmap_btree.c | 7 ++-----
> 2 files changed, 4 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index fa00788..50eed97 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -4917,13 +4917,9 @@ error0:
> if (cur) {
> if (!error) {
> ASSERT(*firstblock == NULLFSBLOCK ||
> - XFS_FSB_TO_AGNO(mp, *firstblock) ==
> + XFS_FSB_TO_AGNO(mp, *firstblock) <=
> XFS_FSB_TO_AGNO(mp,
> - cur->bc_private.b.firstblock) ||
> - (flist->xbf_low &&
> - XFS_FSB_TO_AGNO(mp, *firstblock) <
> - XFS_FSB_TO_AGNO(mp,
> - cur->bc_private.b.firstblock)));
> + cur->bc_private.b.firstblock));
> *firstblock = cur->bc_private.b.firstblock;
> }
> xfs_btree_del_cursor(cur,

There's another ASSERT like this earlier in xfs_bmapi() after the call
to xfs_bmap_alloc() which may need the same attention.

> diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> index 87d3c10..b2cdbfe 100644
> --- a/fs/xfs/xfs_bmap_btree.c
> +++ b/fs/xfs/xfs_bmap_btree.c
> @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block(
> args.mp = cur->bc_mp;
> args.fsbno = cur->bc_private.b.firstblock;
> args.firstblock = args.fsbno;
> + args.total = 1;
> + args.type = XFS_ALLOCTYPE_START_BNO;
> 
> if (args.fsbno == NULLFSBLOCK) {
> args.fsbno = be64_to_cpu(start->l);
> - args.type = XFS_ALLOCTYPE_START_BNO;
> /*
> * Make sure there is sufficient room left in the AG to
> * complete a full tree split for an extent insert. If
> @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block(
> * block allocation here and corrupt the filesystem.
> */
> args.minleft = xfs_trans_get_block_res(args.tp);
> - } else if (cur->bc_private.b.flist->xbf_low) {
> - args.type = XFS_ALLOCTYPE_START_BNO;
> - } else {
> - args.type = XFS_ALLOCTYPE_NEAR_BNO;
> }
> 
> args.minlen = args.maxlen = args.prod = 1;

I think that's exactly what I had in mind when we looked at this last
time but now I'm not so sure.  I think I understand now why there is a
requirement that all subsequent allocations must (read 'can') come from
the same AG.

The first time we try to allocate blocks we look for an AG with all the
space we need for the reservation to succeed.  If we don't find a single
AG that will satisfy the full request then we use the low space algorithm
and start from AG 0.  But if we do find such an AG with all the space we
need it should be fine to use NEAR_BNO on subsequent allocations because
the space is there, right?

If this logic fails to hold true then there are other problems we need to
deal with including;

If we've already allocated a block from AG N then all subsequent
allocations must come from AGs >= N to prevent locking issues.  If the
blocks we need are in AGs < N we're in trouble.  We need to fall back to
the low space algorithm and start from AG 0 but we can't do this if we 
have already allocated from AG N.

Also if we pass START_BNO to xfs_alloc_vextent() then it will fall
through to the ANY/START/FIRST_AG case and set flags to TRYLOCK.  It
will then scan AGs from N to the last AG and skip any that are already
locked.  It may skip AGs that have the space we need and we may get to
the last AG without finding all required blocks.

Then xfs_alloc_vextent() will loop through the AGs a second time if we
didn't get all we needed the first time but this time without the TRYLOCK.
This could result in locking AGs out of order.

So why is the extent allocation using all of the remaining blocks in the
AG and not leaving any for the split?  Are we not reserving the split
space with the conversion transaction?  Is it because the space needed
for splits is not accounted for when delayed allocations are reserved so
we oversubscribe ourselves?  Or are we not activating the low space
algorithm early enough?

I need to look at the code a bit more to understand all this.  The same
allocation logic appears to be used in xfs_bmap_extents_to_btree(),
xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's some
logic in the logic.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-12  0:55 ` Lachlan McIlroy
@ 2011-04-12  6:53   ` Dave Chinner
  2011-04-12  8:53     ` Dave Chinner
  2011-04-15  7:29     ` Lachlan McIlroy
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Chinner @ 2011-04-12  6:53 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs

On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote:
> ----- Original Message -----
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When a multilevel bmbt split occurs, we can be asked to allocate
> > blocks from an AG that has no space left available. In the case of
> > an extent just being allocated, the first bmbt block allocation sees
> > the firstblock parameter is set and does not set a minleft parameter
> > for the allocation. The allocation also does not set the total
> > number of blocks required by the allocation, either.
> > 
> > If the extent allocation used all the free space in the AG, the lack
> > of a total allocation size results in a block being reserved for the
> > AG freespace btree manipulations being used incorrectly.
> > 
> > Secondly, the allocation is only allowed to be allocated in the
> > current AG (NEAR_BNO) because the low space algorithm has not been
> > activated. As a result of this, the second block allocation in the
> > split fails to find sufficient space in the current AG and fails,
> > resulting in a dirty transaction cancellation and a filesytem
> > shutdown.
> > 
> > There are two problems here - both pointed out by Lachlan McIlroy
> > when we were first discussing a proposed fix for the problem (See
> > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly,
> > the missing args->total was causing the initial block to be
> > allocated from the freespace reserve. Secondly that the NEAR_BNO
> > allocation should probably be a START_BNO allocation to allow blocks
> > to be taken from a higher numbered AG.
> > 
> > With these changes, the allocation failure goes away, but we trigger
> > asserts in xfs_bmapi() that try to ensure that bmbt block
> > allocations only occur in the same AG as the extent allocated,
> > except if the low space algorithm is active. In this case, we don't
> > need to activate the low space algorithm - if a START_BNO allocation
> > fails with minleft = 0, then there is no space we can allocate at
> > all, and hence the low space algorithm is no help. Therefore the
> > asserts need modification reflect the new state of affairs.
> > 
> > This commit fixes the problem demonstrated by the test case in
> > xfstests 250.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> > fs/xfs/xfs_bmap.c | 8 ++------
> > fs/xfs/xfs_bmap_btree.c | 7 ++-----
> > 2 files changed, 4 insertions(+), 11 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> > index fa00788..50eed97 100644
> > --- a/fs/xfs/xfs_bmap.c
> > +++ b/fs/xfs/xfs_bmap.c
> > @@ -4917,13 +4917,9 @@ error0:
> > if (cur) {
> > if (!error) {
> > ASSERT(*firstblock == NULLFSBLOCK ||
> > - XFS_FSB_TO_AGNO(mp, *firstblock) ==
> > + XFS_FSB_TO_AGNO(mp, *firstblock) <=
> > XFS_FSB_TO_AGNO(mp,
> > - cur->bc_private.b.firstblock) ||
> > - (flist->xbf_low &&
> > - XFS_FSB_TO_AGNO(mp, *firstblock) <
> > - XFS_FSB_TO_AGNO(mp,
> > - cur->bc_private.b.firstblock)));
> > + cur->bc_private.b.firstblock));
> > *firstblock = cur->bc_private.b.firstblock;
> > }
> > xfs_btree_del_cursor(cur,
> 
> There's another ASSERT like this earlier in xfs_bmapi() after the call
> to xfs_bmap_alloc() which may need the same attention.

No, it checks the allocated extent against any existing firstblock
and has nothing to do with the the bmbt allocations during extent
insertion.
> 
> > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> > index 87d3c10..b2cdbfe 100644
> > --- a/fs/xfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/xfs_bmap_btree.c
> > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block(
> > args.mp = cur->bc_mp;
> > args.fsbno = cur->bc_private.b.firstblock;
> > args.firstblock = args.fsbno;
> > + args.total = 1;
> > + args.type = XFS_ALLOCTYPE_START_BNO;
> > 
> > if (args.fsbno == NULLFSBLOCK) {
> > args.fsbno = be64_to_cpu(start->l);
> > - args.type = XFS_ALLOCTYPE_START_BNO;
> > /*
> > * Make sure there is sufficient room left in the AG to
> > * complete a full tree split for an extent insert. If
> > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block(
> > * block allocation here and corrupt the filesystem.
> > */
> > args.minleft = xfs_trans_get_block_res(args.tp);
> > - } else if (cur->bc_private.b.flist->xbf_low) {
> > - args.type = XFS_ALLOCTYPE_START_BNO;
> > - } else {
> > - args.type = XFS_ALLOCTYPE_NEAR_BNO;
> > }
> > 
> > args.minlen = args.maxlen = args.prod = 1;
> 
> I think that's exactly what I had in mind when we looked at this last
> time but now I'm not so sure.  I think I understand now why there is a
> requirement that all subsequent allocations must (read 'can') come from
> the same AG.
> 
> The first time we try to allocate blocks we look for an AG with all the
> space we need for the reservation to succeed.

Of course, that explains the near bno allocation them.  It's
basically saying "we only ever have a reservation for a freespace
btree split in a single AG". Ok, so it seems we need to retain that,
and the problem of not being able to allocate in the current AG is
elsewhere.

Thanks for pointing that out to me, Lachlan - I should have realised
that is what it was for in the first place.

> If we don't find a single
> AG that will satisfy the full request then we use the low space algorithm
> and start from AG 0.  But if we do find such an AG with all the space we
> need it should be fine to use NEAR_BNO on subsequent allocations because
> the space is there, right?

But for the case we are dealing with here - bmbt insertion after an
allocation - we already have firstblock set and hence never traverse
the args.fsbno == NULLFSBLOCK case as the cursor firstblock is
always set to the block that has already been allocated. Hence we
never hit the low space algorithm unless it was triggered by the
extent allocation.

> If this logic fails to hold true then there are other problems we need to
> deal with including;
> 
> If we've already allocated a block from AG N then all subsequent
> allocations must come from AGs >= N to prevent locking issues.  If the
> blocks we need are in AGs < N we're in trouble.

Agreed - this is what minleft is trying to avoid from happening
ahead of the first allocations.

> We need to fall back to
> the low space algorithm and start from AG 0 but we can't do this if we 
> have already allocated from AG N.

Right, even the low space algorithm cannot avoid the need to avoid
lock inversions in the AGs.

> Also if we pass START_BNO to xfs_alloc_vextent() then it will fall
> through to the ANY/START/FIRST_AG case and set flags to TRYLOCK.  It

That's not a problem.

> will then scan AGs from N to the last AG and skip any that are already
> locked.  It may skip AGs that have the space we need and we may get to
> the last AG without finding all required blocks.

Sure, but....

> Then xfs_alloc_vextent() will loop through the AGs a second time if we
> didn't get all we needed the first time but this time without the TRYLOCK.
> This could result in locking AGs out of order.

It loops the second time from the start AG, not AG zero so we should
not be locking AGs out of order.

> So why is the extent allocation using all of the remaining blocks in the
> AG and not leaving any for the split?

That's the big question, isn't it? I'll come back to this.

> Are we not reserving the split
> space with the conversion transaction?

I don't think this is relevant - the problem is allocation, not
extent conversion.

> Is it because the space needed
> for splits is not accounted for when delayed allocations are reserved so
> we oversubscribe ourselves?

This is preallocation into a hole, so delalloc is not involved
here.

> Or are we not activating the low space
> algorithm early enough?

There's plenty of free space, just not in the AG _after_ the extent
allocation, and so there's no reason to trigger the low space
algorithm.

> I need to look at the code a bit more to understand all this.  The same
> allocation logic appears to be used in xfs_bmap_extents_to_btree(),
> xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's some
> logic in the logic.

I think they are all fine, now that I think I understand the reason
for the logic - that we only reserve enough blocks in the
transaction for freespace tree splits in a single AG. New blocks in
the bmbt can be triggered when:

	1. extent conversion splits a record into two (or three)
	records.
	2. extent removal split a record into two (punch a hole in
	the middle of an extent)
	3. extent allocation inserts a new record
	4. converting from extent to btree format

Note that the local to extent change is actually a data extent
allocation (same as xfs_bmap_btalloc), not a bmbt allocation.

So, in the cases of #1 and #2, it is unlikely that there has been a
previous allocation in the transaction, and so we take the first
case of args.fsbno == NULLFSBLOCK to chose an AG with enough space
for an entire bmbt split. Otherwise, it's the same case as #3 and
#4.

In the case of #3 and #4, the reason for the bmbt allocation is that
we have allocated a new extent, and as a result firstblock _will_ be
set. Hence we should have either space in the AG available for the
bmbt split _or_ the low space algorithm is active.

This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block
with the initial condition of firstblock set, not enough space in
the current AG and the low space algorithm not active. For the test
case (250), we still have lots of free space in the filesystem so we
should not be activating the low space algorithm. That leaves the
fact we are not leaving enough space in the AG for the bmbt split
after the extent is allocated. AFAICT, this code in xfs_bmapi()
handles it:

4452         if (wr && *firstblock == NULLFSBLOCK) {
4453                 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
4454                         minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
4455                 else
4456                         minleft = 1;
4457         } else
4458                 minleft = 0;

So it seems to me that in all cases the setting of minleft is
correct, as are the fallbacks that clear minleft then set xbf_low in
xfs_bmap_btalloc(). So I don't think the bug actually resides at the
xfs-bmapi/bmbt level. The question I'm now trying to answer is how
we can allocate the entire AG if minleft is actually set.

The answer, it appears, is in xfs_alloc_fix_minleft().  If we've
allocated the largest extent possible, we'll have a situation where
the only free space remaining is on the freelist.  To use numbers
relevant to the test case:

	agsize = 4096 blocks (16MB)
	max ext len = 4096 - 8 blocks
		    = 4088 blocks
	AGF freeblks = 4088
	AGF flcount = 4
	minleft = 2;

	extent allocation length = 4088

	diff = 4088 + 4 - 4088 - 2
	     = 2

And because diff >= 0, xfs_alloc_fix_minleft() says that the extent
length is OK and does not trim it. Then when we go to allocate bmbt
blocks, args->total is not set so the first block is allocated from
the free list, and the second then fails at the
xfs_alloc_fix_freelist checks because we don't have space available
in the AG....

IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft()
assumes that the AGFL blocks are part of the space available for
the allocation. This, it seems, is wrong. If minleft is set, then we
are not allowing the allocation to dip into any reserves, so we
should not be considering the freelist blocks as available for
minleft reservations.

So, the patch below fixes the test 250 assert failure as well, and
to me seems much more likely as the root cause of the bug.

What do you think, Lachlan?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


---
 fs/xfs/xfs_alloc.c      |    1 -
 fs/xfs/xfs_bmap_btree.c |    1 +
 2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
index 27d64d7..8946464 100644
--- a/fs/xfs/xfs_alloc.c
+++ b/fs/xfs/xfs_alloc.c
@@ -280,7 +280,6 @@ xfs_alloc_fix_minleft(
 		return 1;
 	agf = XFS_BUF_TO_AGF(args->agbp);
 	diff = be32_to_cpu(agf->agf_freeblks)
-		+ be32_to_cpu(agf->agf_flcount)
 		- args->len - args->minleft;
 	if (diff >= 0)
 		return 1;
diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
index 87d3c10..470047e 100644
--- a/fs/xfs/xfs_bmap_btree.c
+++ b/fs/xfs/xfs_bmap_btree.c
@@ -518,6 +518,7 @@ xfs_bmbt_alloc_block(
 	args.mp = cur->bc_mp;
 	args.fsbno = cur->bc_private.b.firstblock;
 	args.firstblock = args.fsbno;
+	args.total = 1;
 
 	if (args.fsbno == NULLFSBLOCK) {
 		args.fsbno = be64_to_cpu(start->l);

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-12  6:53   ` Dave Chinner
@ 2011-04-12  8:53     ` Dave Chinner
  2011-04-15  7:29     ` Lachlan McIlroy
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2011-04-12  8:53 UTC (permalink / raw)
  To: Lachlan McIlroy; +Cc: xfs

On Tue, Apr 12, 2011 at 04:53:28PM +1000, Dave Chinner wrote:
> On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote:
> > ----- Original Message -----
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When a multilevel bmbt split occurs, we can be asked to allocate
> > > blocks from an AG that has no space left available. In the case of
> > > an extent just being allocated, the first bmbt block allocation sees
> > > the firstblock parameter is set and does not set a minleft parameter
> > > for the allocation. The allocation also does not set the total
> > > number of blocks required by the allocation, either.
...
> So, the patch below fixes the test 250 assert failure as well, and
> to me seems much more likely as the root cause of the bug.

FWIW, test 250 is showing up another three bugs, all unrelated to
the bug it written to exercise. Two are mkfs bugs - the first being
that mkfs is terminated due to freeing an invalid pointer. The
second being that it is leaving behind a corrupted freespace btree:

_check_xfs_filesystem: filesystem on /mnt/test/250.fs is inconsistent
*** xfs_repair -n output ***
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
invalid start block 4096 in record 0 of 4635043 btree block 1600/1
invalid start block 4096 in record 0 of 4635039 btree block 1600/2
        - found root inode chunk
Phase 3 - for each AG...
....

xfs_db> convert agno 1600 agbno 1 fsb
0x640001 (6553601)
xfs_db> fsb 0x640001
xfs_db> type bnobt
xfs_db> p
magic = 0x41425442
level = 0
numrecs = 1
leftsib = null
rightsib = null
recs[1] = [startblock,blockcount] 1:[4096,0]
xfs_db> fsb 0x640002
xfs_db> type cntbt
xfs_db> p
magic = 0x41425443
level = 0
numrecs = 1
leftsib = null
rightsib = null
recs[1] = [startblock,blockcount] 1:[4096,0]
xfs_db> 

So in AG 1600, there is a freespace record at block 4096 for length
zero. Both are incorrect - the AG size is only 4096 blocks. Worth
noting:

xfs_db> agf 1600
xfs_db> p
magicnum = 0x58414746
versionnum = 1
seqno = 1600
length = 4096
bnoroot = 1
cntroot = 2
bnolevel = 1
cntlevel = 1
flfirst = 0
fllast = 127
flcount = 0
freeblks = 0
longest = 0
btreeblks = 0

All the freelist blocks have been consumed, which should not happen
- there should be 4 freelist blocks when the AG is empty. Looking a
bit deeper, AG 1600 doesn't contain any data blocks - it contains
the log. The log is allocated by mkfs, and is 4092 blocks in length.
Ї just confirmed that the repair failure occurs on a freshly made
FS, so this is definitely a mkfs bug that hasn't been noticed
because the test hasn't been running to completion and checking the
fs....

And the third bug is in repair:

invalid start block 4096 in record 0 of 4635043 btree block 1600/1
invalid start block 4096 in record 0 of 4635039 btree block 1600/2
                                        ^^^^^^^

These are supposed to say "bno" or "cnt", but a %d instead of a %s
is incorrectly used in format string so it gives a wacky result.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] xfs: fix bmbt block allocation failures
  2011-04-12  6:53   ` Dave Chinner
  2011-04-12  8:53     ` Dave Chinner
@ 2011-04-15  7:29     ` Lachlan McIlroy
  1 sibling, 0 replies; 7+ messages in thread
From: Lachlan McIlroy @ 2011-04-15  7:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

----- Original Message -----
> On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote:
> > ----- Original Message -----
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > When a multilevel bmbt split occurs, we can be asked to allocate
> > > blocks from an AG that has no space left available. In the case of
> > > an extent just being allocated, the first bmbt block allocation
> > > sees
> > > the firstblock parameter is set and does not set a minleft
> > > parameter
> > > for the allocation. The allocation also does not set the total
> > > number of blocks required by the allocation, either.
> > >
> > > If the extent allocation used all the free space in the AG, the
> > > lack
> > > of a total allocation size results in a block being reserved for
> > > the
> > > AG freespace btree manipulations being used incorrectly.
> > >
> > > Secondly, the allocation is only allowed to be allocated in the
> > > current AG (NEAR_BNO) because the low space algorithm has not been
> > > activated. As a result of this, the second block allocation in the
> > > split fails to find sufficient space in the current AG and fails,
> > > resulting in a dirty transaction cancellation and a filesytem
> > > shutdown.
> > >
> > > There are two problems here - both pointed out by Lachlan McIlroy
> > > when we were first discussing a proposed fix for the problem (See
> > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly,
> > > the missing args->total was causing the initial block to be
> > > allocated from the freespace reserve. Secondly that the NEAR_BNO
> > > allocation should probably be a START_BNO allocation to allow
> > > blocks
> > > to be taken from a higher numbered AG.
> > >
> > > With these changes, the allocation failure goes away, but we
> > > trigger
> > > asserts in xfs_bmapi() that try to ensure that bmbt block
> > > allocations only occur in the same AG as the extent allocated,
> > > except if the low space algorithm is active. In this case, we
> > > don't
> > > need to activate the low space algorithm - if a START_BNO
> > > allocation
> > > fails with minleft = 0, then there is no space we can allocate at
> > > all, and hence the low space algorithm is no help. Therefore the
> > > asserts need modification reflect the new state of affairs.
> > >
> > > This commit fixes the problem demonstrated by the test case in
> > > xfstests 250.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > > fs/xfs/xfs_bmap.c | 8 ++------
> > > fs/xfs/xfs_bmap_btree.c | 7 ++-----
> > > 2 files changed, 4 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> > > index fa00788..50eed97 100644
> > > --- a/fs/xfs/xfs_bmap.c
> > > +++ b/fs/xfs/xfs_bmap.c
> > > @@ -4917,13 +4917,9 @@ error0:
> > > if (cur) {
> > > if (!error) {
> > > ASSERT(*firstblock == NULLFSBLOCK ||
> > > - XFS_FSB_TO_AGNO(mp, *firstblock) ==
> > > + XFS_FSB_TO_AGNO(mp, *firstblock) <=
> > > XFS_FSB_TO_AGNO(mp,
> > > - cur->bc_private.b.firstblock) ||
> > > - (flist->xbf_low &&
> > > - XFS_FSB_TO_AGNO(mp, *firstblock) <
> > > - XFS_FSB_TO_AGNO(mp,
> > > - cur->bc_private.b.firstblock)));
> > > + cur->bc_private.b.firstblock));
> > > *firstblock = cur->bc_private.b.firstblock;
> > > }
> > > xfs_btree_del_cursor(cur,
> >
> > There's another ASSERT like this earlier in xfs_bmapi() after the
> > call
> > to xfs_bmap_alloc() which may need the same attention.
> 
> No, it checks the allocated extent against any existing firstblock
> and has nothing to do with the the bmbt allocations during extent
> insertion.
> >
> > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> > > index 87d3c10..b2cdbfe 100644
> > > --- a/fs/xfs/xfs_bmap_btree.c
> > > +++ b/fs/xfs/xfs_bmap_btree.c
> > > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block(
> > > args.mp = cur->bc_mp;
> > > args.fsbno = cur->bc_private.b.firstblock;
> > > args.firstblock = args.fsbno;
> > > + args.total = 1;
> > > + args.type = XFS_ALLOCTYPE_START_BNO;
> > >
> > > if (args.fsbno == NULLFSBLOCK) {
> > > args.fsbno = be64_to_cpu(start->l);
> > > - args.type = XFS_ALLOCTYPE_START_BNO;
> > > /*
> > > * Make sure there is sufficient room left in the AG to
> > > * complete a full tree split for an extent insert. If
> > > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block(
> > > * block allocation here and corrupt the filesystem.
> > > */
> > > args.minleft = xfs_trans_get_block_res(args.tp);
> > > - } else if (cur->bc_private.b.flist->xbf_low) {
> > > - args.type = XFS_ALLOCTYPE_START_BNO;
> > > - } else {
> > > - args.type = XFS_ALLOCTYPE_NEAR_BNO;
> > > }
> > >
> > > args.minlen = args.maxlen = args.prod = 1;
> >
> > I think that's exactly what I had in mind when we looked at this
> > last
> > time but now I'm not so sure. I think I understand now why there is
> > a
> > requirement that all subsequent allocations must (read 'can') come
> > from
> > the same AG.
> >
> > The first time we try to allocate blocks we look for an AG with all
> > the
> > space we need for the reservation to succeed.
> 
> Of course, that explains the near bno allocation them. It's
> basically saying "we only ever have a reservation for a freespace
> btree split in a single AG". Ok, so it seems we need to retain that,
> and the problem of not being able to allocate in the current AG is
> elsewhere.
> 
> Thanks for pointing that out to me, Lachlan - I should have realised
> that is what it was for in the first place.
> 
> > If we don't find a single
> > AG that will satisfy the full request then we use the low space
> > algorithm
> > and start from AG 0. But if we do find such an AG with all the space
> > we
> > need it should be fine to use NEAR_BNO on subsequent allocations
> > because
> > the space is there, right?
> 
> But for the case we are dealing with here - bmbt insertion after an
> allocation - we already have firstblock set and hence never traverse
> the args.fsbno == NULLFSBLOCK case as the cursor firstblock is
> always set to the block that has already been allocated. Hence we
> never hit the low space algorithm unless it was triggered by the
> extent allocation.
> 
> > If this logic fails to hold true then there are other problems we
> > need to
> > deal with including;
> >
> > If we've already allocated a block from AG N then all subsequent
> > allocations must come from AGs >= N to prevent locking issues. If
> > the
> > blocks we need are in AGs < N we're in trouble.
> 
> Agreed - this is what minleft is trying to avoid from happening
> ahead of the first allocations.
> 
> > We need to fall back to
> > the low space algorithm and start from AG 0 but we can't do this if
> > we
> > have already allocated from AG N.
> 
> Right, even the low space algorithm cannot avoid the need to avoid
> lock inversions in the AGs.
> 
> > Also if we pass START_BNO to xfs_alloc_vextent() then it will fall
> > through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It
> 
> That's not a problem.
> 
> > will then scan AGs from N to the last AG and skip any that are
> > already
> > locked. It may skip AGs that have the space we need and we may get
> > to
> > the last AG without finding all required blocks.
> 
> Sure, but....
> 
> > Then xfs_alloc_vextent() will loop through the AGs a second time if
> > we
> > didn't get all we needed the first time but this time without the
> > TRYLOCK.
> > This could result in locking AGs out of order.
> 
> It loops the second time from the start AG, not AG zero so we should
> not be locking AGs out of order.
> 
> > So why is the extent allocation using all of the remaining blocks in
> > the
> > AG and not leaving any for the split?
> 
> That's the big question, isn't it? I'll come back to this.
> 
> > Are we not reserving the split
> > space with the conversion transaction?
> 
> I don't think this is relevant - the problem is allocation, not
> extent conversion.
> 
> > Is it because the space needed
> > for splits is not accounted for when delayed allocations are
> > reserved so
> > we oversubscribe ourselves?
> 
> This is preallocation into a hole, so delalloc is not involved
> here.
> 
> > Or are we not activating the low space
> > algorithm early enough?
> 
> There's plenty of free space, just not in the AG _after_ the extent
> allocation, and so there's no reason to trigger the low space
> algorithm.
> 
> > I need to look at the code a bit more to understand all this. The
> > same
> > allocation logic appears to be used in xfs_bmap_extents_to_btree(),
> > xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's
> > some
> > logic in the logic.
> 
> I think they are all fine, now that I think I understand the reason
> for the logic - that we only reserve enough blocks in the
> transaction for freespace tree splits in a single AG. New blocks in
> the bmbt can be triggered when:
> 
> 1. extent conversion splits a record into two (or three)
> records.
> 2. extent removal split a record into two (punch a hole in
> the middle of an extent)
> 3. extent allocation inserts a new record
> 4. converting from extent to btree format
> 
> Note that the local to extent change is actually a data extent
> allocation (same as xfs_bmap_btalloc), not a bmbt allocation.
> 
> So, in the cases of #1 and #2, it is unlikely that there has been a
> previous allocation in the transaction, and so we take the first
> case of args.fsbno == NULLFSBLOCK to chose an AG with enough space
> for an entire bmbt split. Otherwise, it's the same case as #3 and
> #4.
> 
> In the case of #3 and #4, the reason for the bmbt allocation is that
> we have allocated a new extent, and as a result firstblock _will_ be
> set. Hence we should have either space in the AG available for the
> bmbt split _or_ the low space algorithm is active.
> 
> This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block
> with the initial condition of firstblock set, not enough space in
> the current AG and the low space algorithm not active. For the test
> case (250), we still have lots of free space in the filesystem so we
> should not be activating the low space algorithm. That leaves the
> fact we are not leaving enough space in the AG for the bmbt split
> after the extent is allocated. AFAICT, this code in xfs_bmapi()
> handles it:
> 
> 4452 if (wr && *firstblock == NULLFSBLOCK) {
> 4453 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE)
> 4454 minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1;
> 4455 else
> 4456 minleft = 1;
> 4457 } else
> 4458 minleft = 0;
> 
> So it seems to me that in all cases the setting of minleft is
> correct, as are the fallbacks that clear minleft then set xbf_low in
> xfs_bmap_btalloc(). So I don't think the bug actually resides at the
> xfs-bmapi/bmbt level. The question I'm now trying to answer is how
> we can allocate the entire AG if minleft is actually set.
> 
> The answer, it appears, is in xfs_alloc_fix_minleft(). If we've
> allocated the largest extent possible, we'll have a situation where
> the only free space remaining is on the freelist. To use numbers
> relevant to the test case:
> 
> agsize = 4096 blocks (16MB)
> max ext len = 4096 - 8 blocks
> = 4088 blocks
> AGF freeblks = 4088
> AGF flcount = 4
> minleft = 2;
> 
> extent allocation length = 4088
> 
> diff = 4088 + 4 - 4088 - 2
> = 2
> 
> And because diff >= 0, xfs_alloc_fix_minleft() says that the extent
> length is OK and does not trim it. Then when we go to allocate bmbt
> blocks, args->total is not set so the first block is allocated from
> the free list, and the second then fails at the
> xfs_alloc_fix_freelist checks because we don't have space available
> in the AG....
> 
> IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft()
> assumes that the AGFL blocks are part of the space available for
> the allocation. This, it seems, is wrong. If minleft is set, then we
> are not allowing the allocation to dip into any reserves, so we
> should not be considering the freelist blocks as available for
> minleft reservations.
> 
> So, the patch below fixes the test 250 assert failure as well, and
> to me seems much more likely as the root cause of the bug.
> 
> What do you think, Lachlan?

I think you're onto something Dave and it makes a lot of sense.  It
would appear that the checks in xfs_alloc_fix_freelist() really only
apply when allocating extents and maybe should be skipped when
allocating the minleft blocks.  Oh, unless the low space algorithm is
being used.  I think args.total represents the full extent allocation
request and doesn't apply to minleft allocations so I don't think we
need that change to xfs_bmbt_alloc_block().  Is just the fix to
xfs_alloc_fix_minleft() enough here?

> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> 
> 
> ---
> fs/xfs/xfs_alloc.c | 1 -
> fs/xfs/xfs_bmap_btree.c | 1 +
> 2 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c
> index 27d64d7..8946464 100644
> --- a/fs/xfs/xfs_alloc.c
> +++ b/fs/xfs/xfs_alloc.c
> @@ -280,7 +280,6 @@ xfs_alloc_fix_minleft(
> return 1;
> agf = XFS_BUF_TO_AGF(args->agbp);
> diff = be32_to_cpu(agf->agf_freeblks)
> - + be32_to_cpu(agf->agf_flcount)
> - args->len - args->minleft;
> if (diff >= 0)
> return 1;
> diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c
> index 87d3c10..470047e 100644
> --- a/fs/xfs/xfs_bmap_btree.c
> +++ b/fs/xfs/xfs_bmap_btree.c
> @@ -518,6 +518,7 @@ xfs_bmbt_alloc_block(
> args.mp = cur->bc_mp;
> args.fsbno = cur->bc_private.b.firstblock;
> args.firstblock = args.fsbno;
> + args.total = 1;
> 
> if (args.fsbno == NULLFSBLOCK) {
> args.fsbno = be64_to_cpu(start->l);
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2011-04-15  7:25 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-11  2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner
2011-04-11 23:41 ` Christoph Hellwig
2011-04-11 23:56   ` Dave Chinner
2011-04-12  0:55 ` Lachlan McIlroy
2011-04-12  6:53   ` Dave Chinner
2011-04-12  8:53     ` Dave Chinner
2011-04-15  7:29     ` Lachlan McIlroy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox