* [PATCH] xfs: fix bmbt block allocation failures @ 2011-04-11 2:34 Dave Chinner 2011-04-11 23:41 ` Christoph Hellwig 2011-04-12 0:55 ` Lachlan McIlroy 0 siblings, 2 replies; 7+ messages in thread From: Dave Chinner @ 2011-04-11 2:34 UTC (permalink / raw) To: xfs From: Dave Chinner <dchinner@redhat.com> When a multilevel bmbt split occurs, we can be asked to allocate blocks from an AG that has no space left available. In the case of an extent just being allocated, the first bmbt block allocation sees the firstblock parameter is set and does not set a minleft parameter for the allocation. The allocation also does not set the total number of blocks required by the allocation, either. If the extent allocation used all the free space in the AG, the lack of a total allocation size results in a block being reserved for the AG freespace btree manipulations being used incorrectly. Secondly, the allocation is only allowed to be allocated in the current AG (NEAR_BNO) because the low space algorithm has not been activated. As a result of this, the second block allocation in the split fails to find sufficient space in the current AG and fails, resulting in a dirty transaction cancellation and a filesytem shutdown. There are two problems here - both pointed out by Lachlan McIlroy when we were first discussing a proposed fix for the problem (See http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, the missing args->total was causing the initial block to be allocated from the freespace reserve. Secondly that the NEAR_BNO allocation should probably be a START_BNO allocation to allow blocks to be taken from a higher numbered AG. With these changes, the allocation failure goes away, but we trigger asserts in xfs_bmapi() that try to ensure that bmbt block allocations only occur in the same AG as the extent allocated, except if the low space algorithm is active. In this case, we don't need to activate the low space algorithm - if a START_BNO allocation fails with minleft = 0, then there is no space we can allocate at all, and hence the low space algorithm is no help. Therefore the asserts need modification reflect the new state of affairs. This commit fixes the problem demonstrated by the test case in xfstests 250. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bmap.c | 8 ++------ fs/xfs/xfs_bmap_btree.c | 7 ++----- 2 files changed, 4 insertions(+), 11 deletions(-) diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c index fa00788..50eed97 100644 --- a/fs/xfs/xfs_bmap.c +++ b/fs/xfs/xfs_bmap.c @@ -4917,13 +4917,9 @@ error0: if (cur) { if (!error) { ASSERT(*firstblock == NULLFSBLOCK || - XFS_FSB_TO_AGNO(mp, *firstblock) == + XFS_FSB_TO_AGNO(mp, *firstblock) <= XFS_FSB_TO_AGNO(mp, - cur->bc_private.b.firstblock) || - (flist->xbf_low && - XFS_FSB_TO_AGNO(mp, *firstblock) < - XFS_FSB_TO_AGNO(mp, - cur->bc_private.b.firstblock))); + cur->bc_private.b.firstblock)); *firstblock = cur->bc_private.b.firstblock; } xfs_btree_del_cursor(cur, diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c index 87d3c10..b2cdbfe 100644 --- a/fs/xfs/xfs_bmap_btree.c +++ b/fs/xfs/xfs_bmap_btree.c @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block( args.mp = cur->bc_mp; args.fsbno = cur->bc_private.b.firstblock; args.firstblock = args.fsbno; + args.total = 1; + args.type = XFS_ALLOCTYPE_START_BNO; if (args.fsbno == NULLFSBLOCK) { args.fsbno = be64_to_cpu(start->l); - args.type = XFS_ALLOCTYPE_START_BNO; /* * Make sure there is sufficient room left in the AG to * complete a full tree split for an extent insert. If @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block( * block allocation here and corrupt the filesystem. */ args.minleft = xfs_trans_get_block_res(args.tp); - } else if (cur->bc_private.b.flist->xbf_low) { - args.type = XFS_ALLOCTYPE_START_BNO; - } else { - args.type = XFS_ALLOCTYPE_NEAR_BNO; } args.minlen = args.maxlen = args.prod = 1; -- 1.7.2.3 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-11 2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner @ 2011-04-11 23:41 ` Christoph Hellwig 2011-04-11 23:56 ` Dave Chinner 2011-04-12 0:55 ` Lachlan McIlroy 1 sibling, 1 reply; 7+ messages in thread From: Christoph Hellwig @ 2011-04-11 23:41 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs On Mon, Apr 11, 2011 at 12:34:11PM +1000, Dave Chinner wrote: > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > the missing args->total was causing the initial block to be > allocated from the freespace reserve. Secondly that the NEAR_BNO > allocation should probably be a START_BNO allocation to allow blocks > to be taken from a higher numbered AG. Looking over the use of ->total, isn't the ialloc btree code also incorrect by setting ->total to 0? The actual change looks good to me, Reviewed-by: Christoph Hellwig <hch@lst.de> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-11 23:41 ` Christoph Hellwig @ 2011-04-11 23:56 ` Dave Chinner 0 siblings, 0 replies; 7+ messages in thread From: Dave Chinner @ 2011-04-11 23:56 UTC (permalink / raw) To: Christoph Hellwig; +Cc: xfs On Mon, Apr 11, 2011 at 07:41:58PM -0400, Christoph Hellwig wrote: > On Mon, Apr 11, 2011 at 12:34:11PM +1000, Dave Chinner wrote: > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > > the missing args->total was causing the initial block to be > > allocated from the freespace reserve. Secondly that the NEAR_BNO > > allocation should probably be a START_BNO allocation to allow blocks > > to be taken from a higher numbered AG. > > Looking over the use of ->total, isn't the ialloc btree code also > incorrect by setting ->total to 0? It probably is - I hadn't got that far yet. I wanted to get this tested and reviewed so test 250 would no longer assert fail everyone's debug build.... > The actual change looks good to me, > > Reviewed-by: Christoph Hellwig <hch@lst.de> Thanks. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-11 2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner 2011-04-11 23:41 ` Christoph Hellwig @ 2011-04-12 0:55 ` Lachlan McIlroy 2011-04-12 6:53 ` Dave Chinner 1 sibling, 1 reply; 7+ messages in thread From: Lachlan McIlroy @ 2011-04-12 0:55 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs ----- Original Message ----- > From: Dave Chinner <dchinner@redhat.com> > > When a multilevel bmbt split occurs, we can be asked to allocate > blocks from an AG that has no space left available. In the case of > an extent just being allocated, the first bmbt block allocation sees > the firstblock parameter is set and does not set a minleft parameter > for the allocation. The allocation also does not set the total > number of blocks required by the allocation, either. > > If the extent allocation used all the free space in the AG, the lack > of a total allocation size results in a block being reserved for the > AG freespace btree manipulations being used incorrectly. > > Secondly, the allocation is only allowed to be allocated in the > current AG (NEAR_BNO) because the low space algorithm has not been > activated. As a result of this, the second block allocation in the > split fails to find sufficient space in the current AG and fails, > resulting in a dirty transaction cancellation and a filesytem > shutdown. > > There are two problems here - both pointed out by Lachlan McIlroy > when we were first discussing a proposed fix for the problem (See > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > the missing args->total was causing the initial block to be > allocated from the freespace reserve. Secondly that the NEAR_BNO > allocation should probably be a START_BNO allocation to allow blocks > to be taken from a higher numbered AG. > > With these changes, the allocation failure goes away, but we trigger > asserts in xfs_bmapi() that try to ensure that bmbt block > allocations only occur in the same AG as the extent allocated, > except if the low space algorithm is active. In this case, we don't > need to activate the low space algorithm - if a START_BNO allocation > fails with minleft = 0, then there is no space we can allocate at > all, and hence the low space algorithm is no help. Therefore the > asserts need modification reflect the new state of affairs. > > This commit fixes the problem demonstrated by the test case in > xfstests 250. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_bmap.c | 8 ++------ > fs/xfs/xfs_bmap_btree.c | 7 ++----- > 2 files changed, 4 insertions(+), 11 deletions(-) > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > index fa00788..50eed97 100644 > --- a/fs/xfs/xfs_bmap.c > +++ b/fs/xfs/xfs_bmap.c > @@ -4917,13 +4917,9 @@ error0: > if (cur) { > if (!error) { > ASSERT(*firstblock == NULLFSBLOCK || > - XFS_FSB_TO_AGNO(mp, *firstblock) == > + XFS_FSB_TO_AGNO(mp, *firstblock) <= > XFS_FSB_TO_AGNO(mp, > - cur->bc_private.b.firstblock) || > - (flist->xbf_low && > - XFS_FSB_TO_AGNO(mp, *firstblock) < > - XFS_FSB_TO_AGNO(mp, > - cur->bc_private.b.firstblock))); > + cur->bc_private.b.firstblock)); > *firstblock = cur->bc_private.b.firstblock; > } > xfs_btree_del_cursor(cur, There's another ASSERT like this earlier in xfs_bmapi() after the call to xfs_bmap_alloc() which may need the same attention. > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > index 87d3c10..b2cdbfe 100644 > --- a/fs/xfs/xfs_bmap_btree.c > +++ b/fs/xfs/xfs_bmap_btree.c > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block( > args.mp = cur->bc_mp; > args.fsbno = cur->bc_private.b.firstblock; > args.firstblock = args.fsbno; > + args.total = 1; > + args.type = XFS_ALLOCTYPE_START_BNO; > > if (args.fsbno == NULLFSBLOCK) { > args.fsbno = be64_to_cpu(start->l); > - args.type = XFS_ALLOCTYPE_START_BNO; > /* > * Make sure there is sufficient room left in the AG to > * complete a full tree split for an extent insert. If > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block( > * block allocation here and corrupt the filesystem. > */ > args.minleft = xfs_trans_get_block_res(args.tp); > - } else if (cur->bc_private.b.flist->xbf_low) { > - args.type = XFS_ALLOCTYPE_START_BNO; > - } else { > - args.type = XFS_ALLOCTYPE_NEAR_BNO; > } > > args.minlen = args.maxlen = args.prod = 1; I think that's exactly what I had in mind when we looked at this last time but now I'm not so sure. I think I understand now why there is a requirement that all subsequent allocations must (read 'can') come from the same AG. The first time we try to allocate blocks we look for an AG with all the space we need for the reservation to succeed. If we don't find a single AG that will satisfy the full request then we use the low space algorithm and start from AG 0. But if we do find such an AG with all the space we need it should be fine to use NEAR_BNO on subsequent allocations because the space is there, right? If this logic fails to hold true then there are other problems we need to deal with including; If we've already allocated a block from AG N then all subsequent allocations must come from AGs >= N to prevent locking issues. If the blocks we need are in AGs < N we're in trouble. We need to fall back to the low space algorithm and start from AG 0 but we can't do this if we have already allocated from AG N. Also if we pass START_BNO to xfs_alloc_vextent() then it will fall through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It will then scan AGs from N to the last AG and skip any that are already locked. It may skip AGs that have the space we need and we may get to the last AG without finding all required blocks. Then xfs_alloc_vextent() will loop through the AGs a second time if we didn't get all we needed the first time but this time without the TRYLOCK. This could result in locking AGs out of order. So why is the extent allocation using all of the remaining blocks in the AG and not leaving any for the split? Are we not reserving the split space with the conversion transaction? Is it because the space needed for splits is not accounted for when delayed allocations are reserved so we oversubscribe ourselves? Or are we not activating the low space algorithm early enough? I need to look at the code a bit more to understand all this. The same allocation logic appears to be used in xfs_bmap_extents_to_btree(), xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's some logic in the logic. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-12 0:55 ` Lachlan McIlroy @ 2011-04-12 6:53 ` Dave Chinner 2011-04-12 8:53 ` Dave Chinner 2011-04-15 7:29 ` Lachlan McIlroy 0 siblings, 2 replies; 7+ messages in thread From: Dave Chinner @ 2011-04-12 6:53 UTC (permalink / raw) To: Lachlan McIlroy; +Cc: xfs On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote: > ----- Original Message ----- > > From: Dave Chinner <dchinner@redhat.com> > > > > When a multilevel bmbt split occurs, we can be asked to allocate > > blocks from an AG that has no space left available. In the case of > > an extent just being allocated, the first bmbt block allocation sees > > the firstblock parameter is set and does not set a minleft parameter > > for the allocation. The allocation also does not set the total > > number of blocks required by the allocation, either. > > > > If the extent allocation used all the free space in the AG, the lack > > of a total allocation size results in a block being reserved for the > > AG freespace btree manipulations being used incorrectly. > > > > Secondly, the allocation is only allowed to be allocated in the > > current AG (NEAR_BNO) because the low space algorithm has not been > > activated. As a result of this, the second block allocation in the > > split fails to find sufficient space in the current AG and fails, > > resulting in a dirty transaction cancellation and a filesytem > > shutdown. > > > > There are two problems here - both pointed out by Lachlan McIlroy > > when we were first discussing a proposed fix for the problem (See > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > > the missing args->total was causing the initial block to be > > allocated from the freespace reserve. Secondly that the NEAR_BNO > > allocation should probably be a START_BNO allocation to allow blocks > > to be taken from a higher numbered AG. > > > > With these changes, the allocation failure goes away, but we trigger > > asserts in xfs_bmapi() that try to ensure that bmbt block > > allocations only occur in the same AG as the extent allocated, > > except if the low space algorithm is active. In this case, we don't > > need to activate the low space algorithm - if a START_BNO allocation > > fails with minleft = 0, then there is no space we can allocate at > > all, and hence the low space algorithm is no help. Therefore the > > asserts need modification reflect the new state of affairs. > > > > This commit fixes the problem demonstrated by the test case in > > xfstests 250. > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > fs/xfs/xfs_bmap.c | 8 ++------ > > fs/xfs/xfs_bmap_btree.c | 7 ++----- > > 2 files changed, 4 insertions(+), 11 deletions(-) > > > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > > index fa00788..50eed97 100644 > > --- a/fs/xfs/xfs_bmap.c > > +++ b/fs/xfs/xfs_bmap.c > > @@ -4917,13 +4917,9 @@ error0: > > if (cur) { > > if (!error) { > > ASSERT(*firstblock == NULLFSBLOCK || > > - XFS_FSB_TO_AGNO(mp, *firstblock) == > > + XFS_FSB_TO_AGNO(mp, *firstblock) <= > > XFS_FSB_TO_AGNO(mp, > > - cur->bc_private.b.firstblock) || > > - (flist->xbf_low && > > - XFS_FSB_TO_AGNO(mp, *firstblock) < > > - XFS_FSB_TO_AGNO(mp, > > - cur->bc_private.b.firstblock))); > > + cur->bc_private.b.firstblock)); > > *firstblock = cur->bc_private.b.firstblock; > > } > > xfs_btree_del_cursor(cur, > > There's another ASSERT like this earlier in xfs_bmapi() after the call > to xfs_bmap_alloc() which may need the same attention. No, it checks the allocated extent against any existing firstblock and has nothing to do with the the bmbt allocations during extent insertion. > > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > > index 87d3c10..b2cdbfe 100644 > > --- a/fs/xfs/xfs_bmap_btree.c > > +++ b/fs/xfs/xfs_bmap_btree.c > > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block( > > args.mp = cur->bc_mp; > > args.fsbno = cur->bc_private.b.firstblock; > > args.firstblock = args.fsbno; > > + args.total = 1; > > + args.type = XFS_ALLOCTYPE_START_BNO; > > > > if (args.fsbno == NULLFSBLOCK) { > > args.fsbno = be64_to_cpu(start->l); > > - args.type = XFS_ALLOCTYPE_START_BNO; > > /* > > * Make sure there is sufficient room left in the AG to > > * complete a full tree split for an extent insert. If > > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block( > > * block allocation here and corrupt the filesystem. > > */ > > args.minleft = xfs_trans_get_block_res(args.tp); > > - } else if (cur->bc_private.b.flist->xbf_low) { > > - args.type = XFS_ALLOCTYPE_START_BNO; > > - } else { > > - args.type = XFS_ALLOCTYPE_NEAR_BNO; > > } > > > > args.minlen = args.maxlen = args.prod = 1; > > I think that's exactly what I had in mind when we looked at this last > time but now I'm not so sure. I think I understand now why there is a > requirement that all subsequent allocations must (read 'can') come from > the same AG. > > The first time we try to allocate blocks we look for an AG with all the > space we need for the reservation to succeed. Of course, that explains the near bno allocation them. It's basically saying "we only ever have a reservation for a freespace btree split in a single AG". Ok, so it seems we need to retain that, and the problem of not being able to allocate in the current AG is elsewhere. Thanks for pointing that out to me, Lachlan - I should have realised that is what it was for in the first place. > If we don't find a single > AG that will satisfy the full request then we use the low space algorithm > and start from AG 0. But if we do find such an AG with all the space we > need it should be fine to use NEAR_BNO on subsequent allocations because > the space is there, right? But for the case we are dealing with here - bmbt insertion after an allocation - we already have firstblock set and hence never traverse the args.fsbno == NULLFSBLOCK case as the cursor firstblock is always set to the block that has already been allocated. Hence we never hit the low space algorithm unless it was triggered by the extent allocation. > If this logic fails to hold true then there are other problems we need to > deal with including; > > If we've already allocated a block from AG N then all subsequent > allocations must come from AGs >= N to prevent locking issues. If the > blocks we need are in AGs < N we're in trouble. Agreed - this is what minleft is trying to avoid from happening ahead of the first allocations. > We need to fall back to > the low space algorithm and start from AG 0 but we can't do this if we > have already allocated from AG N. Right, even the low space algorithm cannot avoid the need to avoid lock inversions in the AGs. > Also if we pass START_BNO to xfs_alloc_vextent() then it will fall > through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It That's not a problem. > will then scan AGs from N to the last AG and skip any that are already > locked. It may skip AGs that have the space we need and we may get to > the last AG without finding all required blocks. Sure, but.... > Then xfs_alloc_vextent() will loop through the AGs a second time if we > didn't get all we needed the first time but this time without the TRYLOCK. > This could result in locking AGs out of order. It loops the second time from the start AG, not AG zero so we should not be locking AGs out of order. > So why is the extent allocation using all of the remaining blocks in the > AG and not leaving any for the split? That's the big question, isn't it? I'll come back to this. > Are we not reserving the split > space with the conversion transaction? I don't think this is relevant - the problem is allocation, not extent conversion. > Is it because the space needed > for splits is not accounted for when delayed allocations are reserved so > we oversubscribe ourselves? This is preallocation into a hole, so delalloc is not involved here. > Or are we not activating the low space > algorithm early enough? There's plenty of free space, just not in the AG _after_ the extent allocation, and so there's no reason to trigger the low space algorithm. > I need to look at the code a bit more to understand all this. The same > allocation logic appears to be used in xfs_bmap_extents_to_btree(), > xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's some > logic in the logic. I think they are all fine, now that I think I understand the reason for the logic - that we only reserve enough blocks in the transaction for freespace tree splits in a single AG. New blocks in the bmbt can be triggered when: 1. extent conversion splits a record into two (or three) records. 2. extent removal split a record into two (punch a hole in the middle of an extent) 3. extent allocation inserts a new record 4. converting from extent to btree format Note that the local to extent change is actually a data extent allocation (same as xfs_bmap_btalloc), not a bmbt allocation. So, in the cases of #1 and #2, it is unlikely that there has been a previous allocation in the transaction, and so we take the first case of args.fsbno == NULLFSBLOCK to chose an AG with enough space for an entire bmbt split. Otherwise, it's the same case as #3 and #4. In the case of #3 and #4, the reason for the bmbt allocation is that we have allocated a new extent, and as a result firstblock _will_ be set. Hence we should have either space in the AG available for the bmbt split _or_ the low space algorithm is active. This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block with the initial condition of firstblock set, not enough space in the current AG and the low space algorithm not active. For the test case (250), we still have lots of free space in the filesystem so we should not be activating the low space algorithm. That leaves the fact we are not leaving enough space in the AG for the bmbt split after the extent is allocated. AFAICT, this code in xfs_bmapi() handles it: 4452 if (wr && *firstblock == NULLFSBLOCK) { 4453 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE) 4454 minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1; 4455 else 4456 minleft = 1; 4457 } else 4458 minleft = 0; So it seems to me that in all cases the setting of minleft is correct, as are the fallbacks that clear minleft then set xbf_low in xfs_bmap_btalloc(). So I don't think the bug actually resides at the xfs-bmapi/bmbt level. The question I'm now trying to answer is how we can allocate the entire AG if minleft is actually set. The answer, it appears, is in xfs_alloc_fix_minleft(). If we've allocated the largest extent possible, we'll have a situation where the only free space remaining is on the freelist. To use numbers relevant to the test case: agsize = 4096 blocks (16MB) max ext len = 4096 - 8 blocks = 4088 blocks AGF freeblks = 4088 AGF flcount = 4 minleft = 2; extent allocation length = 4088 diff = 4088 + 4 - 4088 - 2 = 2 And because diff >= 0, xfs_alloc_fix_minleft() says that the extent length is OK and does not trim it. Then when we go to allocate bmbt blocks, args->total is not set so the first block is allocated from the free list, and the second then fails at the xfs_alloc_fix_freelist checks because we don't have space available in the AG.... IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft() assumes that the AGFL blocks are part of the space available for the allocation. This, it seems, is wrong. If minleft is set, then we are not allowing the allocation to dip into any reserves, so we should not be considering the freelist blocks as available for minleft reservations. So, the patch below fixes the test 250 assert failure as well, and to me seems much more likely as the root cause of the bug. What do you think, Lachlan? Cheers, Dave. -- Dave Chinner david@fromorbit.com --- fs/xfs/xfs_alloc.c | 1 - fs/xfs/xfs_bmap_btree.c | 1 + 2 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c index 27d64d7..8946464 100644 --- a/fs/xfs/xfs_alloc.c +++ b/fs/xfs/xfs_alloc.c @@ -280,7 +280,6 @@ xfs_alloc_fix_minleft( return 1; agf = XFS_BUF_TO_AGF(args->agbp); diff = be32_to_cpu(agf->agf_freeblks) - + be32_to_cpu(agf->agf_flcount) - args->len - args->minleft; if (diff >= 0) return 1; diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c index 87d3c10..470047e 100644 --- a/fs/xfs/xfs_bmap_btree.c +++ b/fs/xfs/xfs_bmap_btree.c @@ -518,6 +518,7 @@ xfs_bmbt_alloc_block( args.mp = cur->bc_mp; args.fsbno = cur->bc_private.b.firstblock; args.firstblock = args.fsbno; + args.total = 1; if (args.fsbno == NULLFSBLOCK) { args.fsbno = be64_to_cpu(start->l); _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-12 6:53 ` Dave Chinner @ 2011-04-12 8:53 ` Dave Chinner 2011-04-15 7:29 ` Lachlan McIlroy 1 sibling, 0 replies; 7+ messages in thread From: Dave Chinner @ 2011-04-12 8:53 UTC (permalink / raw) To: Lachlan McIlroy; +Cc: xfs On Tue, Apr 12, 2011 at 04:53:28PM +1000, Dave Chinner wrote: > On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote: > > ----- Original Message ----- > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > When a multilevel bmbt split occurs, we can be asked to allocate > > > blocks from an AG that has no space left available. In the case of > > > an extent just being allocated, the first bmbt block allocation sees > > > the firstblock parameter is set and does not set a minleft parameter > > > for the allocation. The allocation also does not set the total > > > number of blocks required by the allocation, either. ... > So, the patch below fixes the test 250 assert failure as well, and > to me seems much more likely as the root cause of the bug. FWIW, test 250 is showing up another three bugs, all unrelated to the bug it written to exercise. Two are mkfs bugs - the first being that mkfs is terminated due to freeing an invalid pointer. The second being that it is leaving behind a corrupted freespace btree: _check_xfs_filesystem: filesystem on /mnt/test/250.fs is inconsistent *** xfs_repair -n output *** Phase 1 - find and verify superblock... Phase 2 - using internal log - scan filesystem freespace and inode maps... invalid start block 4096 in record 0 of 4635043 btree block 1600/1 invalid start block 4096 in record 0 of 4635039 btree block 1600/2 - found root inode chunk Phase 3 - for each AG... .... xfs_db> convert agno 1600 agbno 1 fsb 0x640001 (6553601) xfs_db> fsb 0x640001 xfs_db> type bnobt xfs_db> p magic = 0x41425442 level = 0 numrecs = 1 leftsib = null rightsib = null recs[1] = [startblock,blockcount] 1:[4096,0] xfs_db> fsb 0x640002 xfs_db> type cntbt xfs_db> p magic = 0x41425443 level = 0 numrecs = 1 leftsib = null rightsib = null recs[1] = [startblock,blockcount] 1:[4096,0] xfs_db> So in AG 1600, there is a freespace record at block 4096 for length zero. Both are incorrect - the AG size is only 4096 blocks. Worth noting: xfs_db> agf 1600 xfs_db> p magicnum = 0x58414746 versionnum = 1 seqno = 1600 length = 4096 bnoroot = 1 cntroot = 2 bnolevel = 1 cntlevel = 1 flfirst = 0 fllast = 127 flcount = 0 freeblks = 0 longest = 0 btreeblks = 0 All the freelist blocks have been consumed, which should not happen - there should be 4 freelist blocks when the AG is empty. Looking a bit deeper, AG 1600 doesn't contain any data blocks - it contains the log. The log is allocated by mkfs, and is 4092 blocks in length. Ї just confirmed that the repair failure occurs on a freshly made FS, so this is definitely a mkfs bug that hasn't been noticed because the test hasn't been running to completion and checking the fs.... And the third bug is in repair: invalid start block 4096 in record 0 of 4635043 btree block 1600/1 invalid start block 4096 in record 0 of 4635039 btree block 1600/2 ^^^^^^^ These are supposed to say "bno" or "cnt", but a %d instead of a %s is incorrectly used in format string so it gives a wacky result. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] xfs: fix bmbt block allocation failures 2011-04-12 6:53 ` Dave Chinner 2011-04-12 8:53 ` Dave Chinner @ 2011-04-15 7:29 ` Lachlan McIlroy 1 sibling, 0 replies; 7+ messages in thread From: Lachlan McIlroy @ 2011-04-15 7:29 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs ----- Original Message ----- > On Mon, Apr 11, 2011 at 08:55:33PM -0400, Lachlan McIlroy wrote: > > ----- Original Message ----- > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > When a multilevel bmbt split occurs, we can be asked to allocate > > > blocks from an AG that has no space left available. In the case of > > > an extent just being allocated, the first bmbt block allocation > > > sees > > > the firstblock parameter is set and does not set a minleft > > > parameter > > > for the allocation. The allocation also does not set the total > > > number of blocks required by the allocation, either. > > > > > > If the extent allocation used all the free space in the AG, the > > > lack > > > of a total allocation size results in a block being reserved for > > > the > > > AG freespace btree manipulations being used incorrectly. > > > > > > Secondly, the allocation is only allowed to be allocated in the > > > current AG (NEAR_BNO) because the low space algorithm has not been > > > activated. As a result of this, the second block allocation in the > > > split fails to find sufficient space in the current AG and fails, > > > resulting in a dirty transaction cancellation and a filesytem > > > shutdown. > > > > > > There are two problems here - both pointed out by Lachlan McIlroy > > > when we were first discussing a proposed fix for the problem (See > > > http://oss.sgi.com/archives/xfs/2010-09/msg00438.html). Firstly, > > > the missing args->total was causing the initial block to be > > > allocated from the freespace reserve. Secondly that the NEAR_BNO > > > allocation should probably be a START_BNO allocation to allow > > > blocks > > > to be taken from a higher numbered AG. > > > > > > With these changes, the allocation failure goes away, but we > > > trigger > > > asserts in xfs_bmapi() that try to ensure that bmbt block > > > allocations only occur in the same AG as the extent allocated, > > > except if the low space algorithm is active. In this case, we > > > don't > > > need to activate the low space algorithm - if a START_BNO > > > allocation > > > fails with minleft = 0, then there is no space we can allocate at > > > all, and hence the low space algorithm is no help. Therefore the > > > asserts need modification reflect the new state of affairs. > > > > > > This commit fixes the problem demonstrated by the test case in > > > xfstests 250. > > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > fs/xfs/xfs_bmap.c | 8 ++------ > > > fs/xfs/xfs_bmap_btree.c | 7 ++----- > > > 2 files changed, 4 insertions(+), 11 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c > > > index fa00788..50eed97 100644 > > > --- a/fs/xfs/xfs_bmap.c > > > +++ b/fs/xfs/xfs_bmap.c > > > @@ -4917,13 +4917,9 @@ error0: > > > if (cur) { > > > if (!error) { > > > ASSERT(*firstblock == NULLFSBLOCK || > > > - XFS_FSB_TO_AGNO(mp, *firstblock) == > > > + XFS_FSB_TO_AGNO(mp, *firstblock) <= > > > XFS_FSB_TO_AGNO(mp, > > > - cur->bc_private.b.firstblock) || > > > - (flist->xbf_low && > > > - XFS_FSB_TO_AGNO(mp, *firstblock) < > > > - XFS_FSB_TO_AGNO(mp, > > > - cur->bc_private.b.firstblock))); > > > + cur->bc_private.b.firstblock)); > > > *firstblock = cur->bc_private.b.firstblock; > > > } > > > xfs_btree_del_cursor(cur, > > > > There's another ASSERT like this earlier in xfs_bmapi() after the > > call > > to xfs_bmap_alloc() which may need the same attention. > > No, it checks the allocated extent against any existing firstblock > and has nothing to do with the the bmbt allocations during extent > insertion. > > > > > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > > > index 87d3c10..b2cdbfe 100644 > > > --- a/fs/xfs/xfs_bmap_btree.c > > > +++ b/fs/xfs/xfs_bmap_btree.c > > > @@ -518,10 +518,11 @@ xfs_bmbt_alloc_block( > > > args.mp = cur->bc_mp; > > > args.fsbno = cur->bc_private.b.firstblock; > > > args.firstblock = args.fsbno; > > > + args.total = 1; > > > + args.type = XFS_ALLOCTYPE_START_BNO; > > > > > > if (args.fsbno == NULLFSBLOCK) { > > > args.fsbno = be64_to_cpu(start->l); > > > - args.type = XFS_ALLOCTYPE_START_BNO; > > > /* > > > * Make sure there is sufficient room left in the AG to > > > * complete a full tree split for an extent insert. If > > > @@ -534,10 +535,6 @@ xfs_bmbt_alloc_block( > > > * block allocation here and corrupt the filesystem. > > > */ > > > args.minleft = xfs_trans_get_block_res(args.tp); > > > - } else if (cur->bc_private.b.flist->xbf_low) { > > > - args.type = XFS_ALLOCTYPE_START_BNO; > > > - } else { > > > - args.type = XFS_ALLOCTYPE_NEAR_BNO; > > > } > > > > > > args.minlen = args.maxlen = args.prod = 1; > > > > I think that's exactly what I had in mind when we looked at this > > last > > time but now I'm not so sure. I think I understand now why there is > > a > > requirement that all subsequent allocations must (read 'can') come > > from > > the same AG. > > > > The first time we try to allocate blocks we look for an AG with all > > the > > space we need for the reservation to succeed. > > Of course, that explains the near bno allocation them. It's > basically saying "we only ever have a reservation for a freespace > btree split in a single AG". Ok, so it seems we need to retain that, > and the problem of not being able to allocate in the current AG is > elsewhere. > > Thanks for pointing that out to me, Lachlan - I should have realised > that is what it was for in the first place. > > > If we don't find a single > > AG that will satisfy the full request then we use the low space > > algorithm > > and start from AG 0. But if we do find such an AG with all the space > > we > > need it should be fine to use NEAR_BNO on subsequent allocations > > because > > the space is there, right? > > But for the case we are dealing with here - bmbt insertion after an > allocation - we already have firstblock set and hence never traverse > the args.fsbno == NULLFSBLOCK case as the cursor firstblock is > always set to the block that has already been allocated. Hence we > never hit the low space algorithm unless it was triggered by the > extent allocation. > > > If this logic fails to hold true then there are other problems we > > need to > > deal with including; > > > > If we've already allocated a block from AG N then all subsequent > > allocations must come from AGs >= N to prevent locking issues. If > > the > > blocks we need are in AGs < N we're in trouble. > > Agreed - this is what minleft is trying to avoid from happening > ahead of the first allocations. > > > We need to fall back to > > the low space algorithm and start from AG 0 but we can't do this if > > we > > have already allocated from AG N. > > Right, even the low space algorithm cannot avoid the need to avoid > lock inversions in the AGs. > > > Also if we pass START_BNO to xfs_alloc_vextent() then it will fall > > through to the ANY/START/FIRST_AG case and set flags to TRYLOCK. It > > That's not a problem. > > > will then scan AGs from N to the last AG and skip any that are > > already > > locked. It may skip AGs that have the space we need and we may get > > to > > the last AG without finding all required blocks. > > Sure, but.... > > > Then xfs_alloc_vextent() will loop through the AGs a second time if > > we > > didn't get all we needed the first time but this time without the > > TRYLOCK. > > This could result in locking AGs out of order. > > It loops the second time from the start AG, not AG zero so we should > not be locking AGs out of order. > > > So why is the extent allocation using all of the remaining blocks in > > the > > AG and not leaving any for the split? > > That's the big question, isn't it? I'll come back to this. > > > Are we not reserving the split > > space with the conversion transaction? > > I don't think this is relevant - the problem is allocation, not > extent conversion. > > > Is it because the space needed > > for splits is not accounted for when delayed allocations are > > reserved so > > we oversubscribe ourselves? > > This is preallocation into a hole, so delalloc is not involved > here. > > > Or are we not activating the low space > > algorithm early enough? > > There's plenty of free space, just not in the AG _after_ the extent > allocation, and so there's no reason to trigger the low space > algorithm. > > > I need to look at the code a bit more to understand all this. The > > same > > allocation logic appears to be used in xfs_bmap_extents_to_btree(), > > xfs_bmap_local_to_extents() and xfs_bmap_btalloc() so maybe there's > > some > > logic in the logic. > > I think they are all fine, now that I think I understand the reason > for the logic - that we only reserve enough blocks in the > transaction for freespace tree splits in a single AG. New blocks in > the bmbt can be triggered when: > > 1. extent conversion splits a record into two (or three) > records. > 2. extent removal split a record into two (punch a hole in > the middle of an extent) > 3. extent allocation inserts a new record > 4. converting from extent to btree format > > Note that the local to extent change is actually a data extent > allocation (same as xfs_bmap_btalloc), not a bmbt allocation. > > So, in the cases of #1 and #2, it is unlikely that there has been a > previous allocation in the transaction, and so we take the first > case of args.fsbno == NULLFSBLOCK to chose an AG with enough space > for an entire bmbt split. Otherwise, it's the same case as #3 and > #4. > > In the case of #3 and #4, the reason for the bmbt allocation is that > we have allocated a new extent, and as a result firstblock _will_ be > set. Hence we should have either space in the AG available for the > bmbt split _or_ the low space algorithm is active. > > This problem case is that for #3/4 we can enter xfs_bmbt_alloc_block > with the initial condition of firstblock set, not enough space in > the current AG and the low space algorithm not active. For the test > case (250), we still have lots of free space in the filesystem so we > should not be activating the low space algorithm. That leaves the > fact we are not leaving enough space in the AG for the bmbt split > after the extent is allocated. AFAICT, this code in xfs_bmapi() > handles it: > > 4452 if (wr && *firstblock == NULLFSBLOCK) { > 4453 if (XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE) > 4454 minleft = be16_to_cpu(ifp->if_broot->bb_level) + 1; > 4455 else > 4456 minleft = 1; > 4457 } else > 4458 minleft = 0; > > So it seems to me that in all cases the setting of minleft is > correct, as are the fallbacks that clear minleft then set xbf_low in > xfs_bmap_btalloc(). So I don't think the bug actually resides at the > xfs-bmapi/bmbt level. The question I'm now trying to answer is how > we can allocate the entire AG if minleft is actually set. > > The answer, it appears, is in xfs_alloc_fix_minleft(). If we've > allocated the largest extent possible, we'll have a situation where > the only free space remaining is on the freelist. To use numbers > relevant to the test case: > > agsize = 4096 blocks (16MB) > max ext len = 4096 - 8 blocks > = 4088 blocks > AGF freeblks = 4088 > AGF flcount = 4 > minleft = 2; > > extent allocation length = 4088 > > diff = 4088 + 4 - 4088 - 2 > = 2 > > And because diff >= 0, xfs_alloc_fix_minleft() says that the extent > length is OK and does not trim it. Then when we go to allocate bmbt > blocks, args->total is not set so the first block is allocated from > the free list, and the second then fails at the > xfs_alloc_fix_freelist checks because we don't have space available > in the AG.... > > IOWs, we've set minleft appropriately, but xfs_alloc_fix_minleft() > assumes that the AGFL blocks are part of the space available for > the allocation. This, it seems, is wrong. If minleft is set, then we > are not allowing the allocation to dip into any reserves, so we > should not be considering the freelist blocks as available for > minleft reservations. > > So, the patch below fixes the test 250 assert failure as well, and > to me seems much more likely as the root cause of the bug. > > What do you think, Lachlan? I think you're onto something Dave and it makes a lot of sense. It would appear that the checks in xfs_alloc_fix_freelist() really only apply when allocating extents and maybe should be skipped when allocating the minleft blocks. Oh, unless the low space algorithm is being used. I think args.total represents the full extent allocation request and doesn't apply to minleft allocations so I don't think we need that change to xfs_bmbt_alloc_block(). Is just the fix to xfs_alloc_fix_minleft() enough here? > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > > --- > fs/xfs/xfs_alloc.c | 1 - > fs/xfs/xfs_bmap_btree.c | 1 + > 2 files changed, 1 insertions(+), 1 deletions(-) > > diff --git a/fs/xfs/xfs_alloc.c b/fs/xfs/xfs_alloc.c > index 27d64d7..8946464 100644 > --- a/fs/xfs/xfs_alloc.c > +++ b/fs/xfs/xfs_alloc.c > @@ -280,7 +280,6 @@ xfs_alloc_fix_minleft( > return 1; > agf = XFS_BUF_TO_AGF(args->agbp); > diff = be32_to_cpu(agf->agf_freeblks) > - + be32_to_cpu(agf->agf_flcount) > - args->len - args->minleft; > if (diff >= 0) > return 1; > diff --git a/fs/xfs/xfs_bmap_btree.c b/fs/xfs/xfs_bmap_btree.c > index 87d3c10..470047e 100644 > --- a/fs/xfs/xfs_bmap_btree.c > +++ b/fs/xfs/xfs_bmap_btree.c > @@ -518,6 +518,7 @@ xfs_bmbt_alloc_block( > args.mp = cur->bc_mp; > args.fsbno = cur->bc_private.b.firstblock; > args.firstblock = args.fsbno; > + args.total = 1; > > if (args.fsbno == NULLFSBLOCK) { > args.fsbno = be64_to_cpu(start->l); > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2011-04-15 7:25 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-04-11 2:34 [PATCH] xfs: fix bmbt block allocation failures Dave Chinner 2011-04-11 23:41 ` Christoph Hellwig 2011-04-11 23:56 ` Dave Chinner 2011-04-12 0:55 ` Lachlan McIlroy 2011-04-12 6:53 ` Dave Chinner 2011-04-12 8:53 ` Dave Chinner 2011-04-15 7:29 ` Lachlan McIlroy
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox