* [PATCH 0/2] repair: fix AGFL rebuild problem
@ 2014-10-29 3:09 Dave Chinner
2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Dave Chinner @ 2014-10-29 3:09 UTC (permalink / raw)
To: xfs; +Cc: bvowk
Hi folks,
The following two patches fix a repair rebuild issue that was
reported to me. xfs_repair would appear to succeed, but subsequent
runs would report a freespace mismatch and then on mount trigger
a corruption error on the first allocation from AG 0.
The problem as that the AGFL was not being correctly rebuilt due to
a very particular freespace pattern in AG 0 which required free
blocks in the AGFL to populate an empty AGFL. reapir silently
ignored this error, so it triggered allocation corruption errors
when the same AGFL fixup attempt failed on a mounted filesystem.
The fix is relatively straight forward, as is the fixes to the
tracing code that I had to make before i could track down the issue.
-Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 9+ messages in thread* [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors 2014-10-29 3:09 [PATCH 0/2] repair: fix AGFL rebuild problem Dave Chinner @ 2014-10-29 3:09 ` Dave Chinner 2014-10-29 4:03 ` Eric Sandeen 2014-10-29 19:24 ` Brian Foster 2014-10-29 3:09 ` [PATCH 2/2] repair: AGFL rebuild fails if btree split required Dave Chinner 2014-10-29 18:23 ` [PATCH 0/2] repair: fix AGFL rebuild problem Barkley Vowk 2 siblings, 2 replies; 9+ messages in thread From: Dave Chinner @ 2014-10-29 3:09 UTC (permalink / raw) To: xfs; +Cc: bvowk From: Dave Chinner <dchinner@redhat.com> Obviously hasn't been used for quite some time, so fix the build problems and make it useful again. Signed-off-by: Dave Chinner <dchinner@redhat.com> --- repair/phase5.c | 17 ++++++----------- 1 file changed, 6 insertions(+), 11 deletions(-) diff --git a/repair/phase5.c b/repair/phase5.c index e82eeb7..d6d3c6d 100644 --- a/repair/phase5.c +++ b/repair/phase5.c @@ -362,10 +362,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, extent_tree_node_t *ext_ptr; int level; #ifdef XR_BLD_FREE_TRACE - int old_state; - int state = XR_E_BAD_STATE; -#endif -#ifdef XR_BLD_FREE_TRACE fprintf(stderr, "in init_freespace_cursor, agno = %d\n", agno); #endif @@ -1292,8 +1288,8 @@ build_agf_agfl(xfs_mount_t *mp, memset(agf, 0, mp->m_sb.sb_sectsize); #ifdef XR_BLD_FREE_TRACE - fprintf(stderr, "agf = 0x%x, agf_buf->b_un.b_addr = 0x%x\n", - (__psint_t) agf, (__psint_t) agf_buf->b_un.b_addr); + fprintf(stderr, "agf = 0x%p, agf_buf->b_addr = 0x%p\n", + agf, agf_buf->b_addr); #endif /* @@ -1452,7 +1448,7 @@ build_agf_agfl(xfs_mount_t *mp, } #ifdef XR_BLD_FREE_TRACE - fprintf(stderr, "wrote agf for ag %u, error = %d\n", agno, error); + fprintf(stderr, "wrote agf for ag %u\n", agno); #endif } @@ -1707,11 +1703,10 @@ phase5(xfs_mount_t *mp) #ifdef XR_BLD_FREE_TRACE fprintf(stderr, "inobt level 1, maxrec = %d, minrec = %d\n", xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0), - xfs_inobt_maxrecs(mp->m_sb.sb_blocksize, 0) / 2 - ); + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0) / 2); fprintf(stderr, "inobt level 0 (leaf), maxrec = %d, minrec = %d\n", - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1), - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1) / 2); + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1), + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1) / 2); fprintf(stderr, "xr inobt level 0 (leaf), maxrec = %d\n", XR_INOBT_BLOCK_MAXRECS(mp, 0)); fprintf(stderr, "xr inobt level 1 (int), maxrec = %d\n", -- 2.0.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors 2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner @ 2014-10-29 4:03 ` Eric Sandeen 2014-10-29 19:24 ` Brian Foster 1 sibling, 0 replies; 9+ messages in thread From: Eric Sandeen @ 2014-10-29 4:03 UTC (permalink / raw) To: Dave Chinner, xfs; +Cc: bvowk On 10/28/14 10:09 PM, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Obviously hasn't been used for quite some time, so fix the > build problems and make it useful again. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> > --- > repair/phase5.c | 17 ++++++----------- > 1 file changed, 6 insertions(+), 11 deletions(-) > > diff --git a/repair/phase5.c b/repair/phase5.c > index e82eeb7..d6d3c6d 100644 > --- a/repair/phase5.c > +++ b/repair/phase5.c > @@ -362,10 +362,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > extent_tree_node_t *ext_ptr; > int level; > #ifdef XR_BLD_FREE_TRACE > - int old_state; > - int state = XR_E_BAD_STATE; > -#endif > -#ifdef XR_BLD_FREE_TRACE > fprintf(stderr, > "in init_freespace_cursor, agno = %d\n", agno); > #endif > @@ -1292,8 +1288,8 @@ build_agf_agfl(xfs_mount_t *mp, > memset(agf, 0, mp->m_sb.sb_sectsize); > > #ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, "agf = 0x%x, agf_buf->b_un.b_addr = 0x%x\n", > - (__psint_t) agf, (__psint_t) agf_buf->b_un.b_addr); > + fprintf(stderr, "agf = 0x%p, agf_buf->b_addr = 0x%p\n", > + agf, agf_buf->b_addr); > #endif > > /* > @@ -1452,7 +1448,7 @@ build_agf_agfl(xfs_mount_t *mp, > } > > #ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, "wrote agf for ag %u, error = %d\n", agno, error); > + fprintf(stderr, "wrote agf for ag %u\n", agno); > #endif > } > > @@ -1707,11 +1703,10 @@ phase5(xfs_mount_t *mp) > #ifdef XR_BLD_FREE_TRACE > fprintf(stderr, "inobt level 1, maxrec = %d, minrec = %d\n", > xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0), > - xfs_inobt_maxrecs(mp->m_sb.sb_blocksize, 0) / 2 > - ); > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0) / 2); > fprintf(stderr, "inobt level 0 (leaf), maxrec = %d, minrec = %d\n", > - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1), > - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1) / 2); > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1), > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1) / 2); > fprintf(stderr, "xr inobt level 0 (leaf), maxrec = %d\n", > XR_INOBT_BLOCK_MAXRECS(mp, 0)); > fprintf(stderr, "xr inobt level 1 (int), maxrec = %d\n", > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors 2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner 2014-10-29 4:03 ` Eric Sandeen @ 2014-10-29 19:24 ` Brian Foster 1 sibling, 0 replies; 9+ messages in thread From: Brian Foster @ 2014-10-29 19:24 UTC (permalink / raw) To: Dave Chinner; +Cc: bvowk, xfs On Wed, Oct 29, 2014 at 02:09:03PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Obviously hasn't been used for quite some time, so fix the > build problems and make it useful again. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Reviewed-by: Brian Foster <bfoster@redhat.com> > repair/phase5.c | 17 ++++++----------- > 1 file changed, 6 insertions(+), 11 deletions(-) > > diff --git a/repair/phase5.c b/repair/phase5.c > index e82eeb7..d6d3c6d 100644 > --- a/repair/phase5.c > +++ b/repair/phase5.c > @@ -362,10 +362,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > extent_tree_node_t *ext_ptr; > int level; > #ifdef XR_BLD_FREE_TRACE > - int old_state; > - int state = XR_E_BAD_STATE; > -#endif > -#ifdef XR_BLD_FREE_TRACE > fprintf(stderr, > "in init_freespace_cursor, agno = %d\n", agno); > #endif > @@ -1292,8 +1288,8 @@ build_agf_agfl(xfs_mount_t *mp, > memset(agf, 0, mp->m_sb.sb_sectsize); > > #ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, "agf = 0x%x, agf_buf->b_un.b_addr = 0x%x\n", > - (__psint_t) agf, (__psint_t) agf_buf->b_un.b_addr); > + fprintf(stderr, "agf = 0x%p, agf_buf->b_addr = 0x%p\n", > + agf, agf_buf->b_addr); > #endif > > /* > @@ -1452,7 +1448,7 @@ build_agf_agfl(xfs_mount_t *mp, > } > > #ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, "wrote agf for ag %u, error = %d\n", agno, error); > + fprintf(stderr, "wrote agf for ag %u\n", agno); > #endif > } > > @@ -1707,11 +1703,10 @@ phase5(xfs_mount_t *mp) > #ifdef XR_BLD_FREE_TRACE > fprintf(stderr, "inobt level 1, maxrec = %d, minrec = %d\n", > xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0), > - xfs_inobt_maxrecs(mp->m_sb.sb_blocksize, 0) / 2 > - ); > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 0) / 2); > fprintf(stderr, "inobt level 0 (leaf), maxrec = %d, minrec = %d\n", > - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1), > - xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, xfs_inobt, 1) / 2); > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1), > + xfs_inobt_maxrecs(mp, mp->m_sb.sb_blocksize, 1) / 2); > fprintf(stderr, "xr inobt level 0 (leaf), maxrec = %d\n", > XR_INOBT_BLOCK_MAXRECS(mp, 0)); > fprintf(stderr, "xr inobt level 1 (int), maxrec = %d\n", > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 2/2] repair: AGFL rebuild fails if btree split required 2014-10-29 3:09 [PATCH 0/2] repair: fix AGFL rebuild problem Dave Chinner 2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner @ 2014-10-29 3:09 ` Dave Chinner 2014-10-29 19:26 ` Brian Foster 2014-10-29 18:23 ` [PATCH 0/2] repair: fix AGFL rebuild problem Barkley Vowk 2 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2014-10-29 3:09 UTC (permalink / raw) To: xfs; +Cc: bvowk From: Dave Chinner <dchinner@redhat.com> In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from all the free space information we have gathered and resolved during pahses 3 and 4. If the freespace information is laid out just right, we end up having to allocate free space for the AGFL blocks. If the size of the free space we allocate from is larger than the space we need, then we have to insert the remainder back into the freespace btree. For the by-size tree, this means we are likely to be removing a record from one leaf, and then inserting the remainder - a smaller size - into another leaf. The issue is that the leaf blocks to the left of the original leaf block we removed the extent record from are full and hence require a split to insert the new record. That, of course, requires a free block in the AGFL to allocate from, and now we have a chicken and egg situation: there are no free blocks in the AGFL because we are setting it up. As a result, setting up the free list silently fails, leaving the freespace btrees in an inconsistent state and the AGFL in question empty. When the filesystem is next mounted, the first allocation from that AGF results in attempting to fix the AGFL, and it then does exactly the same thing as the repair code, fails to allocate a block during the split and fails. This results in an immediate shutdown because the transaction doing the allocation is dirty by this stage. The fix for the problem is to make repair handle rebulding the btree differently. If we leave ispace for a couple of records in each btree leaf and node, there is never a need for a split to occur when initially setting up the AGFL. This results in repair doing the right thing, and hence the runtime problems after mount don't occur. Further, add error checking the the AGFL setup code and abort repair if we have a failure to correctly set up the AGFL so we catch this problem at repair time, not mount time... Reported-by: Barkley Vowk <bvowk@box.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++--------- 1 file changed, 35 insertions(+), 9 deletions(-) diff --git a/repair/phase5.c b/repair/phase5.c index d6d3c6d..3d58936 100644 --- a/repair/phase5.c +++ b/repair/phase5.c @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs) } /* + * We need to leave some free records in the tree for the corner case of + * setting up the AGFL. This may require allocation of blocks, and as + * such can require insertion of new records into the tree (e.g. moving + * a record in the by-count tree when a long extent is shortened). If we + * pack the records into the leaves with no slack space, this requires a + * leaf split to occur and a block to be allocated from the free list. + * If we don't have any blocks on the free list (because we are setting + * it up!), then we fail, and the filesystem will fail with the same + * failure at runtime. Hence leave a couple of records slack space in + * each block to allow immediate modification of the tree without + * requiring splits to be done. + * * XXX(hch): any reason we don't just look at mp->m_alloc_mxr? */ #define XR_ALLOC_BLOCK_MAXRECS(mp, level) \ - xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, \ - (level) == 0) + (xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, (level) == 0) - 2) /* * this calculates a freespace cursor for an ag. @@ -361,10 +372,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, bt_stat_level_t *p_lptr; extent_tree_node_t *ext_ptr; int level; -#ifdef XR_BLD_FREE_TRACE - fprintf(stderr, - "in init_freespace_cursor, agno = %d\n", agno); -#endif num_extents = *extents; extents_used = 0; @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, lptr->num_recs_tot = num_extents; level = 1; +#ifdef XR_BLD_FREE_TRACE + fprintf(stderr, "%s 0 %d %d %d %d\n", __func__, + lptr->num_blocks, + lptr->num_recs_pb, + lptr->modulo, + lptr->num_recs_tot); +#endif /* * if we need more levels, set them up. # of records * per level is the # of blocks in the level below it @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, lptr->num_recs_pb = p_lptr->num_blocks / lptr->num_blocks; lptr->num_recs_tot = p_lptr->num_blocks; +#ifdef XR_BLD_FREE_TRACE + fprintf(stderr, "%s %d %d %d %d %d\n", __func__, + level, + lptr->num_blocks, + lptr->num_recs_pb, + lptr->modulo, + lptr->num_recs_tot); +#endif } } @@ -546,8 +568,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, lptr = &btree_curs->level[level]; p_lptr = &btree_curs->level[level-1]; lptr->num_blocks = howmany(p_lptr->num_blocks, - XR_ALLOC_BLOCK_MAXRECS(mp, - level)); + XR_ALLOC_BLOCK_MAXRECS(mp, level)); lptr->modulo = p_lptr->num_blocks % lptr->num_blocks; lptr->num_recs_pb = p_lptr->num_blocks @@ -1434,6 +1455,7 @@ build_agf_agfl(xfs_mount_t *mp, xfs_alloc_arg_t args; xfs_trans_t *tp; struct xfs_trans_res tres = {0}; + int error; memset(&args, 0, sizeof(args)); args.tp = tp = libxfs_trans_alloc(mp, 0); @@ -1442,8 +1464,12 @@ build_agf_agfl(xfs_mount_t *mp, args.alignment = 1; args.pag = xfs_perag_get(mp,agno); libxfs_trans_reserve(tp, &tres, XFS_MIN_FREELIST(agf, mp), 0); - libxfs_alloc_fix_freelist(&args, 0); + error = libxfs_alloc_fix_freelist(&args, 0); xfs_perag_put(args.pag); + if (error) { + do_error(_("failed to fix AGFL on AG %d, error %d\n"), + agno, error); + } libxfs_trans_commit(tp, 0); } -- 2.0.0 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] repair: AGFL rebuild fails if btree split required 2014-10-29 3:09 ` [PATCH 2/2] repair: AGFL rebuild fails if btree split required Dave Chinner @ 2014-10-29 19:26 ` Brian Foster 2014-10-29 19:59 ` Dave Chinner 0 siblings, 1 reply; 9+ messages in thread From: Brian Foster @ 2014-10-29 19:26 UTC (permalink / raw) To: Dave Chinner; +Cc: bvowk, xfs On Wed, Oct 29, 2014 at 02:09:04PM +1100, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from > all the free space information we have gathered and resolved during > pahses 3 and 4. If the freespace information is laid out just right, > we end up having to allocate free space for the AGFL blocks. > > If the size of the free space we allocate from is larger than the > space we need, then we have to insert the remainder back into the > freespace btree. For the by-size tree, this means we are likely to > be removing a record from one leaf, and then inserting the remainder > - a smaller size - into another leaf. > > The issue is that the leaf blocks to the left of the original leaf > block we removed the extent record from are full and hence require a > split to insert the new record. That, of course, requires a free > block in the AGFL to allocate from, and now we have a chicken and > egg situation: there are no free blocks in the AGFL because we are > setting it up. > > As a result, setting up the free list silently fails, leaving the > freespace btrees in an inconsistent state and the AGFL in question > empty. When the filesystem is next mounted, the first allocation > from that AGF results in attempting to fix the AGFL, and it then > does exactly the same thing as the repair code, fails to allocate a > block during the split and fails. This results in an immediate > shutdown because the transaction doing the allocation is dirty by > this stage. > > The fix for the problem is to make repair handle rebulding the btree > differently. If we leave ispace for a couple of records in each > btree leaf and node, there is never a need for a split to occur when > initially setting up the AGFL. This results in repair doing the > right thing, and hence the runtime problems after mount don't occur. > Further, add error checking the the AGFL setup code and abort repair > if we have a failure to correctly set up the AGFL so we catch this > problem at repair time, not mount time... > Interesting problem, thanks for the breakdown. I suppose the only interesting side effect is that the alloc btrees might require a bit more space after repair than before. I wonder if we need to take that into consideration here for certain cases. E.g., does this have ramifications for running repair on a clean, but completely full fs, or should that be generally handled by the existence of reserved blocks? A couple minor nits below... > Reported-by: Barkley Vowk <bvowk@box.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++--------- > 1 file changed, 35 insertions(+), 9 deletions(-) > > diff --git a/repair/phase5.c b/repair/phase5.c > index d6d3c6d..3d58936 100644 > --- a/repair/phase5.c > +++ b/repair/phase5.c > @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs) > } > > /* > + * We need to leave some free records in the tree for the corner case of > + * setting up the AGFL. This may require allocation of blocks, and as > + * such can require insertion of new records into the tree (e.g. moving > + * a record in the by-count tree when a long extent is shortened). If we > + * pack the records into the leaves with no slack space, this requires a > + * leaf split to occur and a block to be allocated from the free list. > + * If we don't have any blocks on the free list (because we are setting > + * it up!), then we fail, and the filesystem will fail with the same > + * failure at runtime. Hence leave a couple of records slack space in > + * each block to allow immediate modification of the tree without > + * requiring splits to be done. > + * > * XXX(hch): any reason we don't just look at mp->m_alloc_mxr? I guess we could kill this comment now. ;) > */ > #define XR_ALLOC_BLOCK_MAXRECS(mp, level) \ > - xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, \ > - (level) == 0) > + (xfs_allocbt_maxrecs((mp), (mp)->m_sb.sb_blocksize, (level) == 0) - 2) > > /* > * this calculates a freespace cursor for an ag. > @@ -361,10 +372,6 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > bt_stat_level_t *p_lptr; > extent_tree_node_t *ext_ptr; > int level; > -#ifdef XR_BLD_FREE_TRACE > - fprintf(stderr, > - "in init_freespace_cursor, agno = %d\n", agno); > -#endif > > num_extents = *extents; > extents_used = 0; > @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr->num_recs_tot = num_extents; > level = 1; > > +#ifdef XR_BLD_FREE_TRACE > + fprintf(stderr, "%s 0 %d %d %d %d\n", __func__, What's the 0 for? Brian > + lptr->num_blocks, > + lptr->num_recs_pb, > + lptr->modulo, > + lptr->num_recs_tot); > +#endif > /* > * if we need more levels, set them up. # of records > * per level is the # of blocks in the level below it > @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr->num_recs_pb = p_lptr->num_blocks > / lptr->num_blocks; > lptr->num_recs_tot = p_lptr->num_blocks; > +#ifdef XR_BLD_FREE_TRACE > + fprintf(stderr, "%s %d %d %d %d %d\n", __func__, > + level, > + lptr->num_blocks, > + lptr->num_recs_pb, > + lptr->modulo, > + lptr->num_recs_tot); > +#endif > } > } > > @@ -546,8 +568,7 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > lptr = &btree_curs->level[level]; > p_lptr = &btree_curs->level[level-1]; > lptr->num_blocks = howmany(p_lptr->num_blocks, > - XR_ALLOC_BLOCK_MAXRECS(mp, > - level)); > + XR_ALLOC_BLOCK_MAXRECS(mp, level)); > lptr->modulo = p_lptr->num_blocks > % lptr->num_blocks; > lptr->num_recs_pb = p_lptr->num_blocks > @@ -1434,6 +1455,7 @@ build_agf_agfl(xfs_mount_t *mp, > xfs_alloc_arg_t args; > xfs_trans_t *tp; > struct xfs_trans_res tres = {0}; > + int error; > > memset(&args, 0, sizeof(args)); > args.tp = tp = libxfs_trans_alloc(mp, 0); > @@ -1442,8 +1464,12 @@ build_agf_agfl(xfs_mount_t *mp, > args.alignment = 1; > args.pag = xfs_perag_get(mp,agno); > libxfs_trans_reserve(tp, &tres, XFS_MIN_FREELIST(agf, mp), 0); > - libxfs_alloc_fix_freelist(&args, 0); > + error = libxfs_alloc_fix_freelist(&args, 0); > xfs_perag_put(args.pag); > + if (error) { > + do_error(_("failed to fix AGFL on AG %d, error %d\n"), > + agno, error); > + } > libxfs_trans_commit(tp, 0); > } > > -- > 2.0.0 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] repair: AGFL rebuild fails if btree split required 2014-10-29 19:26 ` Brian Foster @ 2014-10-29 19:59 ` Dave Chinner 2014-10-30 13:42 ` Brian Foster 0 siblings, 1 reply; 9+ messages in thread From: Dave Chinner @ 2014-10-29 19:59 UTC (permalink / raw) To: Brian Foster; +Cc: bvowk, xfs On Wed, Oct 29, 2014 at 03:26:10PM -0400, Brian Foster wrote: > On Wed, Oct 29, 2014 at 02:09:04PM +1100, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from > > all the free space information we have gathered and resolved during > > pahses 3 and 4. If the freespace information is laid out just right, > > we end up having to allocate free space for the AGFL blocks. > > > > If the size of the free space we allocate from is larger than the > > space we need, then we have to insert the remainder back into the > > freespace btree. For the by-size tree, this means we are likely to > > be removing a record from one leaf, and then inserting the remainder > > - a smaller size - into another leaf. > > > > The issue is that the leaf blocks to the left of the original leaf > > block we removed the extent record from are full and hence require a > > split to insert the new record. That, of course, requires a free > > block in the AGFL to allocate from, and now we have a chicken and > > egg situation: there are no free blocks in the AGFL because we are > > setting it up. > > > > As a result, setting up the free list silently fails, leaving the > > freespace btrees in an inconsistent state and the AGFL in question > > empty. When the filesystem is next mounted, the first allocation > > from that AGF results in attempting to fix the AGFL, and it then > > does exactly the same thing as the repair code, fails to allocate a > > block during the split and fails. This results in an immediate > > shutdown because the transaction doing the allocation is dirty by > > this stage. > > > > The fix for the problem is to make repair handle rebulding the btree > > differently. If we leave ispace for a couple of records in each > > btree leaf and node, there is never a need for a split to occur when > > initially setting up the AGFL. This results in repair doing the > > right thing, and hence the runtime problems after mount don't occur. > > Further, add error checking the the AGFL setup code and abort repair > > if we have a failure to correctly set up the AGFL so we catch this > > problem at repair time, not mount time... > > > > Interesting problem, thanks for the breakdown. I suppose the only > interesting side effect is that the alloc btrees might require a bit > more space after repair than before. I wonder if we need to take that > into consideration here for certain cases. E.g., does this have > ramifications for running repair on a clean, but completely full fs, or > should that be generally handled by the existence of reserved blocks? In general, it shouldn't be an issue. It is extremely rare for a free space btree to be entirely compact - it currently only happens when repair runs, and the first insert at each leaf triggers splits and so after a short runtime will no longer be compact. Hence it is almost always going to be the case that the btree is smaller after repair than it was before.... I can see that it might cause an issue at maximally sized filesystems with worst case free space fragmentation, but given the runtime of repair of such a filesystem is probably measured in months and in hundreds of petabytes of RAM I'm not really concerned about that right now. > > A couple minor nits below... > > > Reported-by: Barkley Vowk <bvowk@box.com> > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > --- > > repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++--------- > > 1 file changed, 35 insertions(+), 9 deletions(-) > > > > diff --git a/repair/phase5.c b/repair/phase5.c > > index d6d3c6d..3d58936 100644 > > --- a/repair/phase5.c > > +++ b/repair/phase5.c > > @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs) > > } > > > > /* > > + * We need to leave some free records in the tree for the corner case of > > + * setting up the AGFL. This may require allocation of blocks, and as > > + * such can require insertion of new records into the tree (e.g. moving > > + * a record in the by-count tree when a long extent is shortened). If we > > + * pack the records into the leaves with no slack space, this requires a > > + * leaf split to occur and a block to be allocated from the free list. > > + * If we don't have any blocks on the free list (because we are setting > > + * it up!), then we fail, and the filesystem will fail with the same > > + * failure at runtime. Hence leave a couple of records slack space in > > + * each block to allow immediate modification of the tree without > > + * requiring splits to be done. > > + * > > * XXX(hch): any reason we don't just look at mp->m_alloc_mxr? > > I guess we could kill this comment now. ;) Comment is still valid. I only fixed the bug - I didn't look into why we aren't using mp->m_alloc_mxr instead of calculating the max records value from first principles... > > @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > > lptr->num_recs_tot = num_extents; > > level = 1; > > > > +#ifdef XR_BLD_FREE_TRACE > > + fprintf(stderr, "%s 0 %d %d %d %d\n", __func__, > > What's the 0 for? The btree level. 0 == leaf.... .... > > @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > > lptr->num_recs_pb = p_lptr->num_blocks > > / lptr->num_blocks; > > lptr->num_recs_tot = p_lptr->num_blocks; > > +#ifdef XR_BLD_FREE_TRACE > > + fprintf(stderr, "%s %d %d %d %d %d\n", __func__, > > + level, > > + lptr->num_blocks, And as you can see in this hunk as we set up the nodes the level is printed so we can see the config of each level as it is set up. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/2] repair: AGFL rebuild fails if btree split required 2014-10-29 19:59 ` Dave Chinner @ 2014-10-30 13:42 ` Brian Foster 0 siblings, 0 replies; 9+ messages in thread From: Brian Foster @ 2014-10-30 13:42 UTC (permalink / raw) To: Dave Chinner; +Cc: bvowk, xfs On Thu, Oct 30, 2014 at 06:59:39AM +1100, Dave Chinner wrote: > On Wed, Oct 29, 2014 at 03:26:10PM -0400, Brian Foster wrote: > > On Wed, Oct 29, 2014 at 02:09:04PM +1100, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > In phase 5 we rebuild the freelist btrees, the AGF and the AGFL from > > > all the free space information we have gathered and resolved during > > > pahses 3 and 4. If the freespace information is laid out just right, > > > we end up having to allocate free space for the AGFL blocks. > > > > > > If the size of the free space we allocate from is larger than the > > > space we need, then we have to insert the remainder back into the > > > freespace btree. For the by-size tree, this means we are likely to > > > be removing a record from one leaf, and then inserting the remainder > > > - a smaller size - into another leaf. > > > > > > The issue is that the leaf blocks to the left of the original leaf > > > block we removed the extent record from are full and hence require a > > > split to insert the new record. That, of course, requires a free > > > block in the AGFL to allocate from, and now we have a chicken and > > > egg situation: there are no free blocks in the AGFL because we are > > > setting it up. > > > > > > As a result, setting up the free list silently fails, leaving the > > > freespace btrees in an inconsistent state and the AGFL in question > > > empty. When the filesystem is next mounted, the first allocation > > > from that AGF results in attempting to fix the AGFL, and it then > > > does exactly the same thing as the repair code, fails to allocate a > > > block during the split and fails. This results in an immediate > > > shutdown because the transaction doing the allocation is dirty by > > > this stage. > > > > > > The fix for the problem is to make repair handle rebulding the btree > > > differently. If we leave ispace for a couple of records in each > > > btree leaf and node, there is never a need for a split to occur when > > > initially setting up the AGFL. This results in repair doing the > > > right thing, and hence the runtime problems after mount don't occur. > > > Further, add error checking the the AGFL setup code and abort repair > > > if we have a failure to correctly set up the AGFL so we catch this > > > problem at repair time, not mount time... > > > > > > > Interesting problem, thanks for the breakdown. I suppose the only > > interesting side effect is that the alloc btrees might require a bit > > more space after repair than before. I wonder if we need to take that > > into consideration here for certain cases. E.g., does this have > > ramifications for running repair on a clean, but completely full fs, or > > should that be generally handled by the existence of reserved blocks? > > In general, it shouldn't be an issue. It is extremely rare for a > free space btree to be entirely compact - it currently only happens > when repair runs, and the first insert at each leaf triggers splits > and so after a short runtime will no longer be compact. Hence it is > almost always going to be the case that the btree is smaller after > repair than it was before.... > > I can see that it might cause an issue at maximally sized > filesystems with worst case free space fragmentation, but given the > runtime of repair of such a filesystem is probably measured in > months and in hundreds of petabytes of RAM I'm not really concerned > about that right now. > I was more thinking about an fs close to ENOSPC than the overall size of the fs. Now that I think about it again, the free space is going to be related to the size of the trees so it's probably not an issue (e.g., more records in the tree, the more free space in the fs). More specifically, I see that we build up a reserve of blocks for each tree in advance in calculate_freespace_cursor() based on the number of records. That pretty much handles the general issue of the size of the metadata changing due to repair as we'll fail if there isn't enough space. All that said (and as discussed on irc), this isn't quite how I would expect to address this problem. The general algorithm for allocbt reconstruction is to calculate the blocks required for the tree based on the number of records, reserve that many blocks out of the in-core block allocation metadata, reconstruct the tree in its entirety into the buffers associated with those blocks and finally write out the buffers to disk. After the tree has been constructed, remaining extra blocks are spilled over into the agfl. Extra blocks (over reservation) seems to only occur when it is expected that the blocks allocated for the btree change the geometry of the tree from what is expected without considering said btree blocks. I would think that if we're going to potentially do an allocation to fixup the agfl, it probably makes sense to seed the agfl similar to how the btree in general is constructed from in-core metadata. The current risk here is that we don't seem to handle overestimation well. Whatever blocks are left are fit into the agfl and remaining blocks beyond that are lost. The trees have been written by the time build_agf_agfl() is called and we do go right ahead and do allocations for the agfl, so it seems like we should be able to free those extra blocks just the same (this would need to be confirmed, as I could be missing something) and eliminate that (independent) problem entirely. So with that, there is certainly more to it than giving slack to the btree leaf blocks and thus at best this is work for another day. This solution seems fine to me: Reviewed-by: Brian Foster <bfoster@redhat.com> Brian > > > > A couple minor nits below... > > > > > Reported-by: Barkley Vowk <bvowk@box.com> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > --- > > > repair/phase5.c | 44 +++++++++++++++++++++++++++++++++++--------- > > > 1 file changed, 35 insertions(+), 9 deletions(-) > > > > > > diff --git a/repair/phase5.c b/repair/phase5.c > > > index d6d3c6d..3d58936 100644 > > > --- a/repair/phase5.c > > > +++ b/repair/phase5.c > > > @@ -335,11 +335,22 @@ finish_cursor(bt_status_t *curs) > > > } > > > > > > /* > > > + * We need to leave some free records in the tree for the corner case of > > > + * setting up the AGFL. This may require allocation of blocks, and as > > > + * such can require insertion of new records into the tree (e.g. moving > > > + * a record in the by-count tree when a long extent is shortened). If we > > > + * pack the records into the leaves with no slack space, this requires a > > > + * leaf split to occur and a block to be allocated from the free list. > > > + * If we don't have any blocks on the free list (because we are setting > > > + * it up!), then we fail, and the filesystem will fail with the same > > > + * failure at runtime. Hence leave a couple of records slack space in > > > + * each block to allow immediate modification of the tree without > > > + * requiring splits to be done. > > > + * > > > * XXX(hch): any reason we don't just look at mp->m_alloc_mxr? > > > > I guess we could kill this comment now. ;) > > Comment is still valid. I only fixed the bug - I didn't look into > why we aren't using mp->m_alloc_mxr instead of calculating the max > records value from first principles... > > > > @@ -385,6 +392,13 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > > > lptr->num_recs_tot = num_extents; > > > level = 1; > > > > > > +#ifdef XR_BLD_FREE_TRACE > > > + fprintf(stderr, "%s 0 %d %d %d %d\n", __func__, > > > > What's the 0 for? > > The btree level. 0 == leaf.... > > .... > > > @@ -402,6 +416,14 @@ calculate_freespace_cursor(xfs_mount_t *mp, xfs_agnumber_t agno, > > > lptr->num_recs_pb = p_lptr->num_blocks > > > / lptr->num_blocks; > > > lptr->num_recs_tot = p_lptr->num_blocks; > > > +#ifdef XR_BLD_FREE_TRACE > > > + fprintf(stderr, "%s %d %d %d %d %d\n", __func__, > > > + level, > > > + lptr->num_blocks, > > And as you can see in this hunk as we set up the nodes the level > is printed so we can see the config of each level as it is set up. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 0/2] repair: fix AGFL rebuild problem 2014-10-29 3:09 [PATCH 0/2] repair: fix AGFL rebuild problem Dave Chinner 2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner 2014-10-29 3:09 ` [PATCH 2/2] repair: AGFL rebuild fails if btree split required Dave Chinner @ 2014-10-29 18:23 ` Barkley Vowk 2 siblings, 0 replies; 9+ messages in thread From: Barkley Vowk @ 2014-10-29 18:23 UTC (permalink / raw) To: Dave Chinner; +Cc: xfs [-- Attachment #1.1: Type: text/plain, Size: 875 bytes --] These patches fixed my issue. Thanks. On Tue, Oct 28, 2014 at 8:09 PM, Dave Chinner <david@fromorbit.com> wrote: > Hi folks, > > The following two patches fix a repair rebuild issue that was > reported to me. xfs_repair would appear to succeed, but subsequent > runs would report a freespace mismatch and then on mount trigger > a corruption error on the first allocation from AG 0. > > The problem as that the AGFL was not being correctly rebuilt due to > a very particular freespace pattern in AG 0 which required free > blocks in the AGFL to populate an empty AGFL. reapir silently > ignored this error, so it triggered allocation corruption errors > when the same AGFL fixup attempt failed on a mounted filesystem. > > The fix is relatively straight forward, as is the fixes to the > tracing code that I had to make before i could track down the issue. > > -Dave. > > [-- Attachment #1.2: Type: text/html, Size: 1311 bytes --] [-- Attachment #2: Type: text/plain, Size: 121 bytes --] _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-10-30 13:43 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2014-10-29 3:09 [PATCH 0/2] repair: fix AGFL rebuild problem Dave Chinner 2014-10-29 3:09 ` [PATCH 1/2] repair: fix XR_BLD_FREE_TRACE compilation errors Dave Chinner 2014-10-29 4:03 ` Eric Sandeen 2014-10-29 19:24 ` Brian Foster 2014-10-29 3:09 ` [PATCH 2/2] repair: AGFL rebuild fails if btree split required Dave Chinner 2014-10-29 19:26 ` Brian Foster 2014-10-29 19:59 ` Dave Chinner 2014-10-30 13:42 ` Brian Foster 2014-10-29 18:23 ` [PATCH 0/2] repair: fix AGFL rebuild problem Barkley Vowk
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox