* [PATCH 0/3] xfs: fixes for 3.10-rc6
@ 2013-06-12 2:19 Dave Chinner
2013-06-12 2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-12 2:19 UTC (permalink / raw)
To: xfs; +Cc: bpm
Hi Ben,
Here are the fixes I have for 3.10-rc6. The first fixes the log
recovery regression that Dave Jones reported, the second fixes a bug
in the CRC on-disk format on 32 bit systems, and the third fixes a
BMBT block corruption when CRCs are enabled. All are simple, short
fixes, so please consider them for 3.10-rc6.
Thanks,
Dave.
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
@ 2013-06-12  2:19 ` Dave Chinner
  2013-06-13  1:04   ` Ben Myers
  2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
  2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
  2 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
To: xfs; +Cc: bpm

From: Dave Chinner <dchinner@redhat.com>

Unfortunately, we cannot guarantee that items logged multiple times
and replayed by log recovery do not take objects back in time. When
they are taken back in time, they go into an intermediate state which
is corrupt, and hence verification that occurs on this intermediate
state causes log recovery to abort with a corruption shutdown.

Instead of causing a shutdown and an unmountable filesystem, don't
verify post-recovery items before they are written to disk. This is
less than optimal, but there is no way to detect this issue for
non-CRC filesystems. If log recovery successfully completes, this
will be undone and the object will be made consistent by subsequent
transactions that are replayed, so in most cases we don't need to
take drastic action.

For CRC enabled filesystems, leave the verifiers in place - we need
to call them to recalculate the CRCs on the objects anyway. This
recovery problem can be solved for such filesystems - we have a LSN
stamped in all metadata at writeback time that we can use to determine
whether the item should be replayed or not. This is a separate piece
of work, so is not addressed by this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_recover.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 45a85ff..7cf5e4e 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1845,7 +1845,13 @@ xlog_recover_do_inode_buffer(
 	xfs_agino_t		*buffer_nextp;

 	trace_xfs_log_recover_buf_inode_buf(mp->m_log, buf_f);
-	bp->b_ops = &xfs_inode_buf_ops;
+
+	/*
+	 * Post recovery validation only works properly on CRC enabled
+	 * filesystems.
+	 */
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		bp->b_ops = &xfs_inode_buf_ops;

 	inodes_per_buf = BBTOB(bp->b_io_length) >> mp->m_sb.sb_inodelog;
 	for (i = 0; i < inodes_per_buf; i++) {
@@ -2205,7 +2211,16 @@ xlog_recover_do_reg_buffer(
 	/* Shouldn't be any more regions */
 	ASSERT(i == item->ri_total);

-	xlog_recovery_validate_buf_type(mp, bp, buf_f);
+	/*
+	 * We can only do post recovery validation on items on CRC enabled
+	 * filesystems as we need to know when the buffer was written to be
+	 * able to determine if we should have replayed the item. If we
+	 * replay old metadata over a newer buffer, then it will enter a
+	 * temporarily inconsistent state resulting in verification
+	 * failures. Hence for now just avoid the verification stage for
+	 * non-CRC filesystems.
+	 */
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		xlog_recovery_validate_buf_type(mp, bp, buf_f);
 }

 /*
-- 
1.7.10.4
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
@ 2013-06-13  1:04   ` Ben Myers
  2013-06-13  2:08     ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13  1:04 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Hey Dave,

On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> [... full commit message quoted ...]

Is there a test case for this one? How are you reproducing this?

Thanks,
Ben
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13  1:04   ` Ben Myers
@ 2013-06-13  2:08     ` Dave Chinner
  2013-06-13 22:09       ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-13  2:08 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs

On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> Hey Dave,
>
> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote:
> > [... full commit message quoted ...]
>
> Is there a test case for this one? How are you reproducing this?

The test case was Dave Jones running sysrq-b on a hung test machine.
The machine would occasionally end up with a corrupt home directory.

http://oss.sgi.com/pipermail/xfs/2013-May/026759.html

Analysis from a metadump provided by Dave:

http://oss.sgi.com/pipermail/xfs/2013-June/026965.html

And Cai also appeared to be hitting this after a crash on 3.10-rc4,
as it's giving exactly the same "verifier failed during log recovery"
stack trace:

http://oss.sgi.com/pipermail/xfs/2013-June/026889.html

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13  2:08     ` Dave Chinner
@ 2013-06-13 22:09       ` Ben Myers
  2013-06-14  0:13         ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13 22:09 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

Hi Dave,

On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote:
> > [...]
> > Is there a test case for this one? How are you reproducing this?
>
> The test case was Dave Jones running sysrq-b on a hung test machine.
> The machine would occasionally end up with a corrupt home directory.
>
> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html
>
> Analysis from a metadump provided by Dave:
>
> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html
>
> And Cai also appeared to be hitting this after a crash on 3.10-rc4,
> as it's giving exactly the same "verifier failed during log recovery"
> stack trace:
>
> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html

Thanks. It appears that the verifiers have found corruption due to a
flaw in log recovery, and the fix you are proposing is to stop using
them. If we do that, we'll have no way of detecting the corruption and
will end up hanging users of older kernels out to dry.

I think your suggestion that non-debug systems could warn instead of
fail is a good one, but removing the verifier altogether is
inappropriate.

Can you make the metadump available? I need to understand this better
before I can sign off. Also: Any idea how far back this one goes?

Regards,
Ben
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-13 22:09       ` Ben Myers
@ 2013-06-14  0:13         ` Dave Chinner
  2013-06-14 12:55           ` Mark Tinguely
  2013-06-14 16:09           ` Ben Myers
  0 siblings, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-14  0:13 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs

On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
> Hi Dave,
>
> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote:
> > [...]
> > The test case was Dave Jones running sysrq-b on a hung test machine.
> > The machine would occasionally end up with a corrupt home directory.
> > [...]
>
> Thanks. It appears that the verifiers have found corruption due to a
> flaw in log recovery, and the fix you are proposing is to stop using
> them. If we do that, we'll have no way of detecting the corruption and
> will end up hanging users of older kernels out to dry.

We've never detected it before, and it's causing regressions for
multiple people. We *can't fix it* because we can't detect the
situation sanely, and we are not leaving people with old kernels
hanging out to dry. The opposite is true: we are fucking over
current users by preventing log recovery on filesystems that will
recover perfectly OK and have almost always recovered just fine in
the past.

> I think your suggestion that non-debug systems could warn instead of
> fail is a good one, but removing the verifier altogether is
> inappropriate.

Changing every single verifier in a non-trivial way is not something
I'm about to do for a -rc6 kernel. Removing the verifiers from log
recovery just reverts to the pre-3.8 situation, so it is a perfectly
acceptable short-term solution while we do the more invasive verifier
changes.

> Can you make the metadump available? I need to understand this better
> before I can sign off. Also: Any idea how far back this one goes?

No, I can't make the metadump available to you - it was provided
privately and not obfuscated, so you'd have to ask Dave for it.

As to how long this problem has existed? It's a zero-day bug. Like I
said, I've suspected for years that this can happen, and only now do
we have proof of it...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14  0:13         ` Dave Chinner
@ 2013-06-14 12:55           ` Mark Tinguely
  0 siblings, 0 replies; 31+ messages in thread
From: Mark Tinguely @ 2013-06-14 12:55 UTC (permalink / raw)
To: Dave Chinner; +Cc: Ben Myers, xfs

On 06/13/13 19:13, Dave Chinner wrote:
> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
>> [...]
>> Thanks. It appears that the verifiers have found corruption due to a
>> flaw in log recovery, and the fix you are proposing is to stop using
>> them. If we do that, we'll have no way of detecting the corruption and
>> will end up hanging users of older kernels out to dry.
>
> We've never detected it before, and it's causing regressions for
> multiple people. We *can't fix it* because we can't detect the
> situation sanely, and we are not leaving people with old kernels
> hanging out to dry.
>
> [...]
>
> Changing every single verifier in a non-trivial way is not something
> I'm about to do for a -rc6 kernel. Removing the verifiers from log
> recovery just reverts to the pre-3.8 situation, so it is a perfectly
> acceptable short-term solution while we do the more invasive verifier
> changes.

My gut feeling for the patch was the same as Ben's, but thinking this
over, I have to take back all the eloquent curse words.

IMO, we have to bring the patch in, because the goal for Linux 3.10 is
to have a stable environment for the non-CRC case, and the verifier is
breaking XFS log recovery for the common non-CRC case.

---

In the common case, the verifier is tripping over the fundamental
difference between how the AIL works (it consolidates buffer writes
into one, and it can write anything that made it to the log up to
l_last_sync_lsn) and log recovery, which works on each modification.

If there is another unknown kind of future write, then it would be
nice to know, and having a warning message would help. Unfortunately,
the warning may make recovery noisy and falsely concern users, but we
are in uncharted waters.

I will blame and put a reviewed-by on the patch.

--Mark.
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14  0:13         ` Dave Chinner
  2013-06-14 12:55           ` Mark Tinguely
@ 2013-06-14 16:09           ` Ben Myers
  2013-06-14 16:15             ` Eric Sandeen
  2013-06-14 16:17             ` Dave Jones
  1 sibling, 2 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-14 16:09 UTC (permalink / raw)
To: Dave Jones; +Cc: xfs

Hey,

On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote:
> > [...]
> > Can you make the metadump available? I need to understand this better
> > before I can sign off. Also: Any idea how far back this one goes?
>
> No, I can't make the metadump available to you - it was provided
> privately and not obfuscated, so you'd have to ask Dave for it.

Dave (Jones), could you make the metadump available to me? I'd like to
understand this a little bit better. I'm a bit uncomfortable with the
proposition that we should corrupt silently in this case...

Thanks,
Ben
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
  2013-06-14 16:09           ` Ben Myers
@ 2013-06-14 16:15             ` Eric Sandeen
  2013-06-14 19:08               ` Ben Myers
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2013-06-14 16:15 UTC (permalink / raw)
To: Ben Myers; +Cc: Dave Jones, xfs

On 6/14/13 11:09 AM, Ben Myers wrote:
> Hey,
>
> On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote:
>> [...]
>> No, I can't make the metadump available to you - it was provided
>> privately and not obfuscated, so you'd have to ask Dave for it.
>
> Dave (Jones), could you make the metadump available to me? I'd like to
> understand this a little bit better. I'm a bit uncomfortable with the
> proposition that we should corrupt silently in this case...

Ben, isn't it the case that the corruption would only happen if
log replay failed for some reason (as has always been the case,
verifier or not), but with the verifier in place, it kills replay
even without other problems, due to a logical problem with the
(recently added) verifiers?

IOW - this seems like an actual functional regression due to the
addition of the verifier, and dchinner's patch gets us back
to the almost-always-fine state we were in prior to the change.

As we're at -rc6, it seems quite reasonable to me as a quick
fix to just short-circuit it for now. If you have time to analyze
Dave's metadump that's cool, but this seems like something that
really needs to be addressed before 3.10 gets out the door.

Whenever you're ready, you can also add:

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

(And dchinner, should this cc: stable for 3.9?)

Thanks,
-Eric
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 16:15 ` Eric Sandeen @ 2013-06-14 19:08 ` Ben Myers 2013-06-14 19:18 ` Eric Sandeen 0 siblings, 1 reply; 31+ messages in thread From: Ben Myers @ 2013-06-14 19:08 UTC (permalink / raw) To: Eric Sandeen; +Cc: Dave Jones, xfs Hey Eric, On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote: > On 6/14/13 11:09 AM, Ben Myers wrote: > > On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote: > >> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote: > >>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote: > >>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote: > >>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote: > >>>>>> From: Dave Chinner <dchinner@redhat.com> > >>>>>> > >>>>>> Unfortunately, we cannot guarantee that items logged multiple times > >>>>>> and replayed by log recovery do not take objects back in time. When > >>>>>> theya re taken back in time, the go into an intermediate state which > >>>>>> is corrupt, and hence verification that occurs on this intermediate > >>>>>> state causes log recovery to abort with a corruption shutdown. > >>>>>> > >>>>>> Instead of causing a shutdown and unmountable filesystem, don't > >>>>>> verify post-recovery items before they are written to disk. This is > >>>>>> less than optimal, but there is no way to detect this issue for > >>>>>> non-CRC filesystems If log recovery successfully completes, this > >>>>>> will be undone and the object will be consistent by subsequent > >>>>>> transactions that are replayed, so in most cases we don't need to > >>>>>> take drastic action. > >>>>>> > >>>>>> For CRC enabled filesystems, leave the verifiers in place - we need > >>>>>> to call them to recalculate the CRCs on the objects anyway. 
This > >>>>>> recovery problem can be solved for such filesystems - we have an LSN > >>>>>> stamped in all metadata at writeback time that we can use to determine > >>>>>> whether the item should be replayed or not. This is a separate piece > >>>>>> of work, so is not addressed by this patch. > >>>>> > >>>>> Is there a test case for this one? How are you reproducing this? > >>>> > >>>> The test case was Dave Jones running sysrq-b on a hung test machine. > >>>> The machine would occasionally end up with a corrupt home directory. > >>>> > >>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html > >>>> > >>>> Analysis from a metadump provided by Dave: > >>>> > >>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html > >>>> > >>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4, > >>>> as it's giving exactly the same "verifier failed during log recovery" > >>>> stack trace: > >>>> > >>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html > >>> > >>> Thanks. It appears that the verifiers have found corruption due to a > >>> flaw in log recovery, and the fix you are proposing is to stop using > >>> them. If we do that, we'll have no way of detecting the corruption and > >>> will end up hanging users of older kernels out to dry. > >> > >> We've never detected it before, and it's causing regressions for > >> multiple people. We *can't fix it* because we can't detect the > >> situation sanely, and we are not leaving people with old kernels > >> hanging out to dry. The opposite is true: we are fucking over > >> current users by preventing log recovery on filesystems that will > >> recover perfectly OK and have almost always recovered just fine in > >> the past. > >> > >>> I think your suggestion that non-debug systems could warn instead of > >>> fail is a good one, but removing the verifier altogether is > >>> inappropriate. > >> > >> Changing every single verifier in a non-trivial way is not something > >> I'm about to do for a -rc6 kernel. 
Removing the verifiers from log > >> recovery just reverts to the pre-3.8 situation, so is perfectly > >> acceptable short term solution while we do the more invasive verify > >> changes. > >> > >>> Can you make the metadump available? I need to understand this better > >>> before I can sign off. Also: Any idea how far back this one goes? > >> > >> No, I can't make the metadump available to you - it was provided > >> privately and not obfuscated and so you'd have to ask Dave for it. > > > > Dave (Jones), could you make the metadump available to me? I'd like to > > understand this a little bit better. I'm a bit uncomfortable with the > > proposition that we should corrupt silently in this case... > > Ben, isn't it the case that the corruption would only happen if > log replay failed for some reason (as has always been the case, > verifier or not), but with the verifier in place, it kills replay > even w/o other problems due to a logical problem with the > (recently added) verifiers? It seems like the verifier prevented corruption from hitting disk during log replay. It is enforcing a partial replay up to the point where the corruption occurred. Now you should be able to zero the log and the filesystem is not corrupted. > IOW - this seems like an actual functional regression due to the > addition of the verifier, and dchinner's patch gets us back > to the almost-always-fine state we were in prior to the change. Oh, the spin doctor is *in*! This isn't a logical problem with the verifier, it's a logical problem with log replay. We need to find a way for recovery to know whether a given transaction should be replayed. Fixing that is nontrivial. > As we're at -rc6, it seems quite reasonable to me as a quick > fix to just short-circuit it for now. If we're talking about a short term fix, that's fine. This should be conditional on CONFIG_XFS_DEBUG and marked as such. Long term, removing the verifiers is the wrong thing to do here. 
We need to fix the recovery bug and then remove this temporary workaround. > If you have time to analyze dave's metadump that's cool, but > this seems like something that really needs to be addressed > before 3.10 gets out the door. If this really is a day one bug then it's been out the door almost twenty years. And you want to hurry now? ;) > Whenever you're ready, you can also add: > > Reviewed-by: Eric Sandeen <sandeen@redhat.com> Sure. > (And dchinner, should this cc: stable for 3.9?) Looks like verifiers were added in 3.7. We should Cc stable. Thanks, Ben 
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 19:08 ` Ben Myers @ 2013-06-14 19:18 ` Eric Sandeen 2013-06-14 19:44 ` Ben Myers 0 siblings, 1 reply; 31+ messages in thread From: Eric Sandeen @ 2013-06-14 19:18 UTC (permalink / raw) To: Ben Myers; +Cc: Dave Jones, xfs On 6/14/13 2:08 PM, Ben Myers wrote: > Hey Eric, > > On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote: >> On 6/14/13 11:09 AM, Ben Myers wrote: >>> On Fri, Jun 14, 2013 at 10:13:06AM +1000, Dave Chinner wrote: >>>> On Thu, Jun 13, 2013 at 05:09:03PM -0500, Ben Myers wrote: >>>>> On Thu, Jun 13, 2013 at 12:08:27PM +1000, Dave Chinner wrote: >>>>>> On Wed, Jun 12, 2013 at 08:04:41PM -0500, Ben Myers wrote: >>>>>>> On Wed, Jun 12, 2013 at 12:19:06PM +1000, Dave Chinner wrote: >>>>>>>> From: Dave Chinner <dchinner@redhat.com> >>>>>>>> >>>>>>>> Unfortunately, we cannot guarantee that items logged multiple times >>>>>>>> and replayed by log recovery do not take objects back in time. When >>>>>>>> theya re taken back in time, the go into an intermediate state which >>>>>>>> is corrupt, and hence verification that occurs on this intermediate >>>>>>>> state causes log recovery to abort with a corruption shutdown. >>>>>>>> >>>>>>>> Instead of causing a shutdown and unmountable filesystem, don't >>>>>>>> verify post-recovery items before they are written to disk. This is >>>>>>>> less than optimal, but there is no way to detect this issue for >>>>>>>> non-CRC filesystems If log recovery successfully completes, this >>>>>>>> will be undone and the object will be consistent by subsequent >>>>>>>> transactions that are replayed, so in most cases we don't need to >>>>>>>> take drastic action. >>>>>>>> >>>>>>>> For CRC enabled filesystems, leave the verifiers in place - we need >>>>>>>> to call them to recalculate the CRCs on the objects anyway. 
This >>>>>>>> recovery problem canbe solved for such filesystems - we have a LSN >>>>>>>> stamped in all metadata at writeback time that we can to determine >>>>>>>> whether the item should be replayed or not. This is a separate piece >>>>>>>> of work, so is not addressed by this patch. >>>>>>> >>>>>>> Is there a test case for this one? How are you reproducing this? >>>>>> >>>>>> The test case was Dave Jones running sysrq-b on a hung test machine. >>>>>> The machine would occasionally end up with a corrupt home directory. >>>>>> >>>>>> http://oss.sgi.com/pipermail/xfs/2013-May/026759.html >>>>>> >>>>>> Analysis from a metdadump provided by Dave: >>>>>> >>>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026965.html >>>>>> >>>>>> And Cai also appeared to be hitting this after a crash on 3.10-rc4, >>>>>> as it's giving exactly the same "verifier failed during log recovery" >>>>>> stack trace: >>>>>> >>>>>> http://oss.sgi.com/pipermail/xfs/2013-June/026889.html >>>>> >>>>> Thanks. It appears that the verifiers have found corruption due to a >>>>> flaw in log recovery, and the fix you are proposing is to stop using >>>>> them. If we do that, we'll have no way of detecting the corruption and >>>>> will end up hanging users of older kernels out to dry. >>>> >>>> We've never detected it before, and it's causing regressions for >>>> multiple people. We *can't fix it* because we can't detect the >>>> situation sanely, and we are not leaving people with old kernels >>>> hanging out to dry. The opposite is true: we are fucking over >>>> current users by preventing log recovery on filesystems that will >>>> recovery perfectly OK and have almost always recovered just fine in >>>> the past. >>>> >>>>> I think your suggestion that non-debug systems could warn instead of >>>>> fail is a good one, but removing the verifier altogether is >>>>> inappropriate. >>>> >>>> Changing every single verifier in a non-trivial way is not something >>>> I'm about to do for a -rc6 kernel. 
Removing the verifiers from log >>>> recovery just reverts to the pre-3.8 situation, so is perfectly >>>> acceptable short term solution while we do the more invasive verify >>>> changes. >>>> >>>>> Can you make the metadump available? I need to understand this better >>>>> before I can sign off. Also: Any idea how far back this one goes? >>>> >>>> No, I can't make the metadump available to you - it was provided >>>> privately and not obfuscated and so you'd have to ask Dave for it. >>> >>> Dave (Jones), could you make the metadump available to me? I'd like to >>> understand this a little bit better. I'm a bit uncomfortable with the >>> proposition that we should corrupt silently in this case... >> >> Ben, isn't it the case that the corruption would only happen if >> log replay failed for some reason (as has always been the case, >> verifier or not), but with the verifier in place, it kills replay >> even w/o other problems due to a logical problem with the >> (recently added) verifiers? > > It seems like the verifier prevented corruption from hitting disk during > log replay. It detected an inconsistent *interim* state during replay, which is always made correct by log replay completion. But it *stopped* that log replay completion. And caused log replay to fail. And mount to fail. This is *new* behavior, and bad. As I understand it. > It is enforcing a partial replay up to the point where the > corruption occurred. Now you should be able to zero the log and the > filesystem is not corrupted. > >> IOW - this seems like an actual functional regression due to the >> addition of the verifier, and dchinner's patch gets us back >> to the almost-always-fine state we were in prior to the change. > > Oh, the spin doctor is *in*! This is not spin. > This isn't a logical problem with the verifier, it's a logical problem > with log replay. We need to find a way for recovery to know whether a > given transaction should be replayed. Fixing that is nontrivial. Right. 
And it's been around for years. The verifier now detects that interim state, and makes things *worse* than they would be had log replay been allowed to continue. Fixing the interim state may be nontrivial; allowing log replay to continue to a consistent state as it always has *is* trivial, it's what's done in Dave's small patch. >> As we're at -rc6, it seems quite reasonable to me as a quick >> fix to just short-circuit it for now. > > If we're talking about a short term fix, that's fine. This should be > conditional on CONFIG_XFS_DEBUG and marked as such. > > Long term, removing the verifiers is the wrong thing to do here. We > need to fix the recovery bug and then remove this temporary workaround. > >> If you have time to analyze dave's metadump that's cool, but >> this seems like something that really needs to be addressed >> before 3.10 gets out the door. > > If this really is a day one bug then it's been out the door almost > twenty years. And you want to hurry now? ;) We seem to be talking past each other. The corrupted interim state has been around for years. Up until now, log replay completion left things in perfect state. The verifier now *breaks replay* at that interim point. Were it allowed to continue, everything would be fine. As things stand, it is not fine, and this is a recent change which Dave is trying to correct. Leaving it in place will cause filesystems which were replaying logs just fine until recently to now fail with no good way out. -Eric >> Whenever you're ready, you can also add: >> >> Reviewed-by: Eric Sandeen <sandeen@redhat.com> > > Sure. > >> (And dchinner, should this cc: stable for 3.9?) > > Looks like verifiers were added in 3.7. We should Cc stable. > > Thanks, > Ben > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
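[Editorial sketch: the "corrupted interim state" Eric describes can be made concrete with a tiny standalone simulation. This is not XFS code and all names are invented for illustration: an object logged at several points in time may be overwritten during replay by an older logged copy first, so mid-replay the on-disk object regresses ("goes back in time"), yet replaying the remaining transactions restores the final, consistent version.]

```c
#include <assert.h>
#include <stddef.h>

/* Replay the first `upto` logged copies of an object, oldest first,
 * and return the resulting on-disk value. Each replay simply
 * overwrites the object, which is how buffer replay behaves. */
static int replay(const int *log_copies, size_t upto)
{
    int obj = 3; /* on-disk state at mount time: already the newest copy */

    for (size_t i = 0; i < upto; i++)
        obj = log_copies[i]; /* an old copy can regress the object */

    return obj;
}
```

With `log_copies = {1, 2, 3}`, stopping after one item leaves the object at the stale value 1 (the state a verifier would reject), while replaying all three items returns it to 3, the consistent final state.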
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 19:18 ` Eric Sandeen @ 2013-06-14 19:44 ` Ben Myers 2013-06-14 19:54 ` Eric Sandeen 2013-06-15 0:56 ` Dave Chinner 0 siblings, 2 replies; 31+ messages in thread From: Ben Myers @ 2013-06-14 19:44 UTC (permalink / raw) To: Eric Sandeen; +Cc: Dave Jones, xfs Hey Eric, On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote: > On 6/14/13 2:08 PM, Ben Myers wrote: > > On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote: > >> Ben, isn't it the case that the corruption would only happen if > >> log replay failed for some reason (as has always been the case, > >> verifier or not), but with the verifier in place, it kills replay > >> even w/o other problems due to a logical problem with the > >> (recently added) verifiers? > > > > It seems like the verifier prevented corruption from hitting disk during > > log replay. > > It detected a an inconsistent *interim* state during replay, which is > always made correct by log replay completion. But it *stopped* that log > replay completion. And caused log replay to fail. And mount to fail. > This is *new* behavior, and bad. > > As I understand it. > > > It is enforcing a partial replay up to the point where the > > corruption occurred. Now you should be able to zero the log and the > > filesystem is not corrupted. > > > >> IOW - this seems like an actual functional regression due to the > >> addition of the verifier, and dchinner's patch gets us back > >> to the almost-always-fine state we were in prior to the change. > > > > Oh, the spin doctor is *in*! > > This is not spin. > > > This isn't a logical problem with the verifier, it's a logical problem > > with log replay. We need to find a way for recovery to know whether a > > given transaction should be replayed. Fixing that is nontrivial. > > Right. > > And it's been around for years. 
The verifier now detects that > interim state, and makes things *worse* than they would be had log > replay been allowed to continue. > > Fixing the interim state may be nontrivial; allowing log replay > to continue to a consistent state as it always has *is* trivial, > it's what's done in Dave's small patch. > > >> As we're at -rc6, it seems quite reasonable to me as a quick > >> fix to just short-circuit it for now. > > > > If we're talking about a short term fix, that's fine. This should be > > conditional on CONFIG_XFS_DEBUG and marked as such. > > > > Long term, removing the verifiers is the wrong thing to do here. We > > need to fix the recovery bug and then remove this temporary workaround. > > > >> If you have time to analyze dave's metadump that's cool, but > >> this seems like something that really needs to be addressed > >> before 3.10 gets out the door. > > > > If this really is a day one bug then it's been out the door almost > > twenty years. And you want to hurry now? ;) > > We seem to be talking past each other. > > The corrupted interim state has been around for years. Up until > now, log replay completion left things in perfect state. > > The verifier now *breaks replay* at that interim point. > Were it allowed to continue, everything would be fine. > > As things stand, it is not fine, and this is a recent change > which Dave is trying to correct. > > Leaving it in place will cause filesystems which were replaying > logs just fine until recently to now fail with no good way out. That is consistent with my understanding of the problem... Unfortunately log replay is broken. The verifier has detected this and stopped replay. Ideally the solution would be to fix log replay, but that is going to take some time. So, in the near term we're just going to disable the verifier to allow replay to complete. 
I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so that developers still have a chance at hitting the log replay problem, and a comment should be added explaining that we've disabled the verifier due to a specific bug as a temporary workaround and we'll re-enable the verifier once it's fixed. I'll update the patch and repost. Are you guys arguing that the log replay bug should not be fixed? -Ben 
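[Editorial sketch of Ben's proposal: keep write verification during recovery only on debug builds, with a comment marking the workaround as temporary. CONFIG_XFS_DEBUG is a real kernel option, but the struct and helper here are simplified stand-ins, not the actual patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct xfs_buf_ops: just the write verifier. */
struct buf_ops {
    void (*verify_write)(void *bp);
};

static void example_verify_write(void *bp) { (void)bp; /* checks go here */ }

static const struct buf_ops recovery_ops = {
    .verify_write = example_verify_write,
};

/* Temporary workaround: outside debug builds, buffers recovered from
 * the log are written back without verification, so replay of a
 * transient inconsistent ("back in time") state cannot abort log
 * recovery. Remove once recovery can decide per-transaction whether
 * an item should be replayed at all. */
static const struct buf_ops *recovery_buf_ops(void)
{
#ifdef CONFIG_XFS_DEBUG
    return &recovery_ops;  /* developers still hit the replay bug */
#else
    return NULL;           /* production mounts complete recovery */
#endif
}
```

On a build without CONFIG_XFS_DEBUG defined, `recovery_buf_ops()` returns NULL and recovery skips verification; debug builds keep the verifier attached.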
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 19:44 ` Ben Myers @ 2013-06-14 19:54 ` Eric Sandeen 2013-06-14 20:22 ` Ben Myers 2013-06-15 0:56 ` Dave Chinner 1 sibling, 1 reply; 31+ messages in thread From: Eric Sandeen @ 2013-06-14 19:54 UTC (permalink / raw) To: Ben Myers; +Cc: Dave Jones, xfs On 6/14/13 2:44 PM, Ben Myers wrote: > Hey Eric, > > On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote: >> On 6/14/13 2:08 PM, Ben Myers wrote: >>> On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote: >>>> Ben, isn't it the case that the corruption would only happen if >>>> log replay failed for some reason (as has always been the case, >>>> verifier or not), but with the verifier in place, it kills replay >>>> even w/o other problems due to a logical problem with the >>>> (recently added) verifiers? >>> >>> It seems like the verifier prevented corruption from hitting disk during >>> log replay. >> >> It detected a an inconsistent *interim* state during replay, which is >> always made correct by log replay completion. But it *stopped* that log >> replay completion. And caused log replay to fail. And mount to fail. >> This is *new* behavior, and bad. >> >> As I understand it. >> >>> It is enforcing a partial replay up to the point where the >>> corruption occurred. Now you should be able to zero the log and the >>> filesystem is not corrupted. >>> >>>> IOW - this seems like an actual functional regression due to the >>>> addition of the verifier, and dchinner's patch gets us back >>>> to the almost-always-fine state we were in prior to the change. >>> >>> Oh, the spin doctor is *in*! >> >> This is not spin. >> >>> This isn't a logical problem with the verifier, it's a logical problem >>> with log replay. We need to find a way for recovery to know whether a >>> given transaction should be replayed. Fixing that is nontrivial. >> >> Right. >> >> And it's been around for years. 
The verifier now detects that >> interim state, and makes things *worse* than they would be had log >> replay been allowed to continue. >> >> Fixing the interim state may be nontrivial; allowing log replay >> to continue to a consistent state as it always has *is* trivial, >> it's what's done in Dave's small patch. >> >>>> As we're at -rc6, it seems quite reasonable to me as a quick >>>> fix to just short-circuit it for now. >>> >>> If we're talking about a short term fix, that's fine. This should be >>> conditional on CONFIG_XFS_DEBUG and marked as such. >>> >>> Long term, removing the verifiers is the wrong thing to do here. We >>> need to fix the recovery bug and then remove this temporary workaround. >>> >>>> If you have time to analyze dave's metadump that's cool, but >>>> this seems like something that really needs to be addressed >>>> before 3.10 gets out the door. >>> >>> If this really is a day one bug then it's been out the door almost >>> twenty years. And you want to hurry now? ;) >> >> We seem to be talking past each other. >> >> The corrupted interim state has been around for years. Up until >> now, log replay completion left things in perfect state. >> >> The verifier now *breaks replay* at that interim point. >> Were it allowed to continue, everything would be fine. >> >> As things stand, it is not fine, and this is a recent change >> which Dave is trying to correct. >> >> Leaving it in place will cause filesystems which were replaying >> logs just fine until recently to now fail with no good way out. > > That is consistent with my understanding of the problem... > > Unfortunately log replay is broken. The verifier has detected this and stopped > replay. Ideally the solution would be to fix log replay, but that is going to > take some time. So, in the near term we're just going to disable the verifier > to allow replay to complete. Right, that's what we're hoping - for 3.10 right? Maybe the talking-past-each-other was only that part. 
I thought you didn't want to disable it for now. > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so > that developers still have a chance at hitting the log replay problem, so that real-world users will still feel the pain ...? Or did you say that backwards (really only *disabling* it under debug?) Ok, confirmed on IRC you mean to disable it if *NOT* debug, enable it under debug. > and a > comment should be added explaining that we've disabled the verifier due to a > specific bug as a temporary workaround and we'll re-enable the verifier once > it's fixed. I'll update the patch and repost. Maybe if the verifiers were *on* under debug that'd make sense. I think putting it under the config is overkill, since anyone who wants to fix it is surely capable of re-enabling it in the code. But if that avoids an impasse, I don't much care. > Are you guys arguing that the log replay bug should not be fixed? Speaking for myself, I'm not arguing that, not at all. (not that I know how to fix it, either) -Eric > -Ben > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 19:54 ` Eric Sandeen @ 2013-06-14 20:22 ` Ben Myers 2013-06-28 18:54 ` Dave Jones 0 siblings, 1 reply; 31+ messages in thread From: Ben Myers @ 2013-06-14 20:22 UTC (permalink / raw) To: Eric Sandeen; +Cc: Dave Jones, xfs Hey Eric, On Fri, Jun 14, 2013 at 02:54:24PM -0500, Eric Sandeen wrote: > On 6/14/13 2:44 PM, Ben Myers wrote: > > Hey Eric, > > > > On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote: > >> On 6/14/13 2:08 PM, Ben Myers wrote: > >>> On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote: > >>>> Ben, isn't it the case that the corruption would only happen if > >>>> log replay failed for some reason (as has always been the case, > >>>> verifier or not), but with the verifier in place, it kills replay > >>>> even w/o other problems due to a logical problem with the > >>>> (recently added) verifiers? > >>> > >>> It seems like the verifier prevented corruption from hitting disk during > >>> log replay. > >> > >> It detected a an inconsistent *interim* state during replay, which is > >> always made correct by log replay completion. But it *stopped* that log > >> replay completion. And caused log replay to fail. And mount to fail. > >> This is *new* behavior, and bad. > >> > >> As I understand it. > >> > >>> It is enforcing a partial replay up to the point where the > >>> corruption occurred. Now you should be able to zero the log and the > >>> filesystem is not corrupted. > >>> > >>>> IOW - this seems like an actual functional regression due to the > >>>> addition of the verifier, and dchinner's patch gets us back > >>>> to the almost-always-fine state we were in prior to the change. > >>> > >>> Oh, the spin doctor is *in*! > >> > >> This is not spin. > >> > >>> This isn't a logical problem with the verifier, it's a logical problem > >>> with log replay. We need to find a way for recovery to know whether a > >>> given transaction should be replayed. 
Fixing that is nontrivial. > >> > >> Right. > >> > >> And it's been around for years. The verifier now detects that > >> interim state, and makes things *worse* than they would be had log > >> replay been allowed to continue. > >> > >> Fixing the interim state may be nontrivial; allowing log replay > >> to continue to a consistent state as it always has *is* trivial, > >> it's what's done in Dave's small patch. > >> > >>>> As we're at -rc6, it seems quite reasonable to me as a quick > >>>> fix to just short-circuit it for now. > >>> > >>> If we're talking about a short term fix, that's fine. This should be > >>> conditional on CONFIG_XFS_DEBUG and marked as such. > >>> > >>> Long term, removing the verifiers is the wrong thing to do here. We > >>> need to fix the recovery bug and then remove this temporary workaround. > >>> > >>>> If you have time to analyze dave's metadump that's cool, but > >>>> this seems like something that really needs to be addressed > >>>> before 3.10 gets out the door. > >>> > >>> If this really is a day one bug then it's been out the door almost > >>> twenty years. And you want to hurry now? ;) > >> > >> We seem to be talking past each other. > >> > >> The corrupted interim state has been around for years. Up until > >> now, log replay completion left things in perfect state. > >> > >> The verifier now *breaks replay* at that interim point. > >> Were it allowed to continue, everything would be fine. > >> > >> As things stand, it is not fine, and this is a recent change > >> which Dave is trying to correct. > >> > >> Leaving it in place will cause filesystems which were replaying > >> logs just fine until recently to now fail with no good way out. > > > > That is consistent with my understanding of the problem... > > > > Unfortunately log replay is broken. The verifier has detected this and stopped > > replay. Ideally the solution would be to fix log replay, but that is going to > > take some time. 
So, in the near term we're just going to disable the verifier > > to allow replay to complete. > > Right, that's what we're hoping - for 3.10 right? Yep. But it should also go back through -stable. > Maybe the talking-past-each-other was only that part. I thought you didn't > want to disable it for now. I didn't want to disable it. That was yesterday. Mark set me straight on how easy this should be to hit. > > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so > > that developers still have a chance at hitting the log replay problem, > > so that real-world users will still feel the pain ...? > > Or did you say that backwards (really only *disabling* it under debug?) > > Ok, confirmed on IRC you mean to disable it if *NOT* debug, enable it > under debug. Right. > > and a > > comment should be added explaining that we've disabled the verifier due to a > > specific bug as a temporary workaround and we'll re-enable the verifier once > > it's fixed. I'll update the patch and repost. > > Maybe if the verifiers were *on* under debug that'd make sense. Sorry for the confusion. > I think putting it under the config is overkill, since anyone who wants > to fix it is surely capable of re-enabling it in the code. But if that > avoids an impasse, I don't much care. Great. Apparently I'm being pedantic today. I'd like it to be clear that we intend for this to be a temporary workaround.... > > Are you guys arguing that the log replay bug should not be fixed? > > Speaking for myself, I'm not arguing that, not at all. > (not that I know how to fix it, either) ...until log replay is fixed. Hell. I'll just pull it in as-is. Reviewed-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 20:22 ` Ben Myers @ 2013-06-28 18:54 ` Dave Jones 2013-06-28 19:24 ` Ben Myers 0 siblings, 1 reply; 31+ messages in thread From: Dave Jones @ 2013-06-28 18:54 UTC (permalink / raw) To: Ben Myers; +Cc: Linus Torvalds, Eric Sandeen, xfs On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote: > > > Are you guys arguing that the log replay bug should not be fixed? > > > > Speaking for myself, I'm not arguing that, not at all. > > (not that I know how to fix it, either) > > ...until log replay is fixed. Hell. I'll just pull it in as-is. > > Reviewed-by: Ben Myers <bpm@sgi.com> Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be. Dave _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-28 18:54 ` Dave Jones @ 2013-06-28 19:24 ` Ben Myers 2013-06-28 19:28 ` Dave Jones 0 siblings, 1 reply; 31+ messages in thread From: Ben Myers @ 2013-06-28 19:24 UTC (permalink / raw) To: Dave Jones; +Cc: Eric Sandeen, Linus Torvalds, xfs Hi Dave, On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote: > On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote: > > > > > Are you guys arguing that the log replay bug should not be fixed? > > > > > > Speaking for myself, I'm not arguing that, not at all. > > > (not that I know how to fix it, either) > > > > ...until log replay is fixed. Hell. I'll just pull it in as-is. > > > > Reviewed-by: Ben Myers <bpm@sgi.com> > > Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be. This one was in my pull request for -rc6. Looks like Linus merged it here: http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68 We still need to propose this one for -stable, IIRC. Thanks, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-28 19:24 ` Ben Myers @ 2013-06-28 19:28 ` Dave Jones 2013-06-28 19:31 ` Ben Myers 0 siblings, 1 reply; 31+ messages in thread From: Dave Jones @ 2013-06-28 19:28 UTC (permalink / raw) To: Ben Myers; +Cc: Eric Sandeen, Linus Torvalds, xfs On Fri, Jun 28, 2013 at 02:24:20PM -0500, Ben Myers wrote: > Hi Dave, > > On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote: > > On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote: > > > > > > > Are you guys arguing that the log replay bug should not be fixed? > > > > > > > > Speaking for myself, I'm not arguing that, not at all. > > > > (not that I know how to fix it, either) > > > > > > ...until log replay is fixed. Hell. I'll just pull it in as-is. > > > > > > Reviewed-by: Ben Myers <bpm@sgi.com> > > > > Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be. > > This one was in my pull request for -rc6. Looks like Linus merged it here: > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68 > > We still need to propose this one for -stable, IIRC. Bah, no Reported-by: Dave _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-28 19:28 ` Dave Jones @ 2013-06-28 19:31 ` Ben Myers 0 siblings, 0 replies; 31+ messages in thread From: Ben Myers @ 2013-06-28 19:31 UTC (permalink / raw) To: Dave Jones; +Cc: Eric Sandeen, Linus Torvalds, xfs On Fri, Jun 28, 2013 at 03:28:28PM -0400, Dave Jones wrote: > On Fri, Jun 28, 2013 at 02:24:20PM -0500, Ben Myers wrote: > > Hi Dave, > > > > On Fri, Jun 28, 2013 at 02:54:06PM -0400, Dave Jones wrote: > > > On Fri, Jun 14, 2013 at 03:22:23PM -0500, Ben Myers wrote: > > > > > > > > > Are you guys arguing that the log replay bug should not be fixed? > > > > > > > > > > Speaking for myself, I'm not arguing that, not at all. > > > > > (not that I know how to fix it, either) > > > > > > > > ...until log replay is fixed. Hell. I'll just pull it in as-is. > > > > > > > > Reviewed-by: Ben Myers <bpm@sgi.com> > > > > > > Is this on its way to Linus ? Given it's a regression over 3.9, it probably should be. > > > > This one was in my pull request for -rc6. Looks like Linus merged it here: > > > > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d58c6ff0b779c5adae2a8596fde69cb45f2a5d68 > > > > We still need to propose this one for -stable, IIRC. > > Bah, no Reported-by: Sorry Dave. I'll do better next time. Regards, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 19:44 ` Ben Myers 2013-06-14 19:54 ` Eric Sandeen @ 2013-06-15 0:56 ` Dave Chinner 2013-06-17 14:53 ` Ben Myers 1 sibling, 1 reply; 31+ messages in thread From: Dave Chinner @ 2013-06-15 0:56 UTC (permalink / raw) To: Ben Myers; +Cc: Dave Jones, Eric Sandeen, xfs On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote: > Unfortunately log replay is broken. The verifier has detected this and stopped > replay. Ideally the solution would be to fix log replay, but that is going to > take some time. So, in the near term we're just going to disable the verifier > to allow replay to complete. > > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so > that developers still have a chance at hitting the log replay problem, and a > comment should be added explaining that we've disabled the verifier due to a > specific bug as a temporary workaround and we'll re-enable the verifier once > it's fixed. I'll update the patch and repost. > > Are you guys arguing that the log replay bug should not be fixed? I'm not arguing that it should not be fixed, I'm *stating* that it *can't be fixed* for non-CRC filesystems. That is, the best we can do is work around this deficiency in log recovery with some kind of self-defusing warning.... To fix it properly, you need to know the age of the object being overwritten relative to the age of overwrite data. For non-CRC filesystems we don't have that information in the metadata object being overwritten. We can't even correctly identify the object being overwritten. So, it's simply not fixable in log recovery for non-CRC filesystems, and the LSN stamped in every piece of metadata at writeback time for CRC enabled filesystems is designed precisely to avoid this problem. 
Indeed, the LSN stamp is a far more effective method than what used to be in the inode core to avoid this problem for unlogged inode size updates - the di_flushiter field, which tried to prevent log recovery from replaying inode core updates over more recently written inodes. Note the comment in xfs_iflush_int():

/*
 * bump the flush iteration count, used to detect flushes which
 * postdate a log record during recovery. This is redundant as we now
 * log every change and hence this can't happen. Still, it doesn't hurt.
 */
ip->i_d.di_flushiter++;

And this in xlog_recover_inode_pass2():

/* Skip replay when the on disk inode is newer than the log one */
if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {

This recovery problem has been around forever, and we cannot fix log recovery without age information in the metadata on disk that recovery is going to overwrite. CRC enabled filesystems have that information on disk, existing filesystems don't. Therefore, we can only solve the recovery problem for CRC enabled filesystems... We could probably also avoid the problem by modifying the way we do writeback from the AIL to limit it only to objects at the tail LSN, but that has a horrific performance penalty associated with it for many common workloads because of the way relogging works. And for a problem that I suspect has occurred maybe 5 times in 10 years, modifying metadata writeback to avoid it is a bad tradeoff to make for just about everyone... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-15 0:56 ` Dave Chinner @ 2013-06-17 14:53 ` Ben Myers 2013-06-18 1:22 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Ben Myers @ 2013-06-17 14:53 UTC (permalink / raw) To: Dave Chinner; +Cc: Dave Jones, Eric Sandeen, xfs Hey Dave, On Sat, Jun 15, 2013 at 10:56:39AM +1000, Dave Chinner wrote: > On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote: > > Unfortunately log replay is broken. The verifier has detected this and stopped > > replay. Ideally the solution would be to fix log replay, but that is going to > > take some time. So, in the near term we're just going to disable the verifier > > to allow replay to complete. > > > > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so > > that developers still have a chance at hitting the log replay problem, and a > > comment should be added explaining that we've disabled the verifier due to a > > specific bug as a temporary workaround and we'll re-enable the verifier once > > it's fixed. I'll update the patch and repost. > > > > Are you guys arguing that the log replay bug should not be fixed? > > I'm not arguing that it should not be fixed, Good. > I'm *stating* that it > *can't be fixed* for non-CRC filesystems. That is, the best we can > do is work around this deficiency in log recovery with some kind of > self-defusing warning.... I think you're mistaken here. In order to prevent this from happening on non-crc filesystems we need only ensure that all of the buffer modifications in the log for a given buffer are played back onto the buffer before we write it out to disk. My impression is that the final state of the buffer is always the correct one, so it's a matter of preventing any intermediate states from being written. > To fix it properly, you need to know the age of the object being > overwritten relative to the age of overwrite data. 
For non-CRC > filesystems we don't have that information in the metadata object > being overwritten. Only if you take the tack that you are not going to write intermediate state into the buffer. I agree this is one way to fix the problem, but I'm not sure that it's the only option. > We can't even correctly identify the object being > overwritten. Can you be more specific about what you mean there? > So, it's simply not fixable in log recovery for non-CRC filesystems, > and the LSN stamped in every piece of metadata at writeback time for > CRC enabled filesystems is designed precisely to avoid this problem. Again, I'm not so sure that there aren't other ways to fix this, which I'm going to continue to look in to. I think you're giving up a bit prematurely. > Indeed, the LSN stamp is a far more effective method than what used > to be in the inode core to try to avoid the problem for unlogged > inode size updates to try to prevent log recovery from replaying > inode core updates over more recently written inodes - the > di_flushiter field. Note the comment in xfs_flush_int(): > > /* > * bump the flush iteration count, used to detect flushes which > * postdate a log record during recovery. This is redundant as we now > * log every change and hence this can't happen. Still, it doesn't hurt. > */ > ip->i_d.di_flushiter++; > > And this in xlog_recover_inode_pass2(): > > /* Skip replay when the on disk inode is newer than the log one */ > if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) { Why is using the lsn more effective than the counter? Seems like the counter would be adequate as is. > This recovery problem has been around forever, and it we cannot fix > log recovery with age information in the metadata on disk that > recovery is going overwrite. CRC enabled filesystems have that > information on disk, existing filesystems don't. Therefore, we can > only solve the recovery problem for CRC enabled filesystems... Nah, I think we can find another way. 
I am also curious to look back at some older recovery code to see if this truly is a day one bug. > We could probably also avoid the problem by modifying the way we do > writeback from the AIL to limit it only to objects at the tail LSN, > but that has a horrific performance penalty associated with it for > many common workloads because of the way relogging works. And for a > problem that I've suspected has occurred maybe 5 times in 10 years, > modifying metadata writeback to avoid this problem is a bad tradeoff > to make for just about everyone... How would limiting writeback from the AIL to only structures at the tail resolve the issue of playback of intermediate states onto newer buffers during recovery? I don't understand what you're getting at here. Anyway, I agree that it would be better to try and avoid any massive performance impacts during regular operation. A performance impact during recovery is less of a concern but should be avoided if possible. I think there may be some creative ways to resolve this problem for older filesystems which are worth looking into. Regards, Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-17 14:53 ` Ben Myers @ 2013-06-18 1:22 ` Dave Chinner 0 siblings, 0 replies; 31+ messages in thread From: Dave Chinner @ 2013-06-18 1:22 UTC (permalink / raw) To: Ben Myers; +Cc: Dave Jones, Eric Sandeen, xfs On Mon, Jun 17, 2013 at 09:53:57AM -0500, Ben Myers wrote: > Hey Dave, > > On Sat, Jun 15, 2013 at 10:56:39AM +1000, Dave Chinner wrote: > > On Fri, Jun 14, 2013 at 02:44:53PM -0500, Ben Myers wrote: > > > Unfortunately log replay is broken. The verifier has detected this and stopped > > > replay. Ideally the solution would be to fix log replay, but that is going to > > > take some time. So, in the near term we're just going to disable the verifier > > > to allow replay to complete. > > > > > > I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so > > > that developers still have a chance at hitting the log replay problem, and a > > > comment should be added explaining that we've disabled the verifier due to a > > > specific bug as a temporary workaround and we'll re-enable the verifier once > > > it's fixed. I'll update the patch and repost. > > > > > > Are you guys arguing that the log replay bug should not be fixed? > > > > I'm not arguing that it should not be fixed, > > Good. > > > I'm *stating* that it > > *can't be fixed* for non-CRC filesystems. That is, the best we can > > do is work around this deficiency in log recovery with some kind of > > self-defusing warning.... > > I think you're mistaken here. In order to prevent this from happening on > non-crc filesystems we need only ensure that all of the buffer modifications in > the log for a given buffer are played back onto the buffer before we write it > out to disk. My impression is that the final state of the buffer is always the > correct one, so it's a matter of preventing any intermediate states from being > written. Sure. 
That's theoretically possible to do, but in practice it's extremely problematic. First of all, the log can be larger than system RAM, so you can't hold all the changes in memory to replay them completely before allowing writeback. So, we have to have writeback in some form. And if we have to have writeback, that means we need to ensure we are writing checkpoints as a whole unit of change.... If we don't write back checkpoints as a whole, then as a result of omitting certain metadata we end up with an inconsistent state on disk - exactly the problem we are trying to avoid. i.e. not writing back metadata that changed in future checkpoints *guarantees* the filesystem goes through inconsistent intermediate states. And then we have to track every item in the log to determine if it is used in multiple checkpoints. This is basically the same as the cancelled buffer tracking, which means every object requires another ~32 bytes of RAM to track, plus a new index mechanism that can track several hundred thousand objects in a scalable fashion and has low overhead random insert, lookup and delete behaviour.

> > To fix it properly, you need to know the age of the object being
> > overwritten relative to the age of overwrite data. For non-CRC
> > filesystems we don't have that information in the metadata object
> > being overwritten.
>
> Only if you take the tack that you are not going to write intermediate state
> into the buffer. I agree this is one way to fix the problem, but I'm not sure
> that it's the only option.
>
> > We can't even correctly identify the object being
> > overwritten.
>
> Can you be more specific about what you mean there?

Not all metadata that is logged contains self-describing information in non-CRC filesystems. If we can't identify metadata reliably, we can't make any decisions based on the contents....
> > Indeed, the LSN stamp is a far more effective method than what used
> > to be in the inode core to try to avoid the problem for unlogged
> > inode size updates to try to prevent log recovery from replaying
> > inode core updates over more recently written inodes - the
> > di_flushiter field. Note the comment in xfs_flush_int():
> >
> > /*
> > * bump the flush iteration count, used to detect flushes which
> > * postdate a log record during recovery. This is redundant as we now
> > * log every change and hence this can't happen. Still, it doesn't hurt.
> > */
> > ip->i_d.di_flushiter++;
> >
> > And this in xlog_recover_inode_pass2():
> >
> > /* Skip replay when the on disk inode is newer than the log one */
> > if (dicp->di_flushiter < be16_to_cpu(dip->di_flushiter)) {
>
> Why is using the lsn more effective than the counter? Seems like the counter
> would be adequate as is.

The counter is only valid within the life of a single inode. It doesn't track the inode across allocation and freeing....

> > We could probably also avoid the problem by modifying the way we do
> > writeback from the AIL to limit it only to objects at the tail LSN,
> > but that has a horrific performance penalty associated with it for
> > many common workloads because of the way relogging works. And for a
> > problem that I've suspected has occurred maybe 5 times in 10 years,
> > modifying metadata writeback to avoid this problem is a bad tradeoff
> > to make for just about everyone...
>
> How would limiting writeback from the AIL to only structures at the tail
> resolve the issue of playback of intermediate states onto newer buffers during
> recovery? I don't understand what you're getting at here.

The source of the problem is that we write back metadata with too-recent information in it. What determines "too recent"? The LSN of the log item, which determines its position in the AIL.
If we want to avoid writing back something that is "too recent", then we've got to stop newer objects on the AIL from being written to disk before the older modifications are cleared from the log. i.e. we can avoid the recovery problem by serialising metadata flushing to a single checkpoint at a time. i.e. we flush the objects from the tail LSN only and wait for them to complete and move the tail of the log forwards before we write the next set of objects that are now at the tail LSN.... > I think there may be some creative ways to resolve this problem for older > filesystems which are worth looking into. Good luck, but I think you are embarking on a fool's errand. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 16:09 ` Ben Myers 2013-06-14 16:15 ` Eric Sandeen @ 2013-06-14 16:17 ` Dave Jones 2013-06-14 16:31 ` Ben Myers 1 sibling, 1 reply; 31+ messages in thread From: Dave Jones @ 2013-06-14 16:17 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On Fri, Jun 14, 2013 at 11:09:40AM -0500, Ben Myers wrote: > > > I think your suggestion that non-debug systems could warn instead of > > > fail is a good one, but removing the verifier altogether is > > > inappropriate. > > > > Changing every single verifier in a non-trivial way is not something > > I'm about to do for a -rc6 kernel. Removing the verifiers from log > > recovery just reverts to the pre-3.8 situation, so is perfectly > > acceptable short term solution while we do the more invasive verify > > changes. > > > > > Can you make the metadump available? I need to understand this better > > > before I can sign off. Also: Any idea how far back this one goes? > > > > No, I can't make the metadump available to you - it was provided > > privately and not obfuscated and so you'd have to ask Dave for it. > > Dave (Jones), could you make the metadump available to me? I'd like to > understand this a little bit better. I'm a bit uncomfortable with the > proposition that we should corrupt silently in this case... Sorry, I don't have it any more. I'll see if I can recreate the problem next week and prepare another dump. Dave _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors 2013-06-14 16:17 ` Dave Jones @ 2013-06-14 16:31 ` Ben Myers 0 siblings, 0 replies; 31+ messages in thread From: Ben Myers @ 2013-06-14 16:31 UTC (permalink / raw) To: Dave Jones; +Cc: xfs Hi Dave, On Fri, Jun 14, 2013 at 12:17:24PM -0400, Dave Jones wrote: > On Fri, Jun 14, 2013 at 11:09:40AM -0500, Ben Myers wrote: > > > > > I think your suggestion that non-debug systems could warn instead of > > > > fail is a good one, but removing the verifier altogether is > > > > inappropriate. > > > > > > Changing every single verifier in a non-trivial way is not something > > > I'm about to do for a -rc6 kernel. Removing the verifiers from log > > > recovery just reverts to the pre-3.8 situation, so is perfectly > > > acceptable short term solution while we do the more invasive verify > > > changes. > > > > > > > Can you make the metadump available? I need to understand this better > > > > before I can sign off. Also: Any idea how far back this one goes? > > > > > > No, I can't make the metadump available to you - it was provided > > > privately and not obfuscated and so you'd have to ask Dave for it. > > > > Dave (Jones), could you make the metadump available to me? I'd like to > > understand this a little bit better. I'm a bit uncomfortable with the > > proposition that we should corrupt silently in this case... > > Sorry, I don't have it any more. I'll see if I can recreate the problem > next week and prepare another dump. Much appreciated. Thanks! -Ben _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats 2013-06-12 2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner 2013-06-12 2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner @ 2013-06-12 2:19 ` Dave Chinner 2013-06-13 0:58 ` Ben Myers 2013-06-12 2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner 2 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2013-06-12 2:19 UTC (permalink / raw) To: xfs; +Cc: bpm From: Dave Chinner <dchinner@redhat.com> Michael L. Semon has been testing CRC patches on a 32 bit system and has been seeing assert failures in the directory code from xfs/080. Thanks to Michael's heroic efforts with printk debugging, we found that the problem was that the last free space being left in the directory structure was too small to fit an unused tag structure, and it was being corrupted, leading to an attempt to log a region out of bounds. Hence the assert failure looked something like:

.....
#5 calling xfs_dir2_data_log_unused() 36 32
#1 4092 4095 4096
#2 8182 8183 4096
XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568

Where #1 showed the first region of the dup being logged (i.e. the last 4 bytes of a directory buffer) and #2 shows the corrupt values being calculated from the length of the dup entry, which overflowed the size of the buffer. It turns out that the problem was not in the logging code, nor in the freespace handling code. It is an initial condition bug that only shows up on 32 bit systems. When a new buffer is initialised, here's the freespace that is set up:

[ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname()
[ 172.316346] #9 calling xfs_dir2_data_log_unused()
[ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096
[ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096

Note the offset of the first region being logged? It's 60 bytes into the buffer.
Once I saw that, I pretty much knew what was causing the bug. Essentially, all directory entries are rounded to 8 bytes in length, and all entries start on an 8 byte alignment. This means that we can decode in place as variables are naturally aligned. With the directory data supposedly starting on an 8 byte boundary, and all entries padded to 8 bytes, the minimum freespace in a directory block is supposed to be 8 bytes, which is large enough to fit an unused data entry structure (6 bytes in size). The fact we only have 4 bytes of free space indicates a directory data block alignment problem. And what do you know - there's an implicit hole in the directory data block header for the CRC format, which means the header is 60 bytes on 32 bit Intel systems and 64 bytes on 64 bit systems. Needs padding. And while looking at the structures, I found the same problem in the attr leaf header. Fix them both. Note that this only affects 32 bit systems with CRCs enabled. Everything else is just fine. Note that filesystems created before this fix on such systems will not be readable with this fix applied. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Debugged-by: Michael L.
Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_attr_leaf.h | 1 + fs/xfs/xfs_dir2_format.h | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_attr_leaf.h b/fs/xfs/xfs_attr_leaf.h index f9d7846..444a770 100644 --- a/fs/xfs/xfs_attr_leaf.h +++ b/fs/xfs/xfs_attr_leaf.h @@ -128,6 +128,7 @@ struct xfs_attr3_leaf_hdr { __u8 holes; __u8 pad1; struct xfs_attr_leaf_map freemap[XFS_ATTR_LEAF_MAPSIZE]; + __be32 pad2; /* 64 bit alignment */ }; #define XFS_ATTR3_LEAF_CRC_OFF (offsetof(struct xfs_attr3_leaf_hdr, info.crc)) diff --git a/fs/xfs/xfs_dir2_format.h b/fs/xfs/xfs_dir2_format.h index 995f1f5..7826782 100644 --- a/fs/xfs/xfs_dir2_format.h +++ b/fs/xfs/xfs_dir2_format.h @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr { struct xfs_dir3_data_hdr { struct xfs_dir3_blk_hdr hdr; xfs_dir2_data_free_t best_free[XFS_DIR2_DATA_FD_COUNT]; + __be32 pad; /* 64 bit alignment */ }; #define XFS_DIR3_DATA_CRC_OFF offsetof(struct xfs_dir3_data_hdr, hdr.crc) @@ -477,7 +478,7 @@ struct xfs_dir3_leaf_hdr { struct xfs_da3_blkinfo info; /* header for da routines */ __be16 count; /* count of entries */ __be16 stale; /* count of stale entries */ - __be32 pad; + __be32 pad; /* 64 bit alignment */ }; struct xfs_dir3_icleaf_hdr { @@ -715,7 +716,7 @@ struct xfs_dir3_free_hdr { __be32 firstdb; /* db of first entry */ __be32 nvalid; /* count of valid entries */ __be32 nused; /* count of used entries */ - __be32 pad; /* 64 bit alignment. */ + __be32 pad; /* 64 bit alignment */ }; struct xfs_dir3_free { -- 1.7.10.4 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats 2013-06-12 2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner @ 2013-06-13 0:58 ` Ben Myers 2013-06-13 1:40 ` Michael L. Semon 2013-06-13 2:27 ` Dave Chinner 0 siblings, 2 replies; 31+ messages in thread From: Ben Myers @ 2013-06-13 0:58 UTC (permalink / raw) To: Dave Chinner, Michael L. Semon; +Cc: xfs On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > Michael L. Semon has been testing CRC patches ona 32 bit system and on a > been seeing assert failures in the directory code from xfs/080. > Thanks to Michael's heroic efforts with printk debugging, we found > that the problem was that the last free space being left in the > directory structure was too small to fit a unused tag structure and > it was being corrupted and attempting to log a region out of bounds. > Hence the assert failure looked something like: > > ..... > #5 calling xfs_dir2_data_log_unused() 36 32 > #1 4092 4095 4096 > #2 8182 8183 4096 first? last? bp->b_length? > XFS: Assertion failed: first <= last && last < BBTOB(bp->b_length), file: fs/xfs/xfs_trans_buf.c, line: 568 > > Where #1 showed the first region of the dup being logged (i.e. the > last 4 bytes of a directory buffer) and #2 shows the corrupt values > being calculated from the length of the dup entry which overflowed > the size of the buffer. > > It turns out that the problem was not in the logging code, nor in > the freespace handling code. It is an initial condition bug that > only shows up on 32 bit systems. 
When a new buffer is initialised, > where's the freespace that is set up: > > [ 172.316249] calling xfs_dir2_leaf_addname() from xfs_dir_createname() > [ 172.316346] #9 calling xfs_dir2_data_log_unused() > [ 172.316351] #1 calling xfs_trans_log_buf() 60 63 4096 > [ 172.316353] #2 calling xfs_trans_log_buf() 4094 4095 4096 > > Note the offset of the first region being logged? It's 60 bytes into > the buffer. Once I saw that, I pretty much knew what the bug was > going to be caused by this. > > Essentially, all direct entries are rounded to 8 bytes in length, > and all entries start with an 8 byte alignment. This means that we > can decode inplace as variables are naturally aligned. With the > directory data supposedly starting on a 8 byte boundary, and all > entries padded to 8 bytes, the minimum freespace in a directory > block is supposed to be 8 bytes, which is large enough to fit a > unused data entry structure (6 bytes in size). The fact we only have > 4 bytes of free space indicates a directory data block alignment > problem. > > And what do you know - there's an implicit hole in the directory > data block header for the CRC format, which means the header is 60 > byte on 32 bit intel systems and 64 bytes on 64 bit systems. Needs > padding. And while looking at the structures, I found the same > problem in the attr leaf header. Fix them both. > > Note that this only affects 32 bit systems with CRCs enabled. > Everything else is just fine. Note that filesystems created before CRC enabled filesystems I suggest this be added to head off any confusion. > this fix on such systems will not be readable with this fix applied. > > Reported-by: Michael L. Semon <mlsemon35@gmail.com> > Debugged-by: Michael L. 
Semon <mlsemon35@gmail.com> > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > fs/xfs/xfs_attr_leaf.h | 1 + > fs/xfs/xfs_dir2_format.h | 5 +++-- > 2 files changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/xfs/xfs_attr_leaf.h b/fs/xfs/xfs_attr_leaf.h > index f9d7846..444a770 100644 > --- a/fs/xfs/xfs_attr_leaf.h > +++ b/fs/xfs/xfs_attr_leaf.h > @@ -128,6 +128,7 @@ struct xfs_attr3_leaf_hdr { > __u8 holes; > __u8 pad1; > struct xfs_attr_leaf_map freemap[XFS_ATTR_LEAF_MAPSIZE]; > + __be32 pad2; /* 64 bit alignment */ > }; > > #define XFS_ATTR3_LEAF_CRC_OFF (offsetof(struct xfs_attr3_leaf_hdr, info.crc)) > diff --git a/fs/xfs/xfs_dir2_format.h b/fs/xfs/xfs_dir2_format.h > index 995f1f5..7826782 100644 > --- a/fs/xfs/xfs_dir2_format.h > +++ b/fs/xfs/xfs_dir2_format.h > @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr { > struct xfs_dir3_data_hdr { > struct xfs_dir3_blk_hdr hdr; > xfs_dir2_data_free_t best_free[XFS_DIR2_DATA_FD_COUNT]; > + __be32 pad; /* 64 bit alignment */ I counted these up and it looks fine. Nice work gents. Reviewed-by: Ben Myers <bpm@sgi.com> _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats 2013-06-13 0:58 ` Ben Myers @ 2013-06-13 1:40 ` Michael L. Semon 2013-06-13 2:27 ` Dave Chinner 1 sibling, 0 replies; 31+ messages in thread From: Michael L. Semon @ 2013-06-13 1:40 UTC (permalink / raw) To: Ben Myers; +Cc: xfs On 06/12/2013 08:58 PM, Ben Myers wrote: > On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote: >> From: Dave Chinner <dchinner@redhat.com> >> >> Michael L. Semon has been testing CRC patches ona 32 bit system and > on a > >> been seeing assert failures in the directory code from xfs/080. >> Thanks to Michael's heroic efforts with printk debugging, we found >> that the problem was that the last free space being left in the >> directory structure was too small to fit a unused tag structure and >> it was being corrupted and attempting to log a region out of bounds. >> Hence the assert failure looked something like: >> >> ..... >> #5 calling xfs_dir2_data_log_unused() 36 32 >> #1 4092 4095 4096 >> #2 8182 8183 4096 > first? > last? > bp->b_length? BBTOB(bp->b_length) This is all terrible numbering on my part... >> #1 4092 4095 4096 >> #2 8182 8183 4096 xfs_dir2_data_log_unused() calls xfs_trans_log_buf() twice in the same function. #1 is the first call, #2 is the second call, and there's no running count. The printk() is a copy-and-paste of those two function calls plus a BBTOB(bp->b_length) below it. >> #5 calling xfs_dir2_data_log_unused() 36 32 The #5 was caused by numbering all the calls to xfs_dir2_data_log_unused() to see if one code path was being called every time. #5 is in the xfs_dir2_data_use_free() function, starting with this else-if... 
else if (matchfront) {
	newdup = (xfs_dir2_data_unused_t *)((char *)hdr + offset + len);
	newdup->freetag = cpu_to_be16(XFS_DIR2_DATA_FREE_TAG);
	newdup->length = cpu_to_be16(oldlen - len);
	*xfs_dir2_data_unused_tag_p(newdup) =
		cpu_to_be16((char *)newdup - (char *)hdr);
	printk(KERN_INFO
		"#5 calling xfs_dir2_data_log_unused() %d %d\n",
		oldlen, len);
	xfs_dir2_data_log_unused(tp, bp, newdup);

Sorry about that! Michael _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats 2013-06-13 0:58 ` Ben Myers 2013-06-13 1:40 ` Michael L. Semon @ 2013-06-13 2:27 ` Dave Chinner 2013-06-13 21:31 ` Ben Myers 1 sibling, 1 reply; 31+ messages in thread From: Dave Chinner @ 2013-06-13 2:27 UTC (permalink / raw) To: Ben Myers; +Cc: Michael L. Semon, xfs On Wed, Jun 12, 2013 at 07:58:19PM -0500, Ben Myers wrote: > On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > Michael L. Semon has been testing CRC patches ona 32 bit system and > on a > > > been seeing assert failures in the directory code from xfs/080. > > Thanks to Michael's heroic efforts with printk debugging, we found > > that the problem was that the last free space being left in the > > directory structure was too small to fit a unused tag structure and > > it was being corrupted and attempting to log a region out of bounds. > > Hence the assert failure looked something like: > > > > ..... > > #5 calling xfs_dir2_data_log_unused() 36 32 > > #1 4092 4095 4096 > > #2 8182 8183 4096 > first? > last? > bp->b_length? Yup. > > > > Note that this only affects 32 bit systems with CRCs enabled. > > Everything else is just fine. Note that filesystems created before > CRC enabled filesystems > > I suggest this be added to head off any confusion. Sure. Do I need to resubmit this, or are you going to just modify the commit message yourself before applying it? > > index 995f1f5..7826782 100644 > > --- a/fs/xfs/xfs_dir2_format.h > > +++ b/fs/xfs/xfs_dir2_format.h > > @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr { > > struct xfs_dir3_data_hdr { > > struct xfs_dir3_blk_hdr hdr; > > xfs_dir2_data_free_t best_free[XFS_DIR2_DATA_FD_COUNT]; > > + __be32 pad; /* 64 bit alignment */ > > I counted these up and it looks fine. Nice work gents. 
pahole is a much better way of determining structure size - it tells
you exactly what the compiler did, rather than having to assume what
the compiler is going to do...

$ pahole fs/xfs/xfs_dir2_data.o | less
.....
struct xfs_dir3_blk_hdr {
	__be32                     magic;        /*     0     4 */
	__be32                     crc;          /*     4     4 */
	__be64                     blkno;        /*     8     8 */
	__be64                     lsn;          /*    16     8 */
	uuid_t                     uuid;         /*    24    16 */
	__be64                     owner;        /*    40     8 */

	/* size: 48, cachelines: 1, members: 6 */
	/* last cacheline: 48 bytes */
};
struct xfs_dir3_data_hdr {
	struct xfs_dir3_blk_hdr    hdr;          /*     0    48 */
	xfs_dir2_data_free_t       best_free[3]; /*    48    12 */
	__be32                     pad;          /*    60     4 */
	/* --- cacheline 1 boundary (64 bytes) --- */

	/* size: 64, cachelines: 1, members: 3 */
};
....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats
  2013-06-13  2:27 ` Dave Chinner
@ 2013-06-13 21:31 ` Ben Myers
  0 siblings, 0 replies; 31+ messages in thread
From: Ben Myers @ 2013-06-13 21:31 UTC (permalink / raw)
To: Dave Chinner; +Cc: Michael L. Semon, xfs

Hey Dave,

On Thu, Jun 13, 2013 at 12:27:29PM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2013 at 07:58:19PM -0500, Ben Myers wrote:
> > On Wed, Jun 12, 2013 at 12:19:07PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Michael L. Semon has been testing CRC patches ona 32 bit system and
> >                                                  on a
> > > been seeing assert failures in the directory code from xfs/080.
> > > Thanks to Michael's heroic efforts with printk debugging, we found
> > > that the problem was that the last free space being left in the
> > > directory structure was too small to fit a unused tag structure and
> > > it was being corrupted and attempting to log a region out of bounds.
> > > Hence the assert failure looked something like:
> > >
> > > .....
> > > #5 calling xfs_dir2_data_log_unused() 36 32
> > > #1 4092 4095 4096
> > > #2 8182 8183 4096
> >
> > first?
> > last?
> > bp->b_length?
>
> Yup.
>
> > > Note that this only affects 32 bit systems with CRCs enabled.
> > > Everything else is just fine. Note that filesystems created before
> >                                           CRC enabled filesystems
> >
> > I suggest this be added to head off any confusion.
>
> Sure. Do I need to resubmit this, or are you going to just modify
> the commit message yourself before applying it?

Nah, I'll do it.

> > > index 995f1f5..7826782 100644
> > > --- a/fs/xfs/xfs_dir2_format.h
> > > +++ b/fs/xfs/xfs_dir2_format.h
> > > @@ -266,6 +266,7 @@ struct xfs_dir3_blk_hdr {
> > >  struct xfs_dir3_data_hdr {
> > >  	struct xfs_dir3_blk_hdr	hdr;
> > >  	xfs_dir2_data_free_t	best_free[XFS_DIR2_DATA_FD_COUNT];
> > > +	__be32			pad;	/* 64 bit alignment */
> >
> > I counted these up and it looks fine. Nice work gents.
>
> pahole is a much better way of determining structure size - it tells
> you exactly what the compiler did, rather than having to assume what
> the compiler is going to do...
>
> $ pahole fs/xfs/xfs_dir2_data.o | less

Thanks, I'll check it out.

Applied this.

-Ben
* [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
  2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
  2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
@ 2013-06-12  2:19 ` Dave Chinner
  2013-06-13 19:16   ` Ben Myers
  2 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2013-06-12  2:19 UTC (permalink / raw)
To: xfs; +Cc: bpm

From: Dave Chinner <dchinner@redhat.com>

For CRC enabled filesystems, the BMBT is rooted in an inode, so it
passes through a difference code path on root splits to the
freespace and inode btrees. This is much less traversed by xfstests
than the other trees. When testing on a 1k block size filesystem,
I've been seeing ASSERT failures in generic/234 like:

XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317

which are generally preceded by a lblock check failure. I noticed
this in the bmbt stats:

$ pminfo -f xfs.btree.block_map

xfs.btree.block_map.lookup
    value 39135

xfs.btree.block_map.compare
    value 268432

xfs.btree.block_map.insrec
    value 15786

xfs.btree.block_map.delrec
    value 13884

xfs.btree.block_map.newroot
    value 2

xfs.btree.block_map.killroot
    value 0
.....

Very little coverage of root splits and merges. Indeed, on a 4k
filesystem, block_map.newroot and block_map.killroot are both zero.
i.e the code is not exercised at all, and it's the only generic
btree infrastruct operation that is not exercised by a default run
of xfstests.

Turns out that on a 1k filesystem, generic/234 accounts for one of
those two root splits, and that is somewhat of a smoking gun. In
fact, it's the same problem we saw in the directory/attr code where
headers are memcpy()d from one block to another without updating the
self describing metadata.
Simple fix - when copying the header out of the root block, make
sure the block number is updated correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_btree.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_btree.c b/fs/xfs/xfs_btree.c
index 8804b8a..0903960 100644
--- a/fs/xfs/xfs_btree.c
+++ b/fs/xfs/xfs_btree.c
@@ -2544,7 +2544,17 @@ xfs_btree_new_iroot(
 	if (error)
 		goto error0;
 
+	/*
+	 * we can't just memcpy() the root in for CRC enabled btree blocks.
+	 * In that case have to also ensure the blkno remains correct
+	 */
 	memcpy(cblock, block, xfs_btree_block_len(cur));
+	if (cur->bc_flags & XFS_BTREE_CRC_BLOCKS) {
+		if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
+			cblock->bb_u.l.bb_blkno = cpu_to_be64(cbp->b_bn);
+		else
+			cblock->bb_u.s.bb_blkno = cpu_to_be64(cbp->b_bn);
+	}
 
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
-- 
1.7.10.4
* Re: [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
@ 2013-06-13 19:16 ` Ben Myers
  2013-06-14  0:21   ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Ben Myers @ 2013-06-13 19:16 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs

On Wed, Jun 12, 2013 at 12:19:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> For CRC enabled filesystems, the BMBT is rooted in an inode, so it
> passes through a difference code path on root splits to the
                   different than
> freespace and inode btrees. This is much less traversed by xfstests
> than the other trees. When testing on a 1k block size filesystem,
> I've been seeing ASSERT failures in generic/234 like:
>
> XFS: Assertion failed: cur->bc_btnum != XFS_BTNUM_BMAP || cur->bc_private.b.allocated == 0, file: fs/xfs/xfs_btree.c, line: 317
>
> which are generally preceded by a lblock check failure. I noticed
> this in the bmbt stats:
>
> $ pminfo -f xfs.btree.block_map
>
> xfs.btree.block_map.lookup
>     value 39135
>
> xfs.btree.block_map.compare
>     value 268432
>
> xfs.btree.block_map.insrec
>     value 15786
>
> xfs.btree.block_map.delrec
>     value 13884
>
> xfs.btree.block_map.newroot
>     value 2
>
> xfs.btree.block_map.killroot
>     value 0
> .....
>
> Very little coverage of root splits and merges. Indeed, on a 4k
> filesystem, block_map.newroot and block_map.killroot are both zero.
> i.e the code is not exercised at all, and it's the only generic
     .
> btree infrastruct operation that is not exercised by a default run
         infrastructure

Cleaned those up.

> of xfstests.
>
> Turns out that on a 1k filesystem, generic/234 accounts for one of
> those two root splits, and that is somewhat of a smoking gun. In
> fact, it's the same problem we saw in the directory/attr code where
> headers are memcpy()d from one block to another without updating the
> self describing metadata.
It is very interesting that this area of code is exercised so
infrequently. I remember seeing a paper that described a tool that
would list codepaths that are exercised during a test run. Does that
ring a bell? It seems like this might be worth looking into more
generally.

> Simple fix - when copying the header out of the root block, make
> sure the block number is updated correctly.

Yep, looks fine.

> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Reviewed-by: Ben Myers <bpm@sgi.com>
* Re: [PATCH 3/3] xfs: ensure btree root split sets blkno correctly
  2013-06-13 19:16 ` Ben Myers
@ 2013-06-14  0:21 ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2013-06-14  0:21 UTC (permalink / raw)
To: Ben Myers; +Cc: xfs

On Thu, Jun 13, 2013 at 02:16:17PM -0500, Ben Myers wrote:
> On Wed, Jun 12, 2013 at 12:19:08PM +1000, Dave Chinner wrote:
> > Turns out that on a 1k filesystem, generic/234 accounts for one of
> > those two root splits, and that is somewhat of a smoking gun. In
> > fact, it's the same problem we saw in the directory/attr code where
> > headers are memcpy()d from one block to another without updating the
> > self describing metadata.
>
> It is very interesting that this area of code is exercised so
> infrequently.

Well known problem - I remember that we did xfstests code coverage
analysis way back in 2005 at SGI, and this was something that popped
out. The stats simply confirmed what the code coverage profiling was
telling us...

That's the reason I run 512 byte/1k block size testing all the time
with xfstests - it covers all of these code paths that 4k block size
doesn't cover.

> I remember seeing a paper that described a tool that would list
> codepaths that are exercised during a test run. Does that ring a
> bell? It seems like this might be worth looking into more generally.

You mean code coverage profiling?

FWIW, RH QA has run some xfstests code coverage analysis recently,
too, and it has mostly confirmed the same results that we got at SGI
years ago - we get roughly 70% code coverage of the kernel code from
a default 4k filesystem xfstests run.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2013-06-28 19:31 UTC | newest]

Thread overview: 31+ messages
2013-06-12  2:19 [PATCH 0/3] xfs: fixes for 3.10-rc6 Dave Chinner
2013-06-12  2:19 ` [PATCH 1/3] xfs: don't shutdown log recovery on validation errors Dave Chinner
2013-06-13  1:04   ` Ben Myers
2013-06-13  2:08     ` Dave Chinner
2013-06-13 22:09       ` Ben Myers
2013-06-14  0:13         ` Dave Chinner
2013-06-14 12:55           ` Mark Tinguely
2013-06-14 16:09             ` Ben Myers
2013-06-14 16:15               ` Eric Sandeen
2013-06-14 19:08                 ` Ben Myers
2013-06-14 19:18                   ` Eric Sandeen
2013-06-14 19:44                     ` Ben Myers
2013-06-14 19:54                       ` Eric Sandeen
2013-06-14 20:22                         ` Ben Myers
2013-06-28 18:54                           ` Dave Jones
2013-06-28 19:24                             ` Ben Myers
2013-06-28 19:28                               ` Dave Jones
2013-06-28 19:31                                 ` Ben Myers
2013-06-15  0:56                           ` Dave Chinner
2013-06-17 14:53                             ` Ben Myers
2013-06-18  1:22                               ` Dave Chinner
2013-06-14 16:17               ` Dave Jones
2013-06-14 16:31                 ` Ben Myers
2013-06-12  2:19 ` [PATCH 2/3] xfs: fix implicit padding in directory and attr CRC formats Dave Chinner
2013-06-13  0:58   ` Ben Myers
2013-06-13  1:40     ` Michael L. Semon
2013-06-13  2:27     ` Dave Chinner
2013-06-13 21:31       ` Ben Myers
2013-06-12  2:19 ` [PATCH 3/3] xfs: ensure btree root split sets blkno correctly Dave Chinner
2013-06-13 19:16   ` Ben Myers
2013-06-14  0:21     ` Dave Chinner