From: Ben Myers <bpm@sgi.com>
To: Eric Sandeen <sandeen@sandeen.net>
Cc: Dave Jones <davej@redhat.com>, xfs@oss.sgi.com
Subject: Re: [PATCH 1/3] xfs: don't shutdown log recovery on validation errors
Date: Fri, 14 Jun 2013 14:44:53 -0500
Message-ID: <20130614194453.GC20932@sgi.com>
In-Reply-To: <51BB6C7C.6050300@sandeen.net>

Hey Eric,

On Fri, Jun 14, 2013 at 02:18:20PM -0500, Eric Sandeen wrote:
> On 6/14/13 2:08 PM, Ben Myers wrote:
> > On Fri, Jun 14, 2013 at 11:15:41AM -0500, Eric Sandeen wrote:
> >> Ben, isn't it the case that the corruption would only happen if
> >> log replay failed for some reason (as has always been the case,
> >> verifier or not), but with the verifier in place, it kills replay
> >> even w/o other problems due to a logical problem with the
> >> (recently added) verifiers?
> > 
> > It seems like the verifier prevented corruption from hitting disk during
> > log replay.  
> 
> It detected an inconsistent *interim* state during replay, which is
> always made correct by log replay completion.  But it *stopped* that log
> replay completion.  And caused log replay to fail.  And mount to fail.
> This is *new* behavior, and bad.
> 
> As I understand it.
>
> > It is enforcing a partial replay up to the point where the
> > corruption occurred.  Now you should be able to zero the log and the
> > filesystem is not corrupted.
> > 
> >> IOW - this seems like an actual functional regression due to the
> >> addition of the verifier, and dchinner's patch gets us back
> >> to the almost-always-fine state we were in prior to the change.
> > 
> > Oh, the spin doctor is *in*!
> 
> This is not spin.
> 
> > This isn't a logical problem with the verifier, it's a logical problem
> > with log replay.  We need to find a way for recovery to know whether a
> > given transaction should be replayed.  Fixing that is nontrivial.
> 
> Right.
> 
> And it's been around for years.  The verifier now detects that
> interim state, and makes things *worse* than they would be had log
> replay been allowed to continue.
> 
> Fixing the interim state may be nontrivial; allowing log replay
> to continue to a consistent state as it always has *is* trivial,
> it's what's done in Dave's small patch.
>
> >> As we're at -rc6, it seems quite reasonable to me as a quick
> >> fix to just short-circuit it for now.
> > 
> > If we're talking about a short term fix, that's fine.  This should be
> > conditional on CONFIG_XFS_DEBUG and marked as such.
> > 
> > Long term, removing the verifiers is the wrong thing to do here.  We
> > need to fix the recovery bug and then remove this temporary workaround.  
> > 
> >> If you have time to analyze dave's metadump that's cool, but
> >> this seems like something that really needs to be addressed
> >> before 3.10 gets out the door.
> > 
> > If this really is a day one bug then it's been out the door almost
> > twenty years.  And you want to hurry now?  ;)
> 
> We seem to be talking past each other.
> 
> The corrupted interim state has been around for years.  Up until
> now, log replay completion left things in perfect state.
> 
> The verifier now *breaks replay* at that interim point.
> Were it allowed to continue, everything would be fine.
> 
> As things stand, it is not fine, and this is a recent change
> which Dave is trying to correct.
> 
> Leaving it in place will cause filesystems which were replaying
> logs just fine until recently to now fail with no good way out.

That is consistent with my understanding of the problem...

Unfortunately log replay is broken.  The verifier has detected this and stopped
replay.  Ideally the solution would be to fix log replay, but that is going to
take some time.  So, in the near term we're just going to disable the verifier
to allow replay to complete.

I'm suggesting that this disabling be done conditionally on CONFIG_XFS_DEBUG so
that developers still have a chance at hitting the log replay problem, and a
comment should be added explaining that we've disabled the verifier due to a
specific bug as a temporary workaround and we'll re-enable the verifier once
it's fixed.  I'll update the patch and repost.
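
As a rough illustration of the conditional workaround described above, here is
a minimal userspace sketch.  Every name in it (demo_buf, demo_verify,
demo_recovery_write, and DEMO_XFS_DEBUG standing in for CONFIG_XFS_DEBUG) is
hypothetical; it is not the XFS code or the actual patch, only the shape of
the idea: debug builds keep verifying during recovery, other builds skip the
check so replay can run to completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from the XFS source. */
#define DEMO_MAGIC 0x1234   /* arbitrary value for the sketch */

struct demo_buf {
	int  magic;                                 /* pretend on-disk magic */
	bool (*verify)(const struct demo_buf *bp);  /* write verifier, may be NULL */
};

/* Pretend verifier: the buffer is "valid" only if the magic matches. */
bool demo_verify(const struct demo_buf *bp)
{
	return bp->magic == DEMO_MAGIC;
}

/*
 * TEMPORARY WORKAROUND (sketch): log recovery can write buffers that are
 * in an inconsistent interim state which later replay makes consistent.
 * Until recovery itself is fixed, only debug builds run the verifier
 * here, so developers can still hit the problem.  Remove this once the
 * recovery bug is fixed.
 *
 * Returns 0 if the write may proceed, -1 if replay should stop.
 */
int demo_recovery_write(struct demo_buf *bp)
{
	assert(bp != NULL);
#ifdef DEMO_XFS_DEBUG
	if (bp->verify && !bp->verify(bp))
		return -1;      /* debug build: stop replay on a bad buffer */
#else
	(void)bp;               /* non-debug build: verifier skipped */
#endif
	return 0;
}
```

Built without -DDEMO_XFS_DEBUG, demo_recovery_write() lets a buffer through
even when demo_verify() would reject it; built with it, the same buffer stops
the (pretend) replay.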

Are you guys arguing that the log replay bug should not be fixed?

-Ben
