public inbox for linux-xfs@vger.kernel.org
* inconsistent xfs log record
@ 2008-04-08  1:54 Michael Nishimoto
  2008-04-08 15:50 ` David Chinner
  2008-04-09  4:38 ` Timothy Shimmin
  0 siblings, 2 replies; 6+ messages in thread
From: Michael Nishimoto @ 2008-04-08  1:54 UTC (permalink / raw)
  To: XFS Mailing List

I've just finished analyzing an xfs filesystem which won't recover.
An inconsistent log record has 332 log operations but the num_logop field
in the record header says 333 log operations.  The result is that xfs recovery
complains with "bad clientid" because recovery eventually attempts to decode
garbage.

The log record really has 332 log ops (I counted!).

Looking through xlog_write(), I don't see any way that record_cnt can be bumped
without also writing out a log operation.

Does this issue ring a bell with anyone?

    Michael
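
To illustrate the failure mode described above, here is a minimal sketch (not
code from this thread or from the kernel) of what happens when a record header
claims one more log op than the record holds. It assumes the 12-byte op header
layout (big-endian tid, length, one-byte client id, one-byte flags, two reserved
bytes) and that only the client ids 0x69 (XFS_TRANSACTION) and 0xaa (XFS_LOG)
are accepted. The helper names count_log_ops and make_op are hypothetical, and
real recovery also unpacks cycle data and handles continued ops, which this
sketch ignores.

```python
import struct

# Assumed op header layout: be32 tid, be32 len, u8 clientid, u8 flags, be16 reserved.
OP_HEADER = struct.Struct(">IIBBH")
# Assumed set of client ids that recovery accepts (XFS_TRANSACTION, XFS_LOG).
VALID_CLIENT_IDS = {0x69, 0xAA}


def make_op(tid, data, clientid=0x69, flags=0):
    """Hypothetical helper: build one op header followed by its data."""
    return OP_HEADER.pack(tid, len(data), clientid, flags, 0) + data


def count_log_ops(payload, num_logops):
    """Walk up to num_logops op headers; return how many decode cleanly."""
    offset = 0
    for seen in range(num_logops):
        if offset + OP_HEADER.size > len(payload):
            return seen  # ran off the end of the record
        _tid, length, clientid, _flags, _res = OP_HEADER.unpack_from(payload, offset)
        if clientid not in VALID_CLIENT_IDS:
            return seen  # the point where recovery would report "bad clientid"
        offset += OP_HEADER.size + length
    return num_logops


# 332 well-formed ops followed by stale bytes, with a header claiming 333.
payload = b"".join(make_op(i, b"data") for i in range(332)) + b"\xde\xad\xbe\xef" * 8
print(count_log_ops(payload, 333))
```

With the count taken from the header, the walker steps one op too far and
interprets whatever follows the last real op as an op header, which is
consistent with the "bad clientid" symptom.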


* Re: inconsistent xfs log record
  2008-04-08  1:54 inconsistent xfs log record Michael Nishimoto
@ 2008-04-08 15:50 ` David Chinner
  2008-04-08 16:14   ` Michael Nishimoto
  2008-04-08 21:36   ` Michael Nishimoto
  2008-04-09  4:38 ` Timothy Shimmin
  1 sibling, 2 replies; 6+ messages in thread
From: David Chinner @ 2008-04-08 15:50 UTC (permalink / raw)
  To: Michael Nishimoto; +Cc: XFS Mailing List

On Mon, Apr 07, 2008 at 06:54:21PM -0700, Michael Nishimoto wrote:
> I've just finished analyzing an xfs filesystem which won't recover.
> An inconsistent log record has 332 log operations but the num_logop field
> in the record header says 333 log operations.  The result is that xfs
> recovery complains with "bad clientid" because recovery eventually attempts
> to decode garbage.
>
> The log record really has 332 log ops (I counted!).
>
> Looking through xlog_write(), I don't see any way that record_cnt can be
> bumped without also writing out a log operation.

Yeah, I remember going through this a while back tracking down the same
error on snapshot images (was a freeze problem) and I couldn't see how
it would happen, either.

Still, it's a single bit error so that's always suspicious - can you
reproduce this error reliably?

> Does this issue ring a bell with anyone?

FWIW, I have had 2-3 failures with a "bad clientid" on a 64k page size ia64
box since I switched from 16k page size about a month ago. I haven't seen any
consistent pattern to the failure yet, nor had a chance to perform any
sort of triage on the problem so I can't say whether I'm seeing the same
issue...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: inconsistent xfs log record
  2008-04-08 15:50 ` David Chinner
@ 2008-04-08 16:14   ` Michael Nishimoto
  2008-04-08 21:36   ` Michael Nishimoto
  1 sibling, 0 replies; 6+ messages in thread
From: Michael Nishimoto @ 2008-04-08 16:14 UTC (permalink / raw)
  To: David Chinner; +Cc: XFS Mailing List

David Chinner wrote:
> On Mon, Apr 07, 2008 at 06:54:21PM -0700, Michael Nishimoto wrote:
>  > I've just finished analyzing an xfs filesystem which won't recover.
>  > An inconsistent log record has 332 log operations but the num_logop field
>  > in the record header says 333 log operations.  The result is that xfs
>  > recovery complains with "bad clientid" because recovery eventually
>  > attempts to decode garbage.
>  >
>  > The log record really has 332 log ops (I counted!).
>  >
>  > Looking through xlog_write(), I don't see any way that record_cnt can be
>  > bumped without also writing out a log operation.
>
> Yeah, I remember going through this a while back tracking down the same
> error on snapshot images (was a freeze problem) and I couldn't see how
> it would happen, either.
> 
> Still, it's a single bit error so that's always suspicious - can you
> reproduce this error reliably?

We haven't tried doing this yet, but I doubt we will because the test that
found this problem is not unusual.  We just pulled power while a lot of
activity was present.

A single bit, but also off-by-one. :-)
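
As a quick check of the two readings above (an illustration, not part of the
original analysis): 332 and 333 differ by exactly one, and they also differ in
exactly one bit, the lowest one. The helper name differing_bits is hypothetical.

```python
def differing_bits(a, b):
    """Count the bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


actual, recorded = 332, 333
print(bin(actual), bin(recorded))
print(differing_bits(actual, recorded))  # 1: a single flipped bit
print(recorded - actual)                 # 1: also a plain off-by-one
```

So the count alone cannot distinguish a flipped bit from a counting error in
the code; both explanations fit these two numbers.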

> 
>  > Does this issue ring a bell with anyone?
> 
> FWIW, I have had 2-3 failures with a "bad clientid" on a 64k page size ia64
> box since I switched from 16k page size about a month ago. I haven't seen
> any consistent pattern to the failure yet, nor had a chance to perform any
> sort of triage on the problem so I can't say whether I'm seeing the same
> issue...
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
> 


* Re: inconsistent xfs log record
  2008-04-08 15:50 ` David Chinner
  2008-04-08 16:14   ` Michael Nishimoto
@ 2008-04-08 21:36   ` Michael Nishimoto
  2008-04-09  2:30     ` David Chinner
  1 sibling, 1 reply; 6+ messages in thread
From: Michael Nishimoto @ 2008-04-08 21:36 UTC (permalink / raw)
  To: David Chinner; +Cc: XFS Mailing List


David Chinner wrote:
> On Mon, Apr 07, 2008 at 06:54:21PM -0700, Michael Nishimoto wrote:
>  > I've just finished analyzing an xfs filesystem which won't recover.
>  > An inconsistent log record has 332 log operations but the num_logop field
>  > in the record header says 333 log operations.  The result is that xfs
>  > recovery complains with "bad clientid" because recovery eventually
>  > attempts to decode garbage.
>  >
>  > The log record really has 332 log ops (I counted!).
>  >
>  > Looking through xlog_write(), I don't see any way that record_cnt can be
>  > bumped without also writing out a log operation.
>
> Yeah, I remember going through this a while back tracking down the same
> error on snapshot images (was a freeze problem) and I couldn't see how
> it would happen, either.
> 
> Still, it's a single bit error so that's always suspicious - can you
> reproduce this error reliably?
> 
>  > Does this issue ring a bell with anyone?
> 
> FWIW, I have had 2-3 failures with a "bad clientid" on a 64k page size ia64
> box since I switched from 16k page size about a month ago. I haven't seen
> any consistent pattern to the failure yet, nor had a chance to perform any
> sort of triage on the problem so I can't say whether I'm seeing the same
> issue...
> 
> Cheers,
> 
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

When you saw the problem, did you also have an off-by-one or one-bit difference
between num_logops and the real count?

     Michael


* Re: inconsistent xfs log record
  2008-04-08 21:36   ` Michael Nishimoto
@ 2008-04-09  2:30     ` David Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: David Chinner @ 2008-04-09  2:30 UTC (permalink / raw)
  To: Michael Nishimoto; +Cc: XFS Mailing List

On Tue, Apr 08, 2008 at 02:36:24PM -0700, Michael Nishimoto wrote:
> David Chinner wrote:
> >On Mon, Apr 07, 2008 at 06:54:21PM -0700, Michael Nishimoto wrote:
> > > I've just finished analyzing an xfs filesystem which won't recover.
> > > An inconsistent log record has 332 log operations but the num_logop
> > > field in the record header says 333 log operations.  The result is
> > > that xfs recovery complains with "bad clientid" because recovery
> > > eventually attempts to decode garbage.
> > >
> > > The log record really has 332 log ops (I counted!).
.....
> >FWIW, I have had 2-3 failures with a "bad clientid" on a 64k page size ia64
> >box since I switched from 16k page size about a month ago. I haven't seen
> >any consistent pattern to the failure yet, nor had a chance to perform any
> >sort of triage on the problem so I can't say whether I'm seeing the same
> >issue...
> 
> When you saw the problem, did you also have an off-by-one or one-bit
> difference between num_logops and the real count?

No idea - I didn't triage it because I'd just switched over to 64k page size
and had about 10 new QA failures to catalogue and record. Going back to
the bug I originally raised, I see that there was a reproducible case that
produces the error:

$ sudo XFS_MKFS_OPTIONS="-s size=1024" ./check 139

i.e. sector size of 1k on a 64k page machine. However, that's as far as
I got and I haven't revisited it yet so I can't say if there's any
real correlation or not to what you've seen. It does, however, point
out that there is a problem there somewhere...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


* Re: inconsistent xfs log record
  2008-04-08  1:54 inconsistent xfs log record Michael Nishimoto
  2008-04-08 15:50 ` David Chinner
@ 2008-04-09  4:38 ` Timothy Shimmin
  1 sibling, 0 replies; 6+ messages in thread
From: Timothy Shimmin @ 2008-04-09  4:38 UTC (permalink / raw)
  To: Michael Nishimoto; +Cc: XFS Mailing List

Michael Nishimoto wrote:
> I've just finished analyzing an xfs filesystem which won't recover.
> An inconsistent log record has 332 log operations but the num_logop field
> in the record header says 333 log operations.  The result is that xfs
> recovery complains with "bad clientid" because recovery eventually attempts
> to decode garbage.
>
> The log record really has 332 log ops (I counted!).
>
> Looking through xlog_write(), I don't see any way that record_cnt can be
> bumped without also writing out a log operation.
> 
> Does this issue ring a bell with anyone?
> 
>    Michael
> 

I had a bit of a look at bugs other than the snapshot one...
nothing really helpful.

I've seen a few "bad clientid" errors but that, as you say, just reflects that
at some point we have crap in the log op header which we
notice when doing recovery.
I had one (pv#945899) where it seemed to have got the head of the log wrong -
you could see, using "xfs_logprint -d", that it didn't match at the change
of cycle#s.
Yours appears different.
I also had another one (pv#971596) but I didn't narrow it down to the
wrong number of log ops; maybe I wasn't looking carefully enough at the time.
Okay, for that one there were 2 bugs in one, one for bad clientid and
one for bad transaction - for the bad transaction,
there was something like a 2nd start op without an intervening commit op
for the tid - I moved on to something else before getting anywhere further.
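
The "2nd start op without an intervening commit op" pattern can be sketched as
a small check over decoded ops. This is an illustration under assumptions, not
the recovery code: it takes a hypothetical list of (tid, flags) pairs already
extracted from the log, and it assumes the flag values 0x01 for a start op and
0x02 for a commit op. The function name find_repeated_starts is made up for
this example.

```python
XLOG_START_TRANS = 0x01   # assumed flag value for a transaction start op
XLOG_COMMIT_TRANS = 0x02  # assumed flag value for a transaction commit op


def find_repeated_starts(ops):
    """Return tids that see a start op while an earlier start is still open.

    ops is an iterable of (tid, flags) pairs in log order.
    """
    open_tids = set()
    repeated = []
    for tid, flags in ops:
        if flags & XLOG_START_TRANS:
            if tid in open_tids:
                repeated.append(tid)  # second start, no commit in between
            open_tids.add(tid)
        if flags & XLOG_COMMIT_TRANS:
            open_tids.discard(tid)
    return repeated


ops = [
    (7, XLOG_START_TRANS),
    (7, 0),
    (7, XLOG_START_TRANS),   # start repeated before any commit for tid 7
    (7, XLOG_COMMIT_TRANS),
    (8, XLOG_START_TRANS),
    (8, XLOG_COMMIT_TRANS),
]
print(find_repeated_starts(ops))
```

A tid that is started, committed, and later started again is not flagged; only
a start that arrives while the same tid is still open is reported.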


--Tim

