public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Kamal Dasu <kdasu.kdev@gmail.com>
Cc: xfs@oss.sgi.com
Subject: Re: xfs filesystem corruption with kernel 2.6.37
Date: Fri, 2 Nov 2012 12:27:28 +1100
Message-ID: <20121102012728.GT29378@dastard>
In-Reply-To: <34630253.post@talk.nabble.com>

On Thu, Nov 01, 2012 at 12:30:13PM -0700, Kamal Dasu wrote:
> 
> Dave,
> 
> Thanks for your reply.
> 
> I am trying to act on the hints you gave me but I still have a few
> questions.
> 
> On Thu, Oct 25, 2012 at 6:47 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Oct 25, 2012 at 09:45:10AM -0400, Kamal Dasu wrote:
> >> with  "CONFIG_XFS_DEBUG=y" I get the following assertion:
> >>
> >> Assertion failed: prev.br_state == XFS_EXT_NORM, file:
> >> fs/xfs/xfs_bmap.c, line: 5192
> >
> > Yup, that's a pretty clear indication of a corrupted extent record.
> >
> 
> What is the best way to prevent transactions that record bad
> extent lengths and block numbers?

That should never occur - there are already checks in place to
prevent that. However, the log must be treated as potentially
corrupt during recovery, so when freeing extents on recovered files
we might be walking corrupt extents. xfs_bunmapi() is the place that
should be checking that the extent being freed is of a sane length,
just as xfs_bmapi() checks that the extent being allocated is of a
sane length.
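
For illustration, a length check of the kind described above might look
like the sketch below. It is not the actual xfs_bunmapi() code: struct
extent_rec, extent_is_sane() and the fs_blocks and max_len parameters
are hypothetical names chosen for this example, and in a real
implementation the limits would come from the superblock and the
on-disk format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an in-core extent record. */
struct extent_rec {
	uint64_t start_block;	/* first filesystem block of the extent */
	uint64_t block_count;	/* extent length in filesystem blocks */
};

/*
 * Return true only if the extent is non-empty, no longer than max_len
 * blocks, and lies entirely inside a filesystem of fs_blocks blocks.
 * Both limits are assumptions supplied by the caller.
 */
bool extent_is_sane(const struct extent_rec *ext,
		    uint64_t fs_blocks, uint64_t max_len)
{
	if (ext->block_count == 0 || ext->block_count > max_len)
		return false;
	if (ext->start_block >= fs_blocks)
		return false;
	/* Subtract instead of adding so start + count cannot overflow. */
	if (ext->block_count > fs_blocks - ext->start_block)
		return false;
	return true;
}
```

A caller that gets false back would treat the extent as corrupt and
shut the filesystem down rather than free the blocks.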


> > That's the open, unlinked file at the time the system crashed. That
> > may be where your problems are coming from. The RT (realtime) code
> > is mostly untested, and we sure as anything don't do any crash
> > resiliency or recovery testing on it, so there's a good chance
> > there are bugs in it that might show up in situations like this...
> >
> > You need to detect extents with invalid lengths in them and trigger
> > a corruption-based filesystem shutdown.
> 
> I looked at the log from one of the filesystem shutdowns where the
> I/O error occurs. Is this an indication of an already corrupted log
> due to corrupted in-memory metadata structures?
> ===
> attempt to access beyond end of device
> sda2: rw=0, want=33792081130943048, limit=31471329
> I/O error in filesystem ("sda2") meta-data dev sda2 block
> 0x780db80007f240       ("xfs_trans_read_buf") error 5 buf count 4096
> xfs_force_shutdown(sda2,0x1) called from line 395 of file
> fs/xfs/xfs_trans_buf.c.  Return address = 0x801f4f88
> Filesystem "sda2": I/O Error Detected.  Shutting down filesystem: sda2
> Please umount the filesystem, and rectify the problem(s)

The I/O error is what triggered the shutdown. The transaction tried
to read a metadata block beyond EOF.
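
The numbers in the quoted message are consistent with that reading.
Assuming the reported block is a 512-byte sector address, adding the 8
sectors of the 4096-byte buffer gives exactly the "want" value, which
is far beyond the 31471329-sector limit. The sketch below reproduces
the arithmetic and splits the address into its upper and lower 32 bits;
the helper names are made up for this example.

```c
#include <stdint.h>

/* Assumption: the block layer message counts in 512-byte sectors. */
#define SECTOR_SIZE 512u

/* First sector past the end of a read of `bytes` bytes at `daddr`. */
uint64_t read_end_sector(uint64_t daddr, uint32_t bytes)
{
	return daddr + bytes / SECTOR_SIZE;
}

/* Upper 32 bits of a 64-bit sector address. */
uint32_t upper_bits(uint64_t daddr)
{
	return (uint32_t)(daddr >> 32);
}

/* Lower 32 bits of a 64-bit sector address. */
uint32_t lower_bits(uint64_t daddr)
{
	return (uint32_t)(daddr & 0xffffffffu);
}
```

For block 0x780db80007f240, upper_bits() is 0x780db8 and lower_bits()
is 0x7f240 (520768), which is below the device limit. That fits the
stray-upper-bits pattern noted earlier in the thread, though it does
not by itself show where the bits came from.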

> However, the log is already corrupted. So is there a check on writes
> to the log?

No, the above check caught the corruption as soon as it was found.
You need to walk back from this event to find where the corruption
was introduced.

> >> also if there is something that can be done to avoid this situation in
> >> the first place.
> >
> > Track down where those stray upper bits in the block numbers are
> > coming from, and you'll have your answer.
> >
> 
> I have not been able to track this down yet. But could it be memory
> corruption that causes the in-memory metadata to become corrupted?

Yes, that is a possible cause that led to a bad block number being
written to disk.

> On a similar occurrence of this issue, recovery after a reboot
> always seems to go through the evict path:
> 
> Filesystem "sda2": XFS internal error xfs_trans_cancel at line 1815
> of file fs/xfs/xfs_trans.c.  Caller 0x801f8524
> 
> Call Trace:
> [<80439d2c>] dump_stack+0x8/0x34
> [<801f3bec>] xfs_trans_cancel+0x10c/0x128
> [<801f8524>] xfs_inactive+0x2fc/0x450
> [<800dcd54>] evict+0x28/0xd0
> [<800dd300>] iput+0x19c/0x2d8
> [<801e5bcc>] xlog_recover_process_one_iunlink+0xec/0x130
> [<801e7b60>] xlog_recover_process_iunlinks.clone.25+0xa8/0x108
> [<801eb360>] xlog_recover_finish+0x40/0x100
> [<801eedd8>] xfs_mountfs+0x434/0x654

That's where it is processing files that were unlinked but still
referenced at the time of the crash. We already know that these are
the corrupted files...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
