public inbox for linux-xfs@vger.kernel.org
* silent corruption after kernel panic?
@ 2011-09-19 12:28 Assarsson, Emil
  2011-09-19 14:27 ` Christoph Hellwig
From: Assarsson, Emil @ 2011-09-19 12:28 UTC
  To: xfs@oss.sgi.com

Hi,

We are running a 20TB XFS filesystem on top of LVM2 and SAN storage (HP
Open-V) with multipathd. Ubuntu Lucid. The disk write cache is enabled
and we use mount options rw.

This is a log of events written from memory, so I may have missed some
things :-P

The system panicked and automatically restarted after 30 seconds.

It seemed to be OK, but after a while users started getting files with
zero length. We ran xfs_check on the filesystem, but it found no
problems. After that we restarted the system, and the files (even the
ones that had been zero length) seemed OK again. But then we got these
messages (short version):
-----
Sep 16 06:40:34 seldlnx034 kernel: [54607.977261] XFS internal error
XFS_WANT_CORRUPTED_RETURN at line 381 of
file /build/buildd/linux-2.6.32/fs/xfs/xfs_alloc.c.  Caller
0xffffffffa01eed36
Sep 16 06:40:34 seldlnx034 kernel: [54607.996676]  [<ffffffffa0215383>]
xfs_error_report+0x43/0x50 [xfs]
Sep 16 06:40:34 seldlnx034 kernel: [54607.996689]
-----

... and files written during this period became corrupt (zero length).

We ran xfs_repair on the filesystem (short version):
-----
entry "fw-radmp_all.deb" at block 0 offset 944 in directory inode
157891962 references free inode 195983876
	clearing inode number in entry at offset 944...
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
bad hash table for directory inode 13786 (no data entry): rebuilding
rebuilding directory inode 13786
bad hash table for directory inode 2130829772 (no data entry):
rebuilding
rebuilding directory inode 2130829772
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done
------

We have now verified the files and have no known problems with the file
system, but the files created while the file system was broken had to
be recreated.



How can I avoid this in the future, and how can I make sure that I get
informed about a problem? Am I doing anything wrong with the setup that
you can see?

--
Emil Assarsson
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: silent corruption after kernel panic?
  2011-09-19 12:28 silent corruption after kernel panic? Assarsson, Emil
@ 2011-09-19 14:27 ` Christoph Hellwig
  2011-09-23 10:21   ` Assarsson, Emil
From: Christoph Hellwig @ 2011-09-19 14:27 UTC
  To: Assarsson, Emil; +Cc: xfs@oss.sgi.com

On Mon, Sep 19, 2011 at 02:28:23PM +0200, Assarsson, Emil wrote:
> Hi,
> 
> We are running a 20TB XFS filesystem on top of LVM2 and SAN storage (HP
> Open-V) with multipathd. Ubuntu Lucid. The disk write cache is enabled
> and we use mount options rw.


> Sep 16 06:40:34 seldlnx034 kernel: [54607.977261] XFS internal error
> XFS_WANT_CORRUPTED_RETURN at line 381 of
> file /build/buildd/linux-2.6.32/fs/xfs/xfs_alloc.c.  Caller
> 0xffffffffa01eed36
> Sep 16 06:40:34 seldlnx034 kernel: [54607.996676]  [<ffffffffa0215383>]
> xfs_error_report+0x43/0x50 [xfs]
> Sep 16 06:40:34 seldlnx034 kernel: [54607.996689]

This (corrupted allocation btrees) is a typical indication of missing
cache flushes.
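
For anyone who wants to be told when an error like the one quoted above
shows up, here is a minimal sketch in Python that picks such messages
out of kernel log text. It assumes the message wording seen in this
thread (other kernel versions may word it differently). The function
name find_xfs_errors and the SAMPLE string are made up for illustration.

```python
import re

# Pattern modelled on the log line quoted in this thread. Assumption:
# the wording may differ on other kernel versions.
XFS_ERROR_RE = re.compile(
    r"XFS internal error\s+(?P<tag>\S+)"
    r"\s+at line\s+(?P<line>\d+)"
    r"\s+of\s+file\s+(?P<file>\S+?)\."
    r"\s+Caller\s+(?P<caller>0x[0-9a-fA-F]+)"
)

# Sample input: the first log line from this thread, joined back together.
SAMPLE = (
    "Sep 16 06:40:34 seldlnx034 kernel: [54607.977261] XFS internal error "
    "XFS_WANT_CORRUPTED_RETURN at line 381 of "
    "file /build/buildd/linux-2.6.32/fs/xfs/xfs_alloc.c.  Caller "
    "0xffffffffa01eed36"
)


def find_xfs_errors(log_text):
    """Return one dict per XFS internal error message found in log_text."""
    errors = []
    for m in XFS_ERROR_RE.finditer(log_text):
        errors.append({
            "tag": m.group("tag"),          # e.g. XFS_WANT_CORRUPTED_RETURN
            "line": int(m.group("line")),   # source line that reported it
            "file": m.group("file"),        # kernel source file
            "caller": m.group("caller"),    # caller address as logged
        })
    return errors


if __name__ == "__main__":
    for err in find_xfs_errors(SAMPLE):
        print(err["tag"], err["file"], err["line"])
```

Something like this could be run periodically over the kernel log and
hooked up to mail or monitoring; how the log is read and how the alert
is delivered is left out here.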

Given that LVM/device mapper was not able to pass through cache flush
requests before ~2.6.35, that is your most likely culprit.  A repair
will rebuild the freespace btrees; after that, make sure to keep the
write caches disabled down the whole stack.
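
As a rough illustration of the version check implied above, here is a
small Python sketch that compares a kernel release string against the
approximate 2.6.35 cut-off. The threshold comes straight from the "~"
figure in this mail and is not exact. The function names and the
example release strings are hypothetical.

```python
import re

# Approximate cut-off taken from this mail ("before ~2.6.35"). Treat it
# as a rule of thumb, not an exact boundary.
FLUSH_PASSTHROUGH_MIN = (2, 6, 35)


def parse_kernel_release(release):
    """Extract the leading numeric version from e.g. '2.6.32-33-server'."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing third component (e.g. '5.15') is treated as 0.
    return tuple(int(g) if g is not None else 0 for g in m.groups())


def dm_may_drop_flushes(release):
    """True if the kernel is older than the approximate cut-off."""
    return parse_kernel_release(release) < FLUSH_PASSTHROUGH_MIN


if __name__ == "__main__":
    import platform
    # platform.release() returns the running kernel's release string.
    print(dm_may_drop_flushes(platform.release()))
```

On a kernel where this returns True and the filesystem sits on LVM or
device mapper, the safer assumption is that flushes do not reach the
disks, so the write caches should stay off.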


* Re: silent corruption after kernel panic?
  2011-09-19 14:27 ` Christoph Hellwig
@ 2011-09-23 10:21   ` Assarsson, Emil
From: Assarsson, Emil @ 2011-09-23 10:21 UTC
  To: hch@infradead.org; +Cc: xfs@oss.sgi.com

On Mon, 2011-09-19 at 10:27 -0400, Christoph Hellwig wrote:
> On Mon, Sep 19, 2011 at 02:28:23PM +0200, Assarsson, Emil wrote:
> > Hi,
> > 
> > We are running a 20TB XFS filesystem on top of LVM2 and SAN storage (HP
> > Open-V) with multipathd. Ubuntu Lucid. The disk write cache is enabled
> > and we use mount options rw.
> 
> 
> > Sep 16 06:40:34 seldlnx034 kernel: [54607.977261] XFS internal error
> > XFS_WANT_CORRUPTED_RETURN at line 381 of
> > file /build/buildd/linux-2.6.32/fs/xfs/xfs_alloc.c.  Caller
> > 0xffffffffa01eed36
> > Sep 16 06:40:34 seldlnx034 kernel: [54607.996676]  [<ffffffffa0215383>]
> > xfs_error_report+0x43/0x50 [xfs]
> > Sep 16 06:40:34 seldlnx034 kernel: [54607.996689]
> 
> This (corrupted allocation btrees) is a typical indication of missing
> cache flushes.
> 
> Given that LVM/device mapper was not able to pass through cache flush
> requests before ~2.6.35, that is your most likely culprit.  A repair
> will rebuild the freespace btrees; after that, make sure to keep the
> write caches disabled down the whole stack.

Thanks for your help, Christoph. I guess you are right. Some of our
systems had the write cache enabled and used Device Mapper. We have
disabled the cache.

We ran into some new, possibly related, problems and were forced to
clear the log. We decided to move the data to a fresh file system. We
will use xfsdump/xfsrestore.

--
Emil Assarsson