Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)

public inbox for linux-mm@kvack.org

From: Dave Chinner <david@fromorbit.com>
To: Janos Haar <janos.haar@netcenter.hu>
Cc: xiyou.wangcong@gmail.com, linux-kernel@vger.kernel.org,
	kamezawa.hiroyu@jp.fujitsu.com, linux-mm@kvack.org,
	xfs@oss.sgi.com, axboe@kernel.dk
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
Date: Wed, 14 Apr 2010 10:16:15 +1000
Message-ID: <20100414001615.GC2493@dastard>
In-Reply-To: <1cd501cadb62$3a93e790$0400a8c0@dcccs>

On Wed, Apr 14, 2010 at 01:36:56AM +0200, Janos Haar wrote:
> ----- Original Message ----- From: "Dave Chinner"
> >On Tue, Apr 13, 2010 at 11:23:36AM +0200, Janos Haar wrote:
> >>>If you run:
> >>>
> >>>$ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
> >>>
> >>>Then I can confirm whether there is corruption on disk or not.
> >>>Probably best to sample multiple of the inode numbers from the above
> >>>list of bad inodes.
> >>
> >>Here is the log:
> >>http://download.netcenter.hu/bughunt/20100413/debug.log
> >
> >There are multiple fields in the inode that are corrupted.
> >I am really surprised that xfs_repair - even an old version - is not
> >picking up the corruption....
> 
> I think I now know the reason....
> My case is starting to turn more and more interesting.
> 
> (Just a little note as a reminder: Tuesday night, I ran the old
> 2.8.11 xfs_repair on the partition which was reported as corrupt by
> the kernel, but it was clean.
> The system was not restarted!)
> 
> As you suggested, today I tried to make a backup of the data.
> During the copy, the kernel reported a lot of corrupted entries
> again, and finally the kernel crashed! (with the 19 patch pack)
> Unfortunately the kernel couldn't write the debug info into the syslog.
> The system restarted automatically, the service is running again, and I
> can't make another backup attempt because of pressure from the owner.
> Tonight, when the traffic was in its low period, I stopped
> the service, unmounted the partition, and repeated the xfs_repair on the
> previously reported partition in several ways.
> 
> Here you can see the results:
> xfs_repair 2.8.11 run #1:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr1.log

So this successfully detected and repaired the corruption.  I don't
think this is new corruption - the corrupted inode numbers are the
same as you reported a few days back.

> xfs_repair 2.8.11 run #2:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr2.log
> 
> echo 3 >/proc/sys/vm/drop_caches - performed
> 
> xfs_repair 2.8.11 run #3:
> http://download.netcenter.hu/bughunt/20100413/repair2811-nr3.log

These two are clearing lost+found and rediscovering the
disconnected inodes that were discovered in the first pass. Nothing
wrong here, that's just the way older repair versions behaved.

> xfs_repair 3.1.1 run #1:
> http://download.netcenter.hu/bughunt/20100413/repair311-nr1.log

And this detected nothing wrong, either.

> To me, it looks like the FS got corrupted between Tuesday night
> and tonight.
> Note: because I am expecting kernel crashes, the dirty data flush
> timeout was set to only a few milliseconds to prevent too much data
> loss.
> There was one kernel crash in this period, but XFS has a journal,
> and should have been cleaned up correctly. (I don't think this is the problem)
> 
> The other interesting thing is, why does only this partition get
> corrupted? (again, and again?)

Can you reproduce the corruption again now that the filesystem has
been repaired? I want to know (if the corruption appears again)
whether it appears in the same location as this one.
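
If the corruption does come back, the read-only xfs_db check quoted
near the top of this message can be repeated over several inode
numbers so the dumps can be compared against the earlier ones. A
minimal sketch, assuming a POSIX shell; the helper name sample_inodes
and the XFS_DB override are hypothetical, and 474253940 is the only
inode number taken from this thread:

```shell
#!/bin/sh
# Hypothetical helper: dump several inodes read-only with xfs_db.
# XFS_DB may be overridden (e.g. XFS_DB=echo) to do a dry run that
# only prints the arguments that would be passed.
sample_inodes() {
    dev="$1"
    shift
    for ino in "$@"; do
        echo "=== inode $ino ==="
        # -r opens the device read-only; "p" prints the selected inode.
        "${XFS_DB:-xfs_db}" -r -c "inode $ino" -c p "$dev" || return 1
    done
}
```

Usage would look like "sample_inodes /dev/sdb2 474253940" followed by
any other inode numbers from the list of bad inodes, with the output
redirected to a file for later comparison.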

> >>I mean, if I am right, a hw memory problem causes only 1-2 bit
> >>corruption, and a sw page handling problem produces bad
> >>memory pages, no?
> >
> >RAM ECC guarantees correction of single bit errors and detection of
> >double bit errors (which cause the kernel to panic, IIRC). I can't
> >tell you what happens when larger errors occur, though...
> 
> Yes, but this system has non-ECC RAM unfortunately.

If your hardware doesn't have ECC, then you can't rule out anything
- even a dodgy power supply can cause this sort of transient
problem. I'm not saying that this is the cause, but I've been
assuming that you're actually running hardware with ECC on RAM,
caches, buses, etc.

> This makes me think this is a sw problem, and not simple memory
> corruption, or the corruption appears only for a short time in
> the hw.

If you can take the performance hit, turn on the kernel memory leak
detector and see if that catches anything.
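
As a rough sketch of what that involves, assuming the detector meant
here is kmemleak (a kernel built with CONFIG_DEBUG_KMEMLEAK=y) and
that debugfs is mounted at /sys/kernel/debug: writing "scan" to the
kmemleak file triggers an immediate scan, and reading the file lists
any suspected leaks. The helper name kmemleak_report and the KMEMLEAK
path override are hypothetical, and it needs to run as root:

```shell
#!/bin/sh
# Assumption: kmemleak is enabled (CONFIG_DEBUG_KMEMLEAK=y) and debugfs
# is mounted at /sys/kernel/debug. KMEMLEAK is a hypothetical override
# for the path of the kmemleak control file.
KMEMLEAK="${KMEMLEAK:-/sys/kernel/debug/kmemleak}"

# Trigger an immediate scan, then print whatever kmemleak has recorded.
kmemleak_report() {
    if [ ! -e "$KMEMLEAK" ]; then
        echo "kmemleak interface not found at $KMEMLEAK" >&2
        return 1
    fi
    echo scan > "$KMEMLEAK" && cat "$KMEMLEAK"
}
```

If the file is missing, debugfs may simply not be mounted yet
(mount -t debugfs nodev /sys/kernel/debug).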

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
