From: "Janos Haar" <janos.haar@netcenter.hu>
To: Dave Chinner <david@fromorbit.com>
Cc: axboe@kernel.dk, linux-kernel@vger.kernel.org, xfs@oss.sgi.com,
linux-mm@kvack.org, xiyou.wangcong@gmail.com,
kamezawa.hiroyu@jp.fujitsu.com
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
Date: Tue, 13 Apr 2010 11:23:36 +0200 [thread overview]
Message-ID: <190201cadaeb$02ec22c0$0400a8c0@dcccs> (raw)
In-Reply-To: 20100413083931.GW2493@dastard
----- Original Message -----
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>;
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>;
<axboe@kernel.dk>
Sent: Tuesday, April 13, 2010 10:39 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look
please!...)
> On Tue, Apr 13, 2010 at 10:00:17AM +0200, Janos Haar wrote:
>> >On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
>> >>Hi,
>> >>
>> >>Ok, here comes the funny part:
>> >>I have got several messages from the kernel about one of my XFS
>> >>(sdb2) have corrupted inodes, but my xfs_repair (v. 2.8.11) says the
>> >>FS is clean and shine.
>> >>Should i upgrade my xfs_repair, or this is another bug? :-)
>> >
>> >v2.8.11 is positively ancient. :/
>> >
>> >I'd upgrade (current is 3.1.1) and re-run repair again.
>>
>> OK, i will get the new repair today.
>>
>> btw
>> Since i tested the FS with the 2.8.11, today morning i found this in
>> the log:
>>
>> ...
>> Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2 # This
>> was the point of check with xfs_repair v2.8.11
>> Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
>> Apr 13 03:08:33 alfa kernel: dir: inode 474253931
>> Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error
>> xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c. Caller
>> 0xffffffff811c4fa6
>
> A corrupted directory. There have been several different types of
> directory corruption that 2.8.11 didn't detect that 3.1.1 does.
>
>> The entire log is here:
>> http://download.netcenter.hu/bughunt/20100413/messages
>
> So the bad inodes are:
>
> $ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages |
> sort -n -u
> 474253931
> 474253936
> 474253937
> 474253938
> 474253939
> 474253940
> 474253941
> 474253943
> 474253945
> 474253946
> 474253947
> 474253948
> 474253949
> 474253950
> 474253951
> 673160704
> 673160708
> 673160712
> 673160713
>
> It looks like the bad inodes are confined to two inode clusters. The
> nature of the errors - bad block mappings and bad extent counts -
> makes me think you might have bad memory in the machine:
>
> $ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
> 4d8000
> 5e0000
> 7f8001
> 8000
> 8001
> 10000
> 10001
> 20001
> 28001
> 38000
> 270001
> 370001
> 548001
> 568000
> 568001
> 600000
> 600001
> 618000
> 618001
> 628000
> 628001
> 650001
>
> I think they should all be 0 or 1, and:
>
> $ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }'
> messages | sort -n -u
> fffffffffd000001
> 6b000001
> 1000001
> 75000001
>
> I think they should all be 1, too.
>
> I've seen this sort of error pattern before on a machine that had a
> bad DIMM. If the corruption is on disk then the buffers were
> corrupted between the time that the CPU writes to them and being
> written to disk. If there is no corruption on disk, then the CPU is
> reading bad data from memory...
>
> If you run:
>
> $ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
>
> Then I can can confirm whether there is corruption on disk or not.
> Probably best to sample multiple of the inode numbers from the above
> list of bad inodes.
Here is the log:
http://download.netcenter.hu/bughunt/20100413/debug.log
The xfs_db does segmentation fault. :-)
Btw memory corruption:
In the beginnig of march, one of my bets was memory problem too, but the
server was offline for 7 days, and all the time runs the memtest86 on the
hw, and passed all the 8GB 74 times without any bit error.
I don't think it is memory problem, additionally the server can create big
size .tar.gz files without crc problem.
If i force my mind to think to hw memory problem, i can think only for the
raid card's cache memory, wich i can't test with memtest86.
Or the cache of the HDD's pcb...
In the other hand, i have seen more people reported memory corruption about
these kernel versions, can we check this and surely select wich is the
problem? (hw or sw)?
I mean, if i am right, the hw memory problem makes only 1-2 bit corruption
seriously, and the sw page handling problem makes bad memory pages, no?
>
> FWIW, I'd strongly suggest backing up everything you can first
> before running an updated xfs_repair....
Yes, i know that too. :-)
Thanks,
Janos
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
WARNING: multiple messages have this Message-ID (diff)
From: "Janos Haar" <janos.haar@netcenter.hu>
To: "Dave Chinner" <david@fromorbit.com>
Cc: <xiyou.wangcong@gmail.com>, <linux-kernel@vger.kernel.org>,
<kamezawa.hiroyu@jp.fujitsu.com>, <linux-mm@kvack.org>,
<xfs@oss.sgi.com>, <axboe@kernel.dk>
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
Date: Tue, 13 Apr 2010 11:23:36 +0200 [thread overview]
Message-ID: <190201cadaeb$02ec22c0$0400a8c0@dcccs> (raw)
In-Reply-To: 20100413083931.GW2493@dastard
----- Original Message -----
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>;
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>;
<axboe@kernel.dk>
Sent: Tuesday, April 13, 2010 10:39 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look
please!...)
> On Tue, Apr 13, 2010 at 10:00:17AM +0200, Janos Haar wrote:
>> >On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
>> >>Hi,
>> >>
>> >>Ok, here comes the funny part:
>> >>I have got several messages from the kernel about one of my XFS
>> >>(sdb2) have corrupted inodes, but my xfs_repair (v. 2.8.11) says the
>> >>FS is clean and shine.
>> >>Should i upgrade my xfs_repair, or this is another bug? :-)
>> >
>> >v2.8.11 is positively ancient. :/
>> >
>> >I'd upgrade (current is 3.1.1) and re-run repair again.
>>
>> OK, i will get the new repair today.
>>
>> btw
>> Since i tested the FS with the 2.8.11, today morning i found this in
>> the log:
>>
>> ...
>> Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2 # This
>> was the point of check with xfs_repair v2.8.11
>> Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
>> Apr 13 03:08:33 alfa kernel: dir: inode 474253931
>> Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error
>> xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c. Caller
>> 0xffffffff811c4fa6
>
> A corrupted directory. There have been several different types of
> directory corruption that 2.8.11 didn't detect that 3.1.1 does.
>
>> The entire log is here:
>> http://download.netcenter.hu/bughunt/20100413/messages
>
> So the bad inodes are:
>
> $ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages |
> sort -n -u
> 474253931
> 474253936
> 474253937
> 474253938
> 474253939
> 474253940
> 474253941
> 474253943
> 474253945
> 474253946
> 474253947
> 474253948
> 474253949
> 474253950
> 474253951
> 673160704
> 673160708
> 673160712
> 673160713
>
> It looks like the bad inodes are confined to two inode clusters. The
> nature of the errors - bad block mappings and bad extent counts -
> makes me think you might have bad memory in the machine:
>
> $ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
> 4d8000
> 5e0000
> 7f8001
> 8000
> 8001
> 10000
> 10001
> 20001
> 28001
> 38000
> 270001
> 370001
> 548001
> 568000
> 568001
> 600000
> 600001
> 618000
> 618001
> 628000
> 628001
> 650001
>
> I think they should all be 0 or 1, and:
>
> $ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }'
> messages | sort -n -u
> fffffffffd000001
> 6b000001
> 1000001
> 75000001
>
> I think they should all be 1, too.
>
> I've seen this sort of error pattern before on a machine that had a
> bad DIMM. If the corruption is on disk then the buffers were
> corrupted between the time that the CPU writes to them and being
> written to disk. If there is no corruption on disk, then the CPU is
> reading bad data from memory...
>
> If you run:
>
> $ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
>
> Then I can can confirm whether there is corruption on disk or not.
> Probably best to sample multiple of the inode numbers from the above
> list of bad inodes.
Here is the log:
http://download.netcenter.hu/bughunt/20100413/debug.log
The xfs_db does segmentation fault. :-)
Btw memory corruption:
In the beginnig of march, one of my bets was memory problem too, but the
server was offline for 7 days, and all the time runs the memtest86 on the
hw, and passed all the 8GB 74 times without any bit error.
I don't think it is memory problem, additionally the server can create big
size .tar.gz files without crc problem.
If i force my mind to think to hw memory problem, i can think only for the
raid card's cache memory, wich i can't test with memtest86.
Or the cache of the HDD's pcb...
In the other hand, i have seen more people reported memory corruption about
these kernel versions, can we check this and surely select wich is the
problem? (hw or sw)?
I mean, if i am right, the hw memory problem makes only 1-2 bit corruption
seriously, and the sw page handling problem makes bad memory pages, no?
>
> FWIW, I'd strongly suggest backing up everything you can first
> before running an updated xfs_repair....
Yes, i know that too. :-)
Thanks,
Janos
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
WARNING: multiple messages have this Message-ID (diff)
From: "Janos Haar" <janos.haar@netcenter.hu>
To: Dave Chinner <david@fromorbit.com>
Cc: xiyou.wangcong@gmail.com, linux-kernel@vger.kernel.org,
kamezawa.hiroyu@jp.fujitsu.com, linux-mm@kvack.org,
xfs@oss.sgi.com, axboe@kernel.dk
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...)
Date: Tue, 13 Apr 2010 11:23:36 +0200 [thread overview]
Message-ID: <190201cadaeb$02ec22c0$0400a8c0@dcccs> (raw)
In-Reply-To: 20100413083931.GW2493@dastard
----- Original Message -----
From: "Dave Chinner" <david@fromorbit.com>
To: "Janos Haar" <janos.haar@netcenter.hu>
Cc: <xiyou.wangcong@gmail.com>; <linux-kernel@vger.kernel.org>;
<kamezawa.hiroyu@jp.fujitsu.com>; <linux-mm@kvack.org>; <xfs@oss.sgi.com>;
<axboe@kernel.dk>
Sent: Tuesday, April 13, 2010 10:39 AM
Subject: Re: Kernel crash in xfs_iflush_cluster (was Somebody take a look
please!...)
> On Tue, Apr 13, 2010 at 10:00:17AM +0200, Janos Haar wrote:
>> >On Mon, Apr 12, 2010 at 12:44:37AM +0200, Janos Haar wrote:
>> >>Hi,
>> >>
>> >>Ok, here comes the funny part:
>> >>I have got several messages from the kernel about one of my XFS
>> >>(sdb2) have corrupted inodes, but my xfs_repair (v. 2.8.11) says the
>> >>FS is clean and shine.
>> >>Should i upgrade my xfs_repair, or this is another bug? :-)
>> >
>> >v2.8.11 is positively ancient. :/
>> >
>> >I'd upgrade (current is 3.1.1) and re-run repair again.
>>
>> OK, i will get the new repair today.
>>
>> btw
>> Since i tested the FS with the 2.8.11, today morning i found this in
>> the log:
>>
>> ...
>> Apr 12 00:41:10 alfa kernel: XFS mounting filesystem sdb2 # This
>> was the point of check with xfs_repair v2.8.11
>> Apr 13 03:08:33 alfa kernel: xfs_da_do_buf: bno 32768
>> Apr 13 03:08:33 alfa kernel: dir: inode 474253931
>> Apr 13 03:08:33 alfa kernel: Filesystem "sdb2": XFS internal error
>> xfs_da_do_buf(1) at line 2020 of file fs/xfs/xfs_da_btree.c. Caller
>> 0xffffffff811c4fa6
>
> A corrupted directory. There have been several different types of
> directory corruption that 2.8.11 didn't detect that 3.1.1 does.
>
>> The entire log is here:
>> http://download.netcenter.hu/bughunt/20100413/messages
>
> So the bad inodes are:
>
> $ awk '/corrupt inode/ { print $10 } /dir: inode/ { print $8 }' messages |
> sort -n -u
> 474253931
> 474253936
> 474253937
> 474253938
> 474253939
> 474253940
> 474253941
> 474253943
> 474253945
> 474253946
> 474253947
> 474253948
> 474253949
> 474253950
> 474253951
> 673160704
> 673160708
> 673160712
> 673160713
>
> It looks like the bad inodes are confined to two inode clusters. The
> nature of the errors - bad block mappings and bad extent counts -
> makes me think you might have bad memory in the machine:
>
> $ awk '/xfs_da_do_buf: bno/ { printf "%x\n", $8 }' messages | sort -n -u
> 4d8000
> 5e0000
> 7f8001
> 8000
> 8001
> 10000
> 10001
> 20001
> 28001
> 38000
> 270001
> 370001
> 548001
> 568000
> 568001
> 600000
> 600001
> 618000
> 618001
> 628000
> 628001
> 650001
>
> I think they should all be 0 or 1, and:
>
> $ awk '/corrupt inode/ { split($13, a, ")"); printf "%x\n", a[1] }'
> messages | sort -n -u
> fffffffffd000001
> 6b000001
> 1000001
> 75000001
>
> I think they should all be 1, too.
>
> I've seen this sort of error pattern before on a machine that had a
> bad DIMM. If the corruption is on disk then the buffers were
> corrupted between the time that the CPU writes to them and being
> written to disk. If there is no corruption on disk, then the CPU is
> reading bad data from memory...
>
> If you run:
>
> $ xfs_db -r -c "inode 474253940" -c p /dev/sdb2
>
> Then I can can confirm whether there is corruption on disk or not.
> Probably best to sample multiple of the inode numbers from the above
> list of bad inodes.
Here is the log:
http://download.netcenter.hu/bughunt/20100413/debug.log
The xfs_db does segmentation fault. :-)
Btw memory corruption:
In the beginnig of march, one of my bets was memory problem too, but the
server was offline for 7 days, and all the time runs the memtest86 on the
hw, and passed all the 8GB 74 times without any bit error.
I don't think it is memory problem, additionally the server can create big
size .tar.gz files without crc problem.
If i force my mind to think to hw memory problem, i can think only for the
raid card's cache memory, wich i can't test with memtest86.
Or the cache of the HDD's pcb...
In the other hand, i have seen more people reported memory corruption about
these kernel versions, can we check this and surely select wich is the
problem? (hw or sw)?
I mean, if i am right, the hw memory problem makes only 1-2 bit corruption
seriously, and the sw page handling problem makes bad memory pages, no?
>
> FWIW, I'd strongly suggest backing up everything you can first
> before running an updated xfs_repair....
Yes, i know that too. :-)
Thanks,
Janos
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2010-04-13 9:21 UTC|newest]
Thread overview: 93+ messages / expand[flat|nested] mbox.gz Atom feed top
2010-03-24 20:39 Somebody take a look please! (some kind of kernel bug?) Janos Haar
2010-03-25 3:29 ` Américo Wang
2010-03-25 3:29 ` Américo Wang
2010-03-25 6:31 ` KAMEZAWA Hiroyuki
2010-03-25 6:31 ` KAMEZAWA Hiroyuki
2010-03-25 8:54 ` Janos Haar
2010-03-25 8:54 ` Janos Haar
2010-04-01 10:01 ` Janos Haar
2010-04-01 10:01 ` Janos Haar
2010-04-01 10:37 ` Américo Wang
2010-04-01 10:37 ` Américo Wang
2010-04-01 10:37 ` Américo Wang
2010-04-02 22:07 ` Janos Haar
2010-04-02 22:07 ` Janos Haar
2010-04-02 22:07 ` Janos Haar
2010-04-02 23:09 ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Dave Chinner
2010-04-02 23:09 ` Dave Chinner
2010-04-02 23:09 ` Dave Chinner
2010-04-03 13:42 ` Janos Haar
2010-04-03 13:42 ` Janos Haar
2010-04-03 13:42 ` Janos Haar
2010-04-04 10:37 ` Dave Chinner
2010-04-04 10:37 ` Dave Chinner
2010-04-04 10:37 ` Dave Chinner
2010-04-05 18:17 ` Janos Haar
2010-04-05 18:17 ` Janos Haar
2010-04-05 18:17 ` Janos Haar
2010-04-05 22:45 ` Dave Chinner
2010-04-05 22:45 ` Dave Chinner
2010-04-05 22:45 ` Dave Chinner
2010-04-05 22:59 ` Janos Haar
2010-04-05 22:59 ` Janos Haar
2010-04-05 22:59 ` Janos Haar
2010-04-08 2:45 ` Janos Haar
2010-04-08 2:45 ` Janos Haar
2010-04-08 2:45 ` Janos Haar
2010-04-08 2:58 ` Dave Chinner
2010-04-08 2:58 ` Dave Chinner
2010-04-08 2:58 ` Dave Chinner
2010-04-08 11:21 ` Janos Haar
2010-04-08 11:21 ` Janos Haar
2010-04-08 11:21 ` Janos Haar
2010-04-09 21:37 ` Christian Kujau
2010-04-09 21:37 ` Christian Kujau
2010-04-09 21:37 ` Christian Kujau
2010-04-09 22:44 ` Janos Haar
2010-04-09 22:44 ` Janos Haar
2010-04-09 22:44 ` Janos Haar
2010-04-10 8:06 ` Américo Wang
2010-04-10 8:06 ` Américo Wang
2010-04-10 8:06 ` Américo Wang
2010-04-10 21:21 ` Kernel crash in xfs_iflush_cluster (was Somebody take a lookplease!...) Janos Haar
2010-04-10 21:21 ` Janos Haar
2010-04-10 21:21 ` Janos Haar
2010-04-10 21:15 ` Kernel crash in xfs_iflush_cluster (was Somebody take a look please!...) Janos Haar
2010-04-10 21:15 ` Janos Haar
2010-04-10 21:15 ` Janos Haar
2010-04-11 22:44 ` Janos Haar
2010-04-11 22:44 ` Janos Haar
2010-04-11 22:44 ` Janos Haar
2010-04-12 0:11 ` Dave Chinner
2010-04-12 0:11 ` Dave Chinner
2010-04-12 0:11 ` Dave Chinner
2010-04-13 8:00 ` Janos Haar
2010-04-13 8:00 ` Janos Haar
2010-04-13 8:00 ` Janos Haar
2010-04-13 8:39 ` Dave Chinner
2010-04-13 8:39 ` Dave Chinner
2010-04-13 8:39 ` Dave Chinner
2010-04-13 9:23 ` Janos Haar [this message]
2010-04-13 9:23 ` Janos Haar
2010-04-13 9:23 ` Janos Haar
2010-04-13 11:34 ` Dave Chinner
2010-04-13 11:34 ` Dave Chinner
2010-04-13 11:34 ` Dave Chinner
2010-04-13 23:36 ` Janos Haar
2010-04-13 23:36 ` Janos Haar
2010-04-13 23:36 ` Janos Haar
2010-04-14 0:16 ` Dave Chinner
2010-04-14 0:16 ` Dave Chinner
2010-04-14 0:16 ` Dave Chinner
2010-04-15 7:00 ` Janos Haar
2010-04-15 7:00 ` Janos Haar
2010-04-15 7:00 ` Janos Haar
2010-04-15 9:23 ` Dave Chinner
2010-04-15 9:23 ` Dave Chinner
2010-04-15 9:23 ` Dave Chinner
2010-04-15 10:23 ` Janos Haar
2010-04-15 10:23 ` Janos Haar
2010-04-15 10:23 ` Janos Haar
2010-04-16 8:01 ` Janos Haar
2010-04-16 8:01 ` Janos Haar
2010-04-16 8:01 ` Janos Haar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='190201cadaeb$02ec22c0$0400a8c0@dcccs' \
--to=janos.haar@netcenter.hu \
--cc=axboe@kernel.dk \
--cc=david@fromorbit.com \
--cc=kamezawa.hiroyu@jp.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=xfs@oss.sgi.com \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.