From: Andrew Morton <andrewm@uow.edu.au>
To: Alexander Viro <viro@math.psu.edu>
Cc: Andries.Brouwer@cwi.nl, torvalds@transmeta.com,
linux-kernel@vger.kernel.org, tigran@veritas.com,
"Stephen C. Tweedie" <sct@redhat.com>,
Lawrence Walton <lawrence@the-penguin.otak.com>
Subject: Re: corruption
Date: Fri, 01 Dec 2000 01:21:18 +1100
Message-ID: <3A26625E.446AE3D@uow.edu.au>
In-Reply-To: <UTC200011292154.WAA150996.aeb@aak.cwi.nl> <Pine.GSO.4.21.0011291716190.17068-100000@weyl.math.psu.edu>
In thread "File corruption part deux", Lawrence Walton wrote:
>
> my system has been acting slightly odd on all the pre 12 kernels
> with the fs going read only with out any messages until now.
> no opps or anything like that, but I did get this just now.
>
> EXT2-fs error (device sd(8,2)): ext2_readdir:
> bad entry in directory #458430: directory entry
> across blocks - offset=152, inode=3393794200,
> rec_len=12440, name_len=73
>
3393794200 == 0xca493098. A kernel address. And 152 is 0x98,
which is of the form N * 0x20 + 0x18 (here N = 4). Read on...
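The arithmetic is easy to check by hand, but a few lines of Python make it reproducible. This is only a sketch: the 0xC0000000 kernel base assumes the default 3G/1G split on x86, and the directory-entry layout (u32 inode, u16 rec_len, u8 name_len, u8 file type) is assumed from the ext2 on-disk format rather than taken from the error message.

```python
import struct

# Values copied from the quoted EXT2-fs error message.
inode, rec_len, name_len, offset = 3393794200, 12440, 73, 152

KERNEL_BASE = 0xC0000000  # assumption: default x86 3G/1G user/kernel split

looks_like_kernel_address = inode >= KERNEL_BASE
n, remainder = divmod(offset - 0x18, 0x20)  # is offset == n * 0x20 + 0x18 ?

# Assumed ext2 directory entry header: u32 inode, u16 rec_len, u8 name_len.
# The fourth byte of the second word (file type) is not printed in the log.
raw = struct.pack("<IHB", inode, rec_len, name_len)
first_word = raw[0:4]
second_word_known = raw[4:7]

print(hex(inode), looks_like_kernel_address, (n, remainder))
print(first_word.hex(), second_word_known.hex())
```

Under those assumptions, rec_len and name_len are the low three bytes of the same value as the inode field, so this directory entry also appears to begin with two copies of one address.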
I am somewhat reluctant to report this problem because I always
run kernels with the lowish latency patch, but having reviewed
the effects of that patch on fs/*.c I don't think it's to blame.
Plus it's been 100% stable for months.
I believe that the problem I've observed is caused by or exposed
by the O_SYNC changes. Or maybe not.
Running test11-ac4 on *very* vanilla machines. x86 UP, IDE, 3c905
and really nothing else. No APM, fat, vfat, isofs, USB, audio, etc.
It has happened on two different machines which have been 100% reliable
for a year.
The problem is corruption of in-core files. It has only started
happening in the past few days. It happened after two days' uptime.
In the most recent case my /bin/ls went bad. I took a copy and
rebooted. After reboot /bin/ls had a correct MD5 sum. Here's
the diff:
--- ls.good Thu Nov 30 15:07:11 2000
+++ ls.bad Thu Nov 30 15:07:04 2000
@@ -1589,7 +1589,7 @@
006340: C7 85 F8 BF FF FF 00 00 00 00 E9 EA 02 00 00 90 >@@@@@@@@@@@@@@@@<
006350: 8B BD FC BF FF FF 8D B5 00 E0 FF FF 57 68 00 20 >@@@@@@@@@@@@Wh@ <
006360: 00 00 56 E8 3C B2 FF FF 83 C4 0C 85 C0 0F 84 DD >@@V@<@@@@@@@@@@@<
-006370: 02 00 00 6A 0A 56 E8 49 B0 FF FF 83 C4 08 85 C0 >@@@j@V@I@@@@@@@@<
+006370: 02 00 00 6A 0A 56 E8 49 78 73 62 C6 78 73 62 C6 >@@@j@V@Ixsb@xsb@<
006380: 75 2E 8D 9D 00 C0 FF FF 8B BD FC BF FF FF 57 68 >u.@@@@@@@@@@@@Wh<
006390: 00 20 00 00 53 E8 0A B2 FF FF 83 C4 0C 85 C0 74 >@ @@S@@@@@@@@@@t<
0063A0: 0F 6A 0A 53 E8 1B B0 FF FF 83 C4 08 85 C0 74 D8 >@j@S@@@@@@@@@@t@<
@@ -1709,7 +1709,7 @@
006AC0: 00 00 00 FF 75 DF 83 E8 03 40 40 2B 44 24 58 83 >@@@@u@@@@@@+D$X@<
006AD0: C0 02 89 44 24 14 EB 08 C7 44 24 14 01 00 00 00 >@@@D$@@@@D$@@@@@<
006AE0: 8B 4C 24 3C F6 C1 01 74 5B 8B 44 24 5C 8B 74 24 >@L$<@@@t[@D$\@t$<
-006AF0: 14 89 C2 83 E0 03 74 16 7A 0F 83 F8 02 74 05 38 >@@@@@@t@z@@@@t@8<
+006AF0: 14 89 C2 83 E0 03 74 16 F8 7A 62 C6 F8 7A 62 C6 >@@@@@@t@@zb@@zb@<
006B00: 22 74 2F 42 38 22 74 2A 42 38 22 74 25 42 8B 02 >"t/B8"t*B8"t%B@@<
006B10: 84 E0 75 08 84 C0 74 1A 84 E4 74 15 A9 00 00 FF >@@u@@@t@@@t@@@@@<
006B20: 00 74 0D 83 C2 04 A9 00 00 00 FF 75 E1 83 EA 03 >@t@@@@@@@@@u@@@@<
@@ -1733,7 +1733,7 @@
006C40: 4C 24 54 40 51 50 E8 C9 A7 FF FF 83 C4 08 83 7C >L$T@QP@@@@@@@@@|<
006C50: 24 1C 00 74 38 C6 00 2C 8B 5C 24 3C 40 8B 4C 24 >$@@t8@@,@\$<@@L$<
006C60: 3C 83 E3 01 F6 C1 02 74 0E 8B 74 24 58 56 50 E8 ><@@@@@@t@@t$XVP@<
-006C70: A0 A7 FF FF 83 C4 08 85 DB 74 12 C6 00 5F 8B 4C >@@@@@@@@@t@@@_@L<
+006C70: A0 A7 FF FF 83 C4 08 85 78 7C 62 C6 78 7C 62 C6 >@@@@@@@@x|b@x|b@<
006C80: 24 5C 40 51 50 E8 8A A7 FF FF 83 C4 08 C6 00 2F >$\@QP@@@@@@@@@@/<
006C90: 31 FF 8B 74 24 60 40 56 50 E8 76 A7 FF FF 83 C4 >1@@t$`@VP@v@@@@@<
006CA0: 08 8B 4C 24 30 8B 29 85 ED 74 31 90 8D 74 26 00 >@@L$0@)@@t1@@t&@<
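The changed rows in the hunks above can be decoded mechanically. The sketch below hardcodes the three "+" rows from the diff and reads the last eight bytes of each as two little-endian 32-bit words. The names bad_rows and decode are made up for this illustration.

```python
import struct

# The three corrupted ("+") rows from the diff: (row offset, hex bytes).
bad_rows = [
    (0x6370, "02 00 00 6A 0A 56 E8 49 78 73 62 C6 78 73 62 C6"),
    (0x6AF0, "14 89 C2 83 E0 03 74 16 F8 7A 62 C6 F8 7A 62 C6"),
    (0x6C70, "A0 A7 FF FF 83 C4 08 85 78 7C 62 C6 78 7C 62 C6"),
]

def decode(row_offset, hex_bytes):
    """Return (file offset, first word, second word) for the last 8 bytes."""
    data = bytes.fromhex(hex_bytes)
    first, second = struct.unpack_from("<II", data, 8)  # little-endian x86
    return row_offset + 8, first, second

decoded = [decode(off, row) for off, row in bad_rows]
for file_offset, first, second in decoded:
    print(f"{file_offset:#06x}: {first:#010x} {second:#010x} "
          f"offset%0x20={file_offset % 0x20:#x} value%0x20={first % 0x20:#x}")
```

One more thing the decoded values show: each word shares its low 12 bits with its own file offset (0x378, 0xAF8, 0xC78). That would fit each word holding the address of the very memory it was written to, if the file page sat in a page-aligned kernel page. This is an inference from the dump, not an established fact.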
Note that in both my cases (and, apparently, Lawrence's) the
corrupted data consists of two identical kernel addresses, each
of which has a value of the form
N * 0x20 + 0x18
And they occur at a file offset which is also of the form
N * 0x20 + 0x18
Which leads one to believe that someone somewhere is doing
an INIT_LIST_HEAD() on a wild pointer.
Or, more likely, someone is doing a list_del() on a list_head
which points at recycled memory, and that list_head resides
within a structure at offset 0x18.
And that description perfectly matches the new i_dirty_buffers
field in struct inode.
Which would perhaps indicate that one of the following statements:
- the list_del in buffer_insert_inode_queue()
- the list_del in __remove_inode_queue()
- the list_del in fsync_inode_buffers()
has gotten itself a wild pointer.
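To see how a single list_del() could leave exactly this pattern, here is a small user-space model. It is a sketch under stated assumptions, not kernel code: Memory, PAGE_BASE and BH_BASE are invented for the illustration, PAGE_BASE is inferred from the values in the ls diff, and the scenario assumes one remaining list entry whose next and prev both still point at a list_head inside memory that has since been recycled as a page of /bin/ls.

```python
import struct

PAGE_BASE = 0xC6627000  # assumption: kernel address of the recycled page
BH_BASE = 0xC1000000    # hypothetical address of the surviving list entry

class Memory:
    """Tiny model of 32-bit little-endian memory made of separate regions."""

    def __init__(self):
        self.regions = {}  # base address -> bytearray

    def add_region(self, base, data):
        self.regions[base] = bytearray(data)

    def _find(self, addr):
        for base, buf in self.regions.items():
            if base <= addr and addr + 4 <= base + len(buf):
                return buf, addr - base
        raise ValueError(f"wild access at {addr:#x}")

    def read32(self, addr):
        buf, off = self._find(addr)
        return struct.unpack_from("<I", buf, off)[0]

    def write32(self, addr, value):
        buf, off = self._find(addr)
        struct.pack_into("<I", buf, off, value)

def list_del(mem, entry):
    """The two stores list_del() performs: unlink entry from its neighbours."""
    nxt = mem.read32(entry)       # entry->next
    prev = mem.read32(entry + 4)  # entry->prev
    mem.write32(nxt + 4, prev)    # next->prev = prev
    mem.write32(prev, nxt)        # prev->next = next

mem = Memory()

# The recycled page now holds file data: the good row from offset 0x6370.
page = bytearray(0x1000)
page[0x370:0x380] = bytes.fromhex("02 00 00 6A 0A 56 E8 49 B0 FF FF 83 C4 08 85 C0")
mem.add_region(PAGE_BASE, page)

# A structure used to live at a 0x20-aligned address, list_head at +0x18.
stale_head = PAGE_BASE + 0x360 + 0x18

# The surviving entry still believes it is the only node on that list.
mem.add_region(BH_BASE, struct.pack("<II", stale_head, stale_head))

list_del(mem, BH_BASE)
corrupted_row = bytes(mem.regions[PAGE_BASE][0x370:0x380])
print(" ".join(f"{b:02X}" for b in corrupted_row))
```

With these inputs the modelled page row ends up identical to the corrupted row at 006370 in the diff. That shows the hypothesis is consistent with the dump; it does not show which list_del is responsible.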
Other possible candidates apart from i_dirty_buffers which
have a list_head at offset 0x18 and whose list_dels should
be reviewed are:
- request_queue.elevator.queue
- dentry.d_hash
- anything which has a timer_list at offset 0x18
- anything which has a waitqueue at offset 0x14
There may be others which have list_heads at 0x38, 0x58, ...
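Reviewing candidates comes down to finding embedded list_heads whose offset is 0x18 modulo 0x20. A rough way to mechanise that is sketched below with ctypes. The member order in InodePrefix is an assumption chosen to be consistent with the 0x18 offset quoted above, pointers are modelled as 32-bit integers to mimic x86, and the helper names are hypothetical.

```python
import ctypes

class ListHead32(ctypes.Structure):
    """struct list_head as laid out on 32-bit x86: two 4-byte pointers."""
    _fields_ = [("next", ctypes.c_uint32), ("prev", ctypes.c_uint32)]

class InodePrefix(ctypes.Structure):
    """Assumed leading members of struct inode; only the layout matters."""
    _fields_ = [
        ("i_hash", ListHead32),
        ("i_list", ListHead32),
        ("i_dentry", ListHead32),
        ("i_dirty_buffers", ListHead32),
    ]

def matches_pattern(offset, align=0x20, residue=0x18):
    """True for offsets 0x18, 0x38, 0x58, ... within an aligned object."""
    return offset % align == residue

def suspect_fields(struct_type):
    """Names of embedded list_heads whose offset fits the pattern."""
    return [
        name
        for name, field_type in struct_type._fields_
        if field_type is ListHead32
        and matches_pattern(getattr(struct_type, name).offset)
    ]

print(hex(InodePrefix.i_dirty_buffers.offset))
print(suspect_fields(InodePrefix))
```

The same two helpers could be pointed at a model of any other structure on the candidate list, provided its layout is transcribed accurately from the headers of the kernel being debugged.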
This doesn't just happen a single time. The first time it happened,
during a CVS commit, at least eight files on the server ended up with
this corruption, as did /usr/lib/netscape/netscape-communicator,
so we had a whole bunch of corruptions happening in a short
period of time.
It takes a very bad kernel bug to be able to crash netscape.
Anyway, something to be thinking about. I've written the
canonical list_head debugging code. I'll run that overnight
and finish it off tomorrow.
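The debugging code itself is not shown in this mail. As a rough idea of what such checks can look like, here is a user-space sketch of a list_del that verifies its neighbours before unlinking. Everything in it is an assumption for illustration: memory is a plain dict of address to 32-bit word, and the poison constants and exception name are hypothetical.

```python
POISON_NEXT = 0x00100100  # hypothetical markers for a deleted entry
POISON_PREV = 0x00200200

class ListCorruption(Exception):
    """Raised when a list_head's neighbours do not point back at it."""

def checked_list_del(mem, entry):
    """Unlink entry, but verify its neighbours first.

    mem maps an address to the 32-bit word stored there; a list_head at
    address a keeps next at a and prev at a + 4.
    """
    nxt, prev = mem[entry], mem[entry + 4]
    if mem.get(nxt + 4) != entry:
        raise ListCorruption(f"next->prev of {entry:#x} does not point back")
    if mem.get(prev) != entry:
        raise ListCorruption(f"prev->next of {entry:#x} does not point back")
    mem[nxt + 4] = prev  # next->prev = prev
    mem[prev] = nxt      # prev->next = next
    mem[entry], mem[entry + 4] = POISON_NEXT, POISON_PREV

head, node = 0xC6627378, 0xC1000000  # illustrative addresses

# Healthy case: head and node form a two-element ring.
healthy = {head: node, head + 4: node, node: head, node + 4: head}
checked_list_del(healthy, node)

# Stale case: the memory behind head now holds file data (the eight good
# bytes at offset 0x6378 of ls), while node still points at it.
stale = {head: 0x83FFFFB0, head + 4: 0xC00885C4, node: head, node + 4: head}
try:
    checked_list_del(stale, node)
    caught = False
except ListCorruption as err:
    caught = True
    print("caught:", err)
```

In the stale case the check fires before either store happens, so the recycled memory is left untouched and the offending caller is identified at the point of the bad list_del.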