From: Andrew Morton <andrewm@uow.edu.au>
To: Alexander Viro <viro@math.psu.edu>
Cc: Andries.Brouwer@cwi.nl, torvalds@transmeta.com,
	linux-kernel@vger.kernel.org, tigran@veritas.com,
	"Stephen C. Tweedie" <sct@redhat.com>,
	Lawrence Walton <lawrence@the-penguin.otak.com>
Subject: Re: corruption
Date: Fri, 01 Dec 2000 01:21:18 +1100
Message-ID: <3A26625E.446AE3D@uow.edu.au>
In-Reply-To: <UTC200011292154.WAA150996.aeb@aak.cwi.nl> <Pine.GSO.4.21.0011291716190.17068-100000@weyl.math.psu.edu>

In thread "File corruption part deux", Lawrence Walton wrote:
> 
> my system has been acting slightly odd on all the pre 12 kernels
> with the fs going read only with out any messages until now.
> no opps or anything like that, but I did get this just now.
> 
> EXT2-fs error (device sd(8,2)): ext2_readdir:
> bad entry in directory #458430: directory entry
> across blocks - offset=152, inode=3393794200,
> rec_len=12440, name_len=73
>

3393794200 == 0xca493098.  A kernel address. And 152 is 0x98,
which is equal to N * 0x20 + 0x18. Read on...
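
As a quick check of those numbers, here is a small sketch.  It
assumes the start of the directory entry was overwritten by two
identical little-endian 32-bit words, and that the ext2 entry header
is laid out as a 32-bit inode, a 16-bit rec_len, an 8-bit name_len
and an 8-bit file_type.  The function name is made up for
illustration.

```python
import struct

# Values copied from the EXT2-fs error message quoted above.
INODE, REC_LEN, NAME_LEN, OFFSET = 3393794200, 12440, 73, 152


def decode_two_words(word):
    """Decode two copies of `word` as an ext2 directory entry header.

    Assumed layout (little-endian): u32 inode, u16 rec_len,
    u8 name_len, u8 file_type.
    """
    raw = struct.pack("<II", word, word)
    return struct.unpack("<IHBB", raw)


inode, rec_len, name_len, file_type = decode_two_words(0xCA493098)

assert inode == INODE                      # 3393794200 == 0xca493098
assert OFFSET == 0x98 == 4 * 0x20 + 0x18   # 152 has the N * 0x20 + 0x18 form
assert (rec_len, name_len) == (REC_LEN, NAME_LEN)
print(hex(inode), rec_len, name_len)
```

Under those assumptions the reported rec_len and name_len are just
the low bytes of a second copy of the same address, so the error
message is consistent with a pair of identical words at offset 152.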

I am somewhat reluctant to report this problem because I always
run kernels with the lowish latency patch, but having reviewed
the effects of that patch on fs/*.c I don't think it's to blame.
Plus it's been 100% stable for months.

I believe that the problem I've observed is caused by or exposed
by the O_SYNC changes.  Or maybe not.

Running test11-ac4 on *very* vanilla machines.  x86 UP, IDE, 3c905
and really nothing else.  No APM, fat, vfat, isofs, USB, audio, etc.
It has happened on two different machines which have been 100% reliable
for a year.

The problem is corruption of in-core files.  It has only started
happening in the past few days, and it showed up after two days'
uptime.  In the most recent case my /bin/ls went bad.  I took a copy
and rebooted.  After reboot /bin/ls had a correct MD5 sum, so the
on-disk copy was fine.  Here's the diff:

--- ls.good	Thu Nov 30 15:07:11 2000
+++ ls.bad	Thu Nov 30 15:07:04 2000
@@ -1589,7 +1589,7 @@
 006340: C7 85 F8 BF FF FF 00 00 00 00 E9 EA 02 00 00 90      >@@@@@@@@@@@@@@@@<
 006350: 8B BD FC BF FF FF 8D B5 00 E0 FF FF 57 68 00 20      >@@@@@@@@@@@@Wh@ <
 006360: 00 00 56 E8 3C B2 FF FF 83 C4 0C 85 C0 0F 84 DD      >@@V@<@@@@@@@@@@@<
-006370: 02 00 00 6A 0A 56 E8 49 B0 FF FF 83 C4 08 85 C0      >@@@j@V@I@@@@@@@@<
+006370: 02 00 00 6A 0A 56 E8 49 78 73 62 C6 78 73 62 C6      >@@@j@V@Ixsb@xsb@<
 006380: 75 2E 8D 9D 00 C0 FF FF 8B BD FC BF FF FF 57 68      >u.@@@@@@@@@@@@Wh<
 006390: 00 20 00 00 53 E8 0A B2 FF FF 83 C4 0C 85 C0 74      >@ @@S@@@@@@@@@@t<
 0063A0: 0F 6A 0A 53 E8 1B B0 FF FF 83 C4 08 85 C0 74 D8      >@j@S@@@@@@@@@@t@<
@@ -1709,7 +1709,7 @@
 006AC0: 00 00 00 FF 75 DF 83 E8 03 40 40 2B 44 24 58 83      >@@@@u@@@@@@+D$X@<
 006AD0: C0 02 89 44 24 14 EB 08 C7 44 24 14 01 00 00 00      >@@@D$@@@@D$@@@@@<
 006AE0: 8B 4C 24 3C F6 C1 01 74 5B 8B 44 24 5C 8B 74 24      >@L$<@@@t[@D$\@t$<
-006AF0: 14 89 C2 83 E0 03 74 16 7A 0F 83 F8 02 74 05 38      >@@@@@@t@z@@@@t@8<
+006AF0: 14 89 C2 83 E0 03 74 16 F8 7A 62 C6 F8 7A 62 C6      >@@@@@@t@@zb@@zb@<
 006B00: 22 74 2F 42 38 22 74 2A 42 38 22 74 25 42 8B 02      >"t/B8"t*B8"t%B@@<
 006B10: 84 E0 75 08 84 C0 74 1A 84 E4 74 15 A9 00 00 FF      >@@u@@@t@@@t@@@@@<
 006B20: 00 74 0D 83 C2 04 A9 00 00 00 FF 75 E1 83 EA 03      >@t@@@@@@@@@u@@@@<
@@ -1733,7 +1733,7 @@
 006C40: 4C 24 54 40 51 50 E8 C9 A7 FF FF 83 C4 08 83 7C      >L$T@QP@@@@@@@@@|<
 006C50: 24 1C 00 74 38 C6 00 2C 8B 5C 24 3C 40 8B 4C 24      >$@@t8@@,@\$<@@L$<
 006C60: 3C 83 E3 01 F6 C1 02 74 0E 8B 74 24 58 56 50 E8      ><@@@@@@t@@t$XVP@<
-006C70: A0 A7 FF FF 83 C4 08 85 DB 74 12 C6 00 5F 8B 4C      >@@@@@@@@@t@@@_@L<
+006C70: A0 A7 FF FF 83 C4 08 85 78 7C 62 C6 78 7C 62 C6      >@@@@@@@@x|b@x|b@<
 006C80: 24 5C 40 51 50 E8 8A A7 FF FF 83 C4 08 C6 00 2F      >$\@QP@@@@@@@@@@/<
 006C90: 31 FF 8B 74 24 60 40 56 50 E8 76 A7 FF FF 83 C4      >1@@t$`@VP@v@@@@@<
 006CA0: 08 8B 4C 24 30 8B 29 85 ED 74 31 90 8D 74 26 00      >@@L$0@)@@t1@@t&@<


Note that in both my cases (and, apparently, Lawrence's) the
corrupted data consists of two adjacent, identical kernel addresses,
each of the form

	N * 0x20 + 0x18

And they occur at a file offset which is also of the form

	N * 0x20 + 0x18

Which leads one to believe that someone somewhere is doing
an INIT_LIST_HEAD() on a wild pointer.
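
The pattern can be checked mechanically against the three changed
rows of the diff.  This is only a sketch: the rows are copied from
the "+" lines of the diff, and the helper name is made up.

```python
# (row offset, row bytes) copied from the "+" lines of the ls diff.
BAD_ROWS = [
    (0x006370, "02 00 00 6A 0A 56 E8 49 78 73 62 C6 78 73 62 C6"),
    (0x006AF0, "14 89 C2 83 E0 03 74 16 F8 7A 62 C6 F8 7A 62 C6"),
    (0x006C70, "A0 A7 FF FF 83 C4 08 85 78 7C 62 C6 78 7C 62 C6"),
]


def corrupted_words(row_offset, hex_bytes):
    """Return (file_offset, first_word, second_word) for bytes 8..15 of a row.

    In all three rows the damage is confined to the last 8 bytes,
    which are read here as two little-endian 32-bit words.
    """
    data = bytes.fromhex(hex_bytes)
    first = int.from_bytes(data[8:12], "little")
    second = int.from_bytes(data[12:16], "little")
    return row_offset + 8, first, second


for row_offset, hex_bytes in BAD_ROWS:
    offset, first, second = corrupted_words(row_offset, hex_bytes)
    assert first == second                       # two identical words
    assert offset % 0x20 == 0x18                 # file offset pattern
    assert first % 0x20 == 0x18                  # address pattern
    assert (first & 0xFFF) == (offset & 0xFFF)   # same offset within a 4 KiB page
    print(f"{offset:#x}: {first:#x} {second:#x}")
```

The last assertion is an extra observation rather than part of the
original argument: in this dump the low 12 bits of each word equal
the low 12 bits of its file offset.  Assuming 4 KiB pages, that is
what a list_head pointing at itself would look like.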

Or, more likely, someone is doing a list_del() on a list_head
which points at recycled memory, and that list_head resides
within a structure at offset 0x18.

And that description perfectly matches the new i_dirty_buffers
field in struct inode.

Which would perhaps indicate that one of the following statements:

- the list_del in buffer_insert_inode_queue(),

- the list_del in __remove_inode_queue(), or

- the list_del in fsync_inode_buffers()

has gotten itself a wild pointer.
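
To illustrate the hypothesis, here is a toy simulation.  It is not
the kernel code path.  It assumes a 32-bit little-endian machine and
a list_head made of a next pointer followed by a prev pointer.  The
addresses, the 0x90 fill byte and the Memory class are all made up;
only the two stores in list_del() mirror what an unlink does.

```python
import struct

PAGE_BASE = 0xC6627000   # hypothetical page that later holds file data
HEAP_BASE = 0xC1000000   # hypothetical page holding a still-live list entry
NEXT, PREV = 0, 4        # field offsets inside a 32-bit list_head


class Memory:
    """Toy address space made of two 4 KiB pages."""

    def __init__(self):
        self.regions = {PAGE_BASE: bytearray(4096), HEAP_BASE: bytearray(4096)}

    def _locate(self, addr):
        base = addr & ~0xFFF
        return self.regions[base], addr - base

    def read32(self, addr):
        buf, off = self._locate(addr)
        return struct.unpack_from("<I", buf, off)[0]

    def write32(self, addr, value):
        buf, off = self._locate(addr)
        struct.pack_into("<I", buf, off, value)


def list_add(mem, new, head):
    """Insert `new` right after `head`."""
    nxt = mem.read32(head + NEXT)
    mem.write32(nxt + PREV, new)
    mem.write32(new + NEXT, nxt)
    mem.write32(new + PREV, head)
    mem.write32(head + NEXT, new)


def list_del(mem, entry):
    """Unlink `entry`: two stores through its next and prev pointers."""
    nxt, prv = mem.read32(entry + NEXT), mem.read32(entry + PREV)
    mem.write32(nxt + PREV, prv)
    mem.write32(prv + NEXT, nxt)


def simulate():
    mem = Memory()
    head = PAGE_BASE + 0x360 + 0x18    # list_head at offset 0x18 of an object
    entry = HEAP_BASE + 0x100
    mem.write32(head + NEXT, head)     # empty list: head points at itself
    mem.write32(head + PREV, head)
    list_add(mem, entry, head)
    # The object containing `head` is freed and its page is reused for
    # file data, but `entry` still holds pointers to the old head.
    mem.regions[PAGE_BASE][:] = b"\x90" * 4096
    list_del(mem, entry)               # stale unlink scribbles on the file data
    return bytes(mem.regions[PAGE_BASE][0x378:0x380])


print(simulate().hex(" "))
```

With these made-up addresses the eight bytes left at page offset
0x378 are two copies of the old head's own address, the same shape
as the damage in the first changed row of the diff.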


Other possible candidates apart from i_dirty_buffers which
have a list_head at offset 0x18 and whose list_dels should
be reviewed are:

- request_queue.elevator.queue
- dentry.d_hash
- anything which has a timer_list at offset 0x18
- anything which has a waitqueue at offset 0x14

There may be others which have list_heads at 0x38, 0x58, ...

This doesn't happen just once.  The first time it happened, during
a CVS commit, at least eight files on the server ended up with
this corruption, as did /usr/lib/netscape/netscape-communicator,
so we had a whole bunch of corruptions happening in a short
period of time.

It takes a very bad kernel bug to be able to crash netscape.

Anyway, something to be thinking about.  I've written the
canonical list_head debugging code.  I'll run that overnight
and finish it off tomorrow.
