Re: Errors and later panics in 2.6.0-test11.

From: Jens Axboe <axboe@suse.de>
To: Linus Torvalds <torvalds@osdl.org>
Cc: "David Martínez Moreno" <ender@debian.org>,
	"Kernel Mailing List" <linux-kernel@vger.kernel.org>,
	clubinfo.servers@adi.uam.es, "Ingo Molnar" <mingo@elte.hu>,
	"Neil Brown" <neilb@cse.unsw.edu.au>
Subject: Re: Errors and later panics in 2.6.0-test11.
Date: Thu, 4 Dec 2003 15:14:34 +0100
Message-ID: <20031204141434.GF1086@suse.de>
In-Reply-To: <20031204140738.GE1086@suse.de>

On Thu, Dec 04 2003, Jens Axboe wrote:
> On Thu, Dec 04 2003, Jens Axboe wrote:
> > On Wed, Dec 03 2003, Linus Torvalds wrote:
> > > 
> > > 
> > > On Wed, 3 Dec 2003, David Martínez Moreno wrote:
> > > >
> > > > 	I've just rebooted about six hours ago, and it's giving panics elsewhere:
> > > >
> > > > [...]
> > > > Ending XFS recovery on filesystem: md0 (dev: md0)
> > > > b44: eth0: Link is down.
> > > > b44: eth0: Link is up at 100 Mbps, full duplex.
> > > > b44: eth0: Flow control is on for TX and on for RX.
> > > > eth0: no IPv6 routers present
> > > > Unable to handle kernel paging request at virtual address 00100104
> > > 
> > > That's the LIST_POISON stuff: 00100100 is the "bad list pointer". Somebody
> > > tried to remove a page twice.
> > > 
> > > Doesn't mean a lot - if your "struct page" got corrupted, anything can
> > > happen. Quite possibly it's a double free.
> > > 
> > > > 	I can rebuild the Debian mirror for not using the RAID and using the SATA
> > > > disks separately, but will be tomorrow, it's a lot of space to move, and I
> > > > need remote intervention.
> > > >
> > > > 	Anyway I'd love to know before doing if it will be useful, looking at what
> > > > Jens has said just ten minutes ago about RAIDs 0/5. Will it help to you? Say
> > > > so and I'll go for it.
> > > 
> > > It might be more useful to leave it as RAID0, if you're willing to try out
> > > patches to try to debug this. The slab-debugging thing I sent out earlier
> > > is one such patch (but may well cause out-of-memory problems under load),
> > > and possibly the atomic-decrement checker patch (appended). And maybe Jens
> > > and Neil can come up with something..
> > 
> > I can reproduce on raid5 with linear dm on top (using XFS). I need to
> > kill the slab and memory debugging, I've put some bio debugging in there
> > instead (the memory debugging interferes with it). It's definitely a bio
> > use after free case, clone_endio() ends up with a freed bio.
> > 
> > Program received signal SIGEMT, Emulation trap.
> > 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
> > 1009                                        wbi = wbi2;
> > (gdb) bt
> > #0  0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
> > #1  0xc02de31f in raid5d (mddev=0xdfd2d200) at drivers/md/raid5.c:1436
> > #2  0xc02e675a in md_thread (arg=0xdffdc1a0) at drivers/md/md.c:2692
> > #3  0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
> > 
> > wbi (dev->written) has already been freed by someone else.
> > 
> > My puny 512MB test box cannot use your slab-debug patch :). The
> > atomic-checker didn't catch anything.
> 
> I can just as easily reproduce with ext2, so I don't think XFS has
> anything to do with it. This is still raid5 with dm linear on top.
> 
> (gdb) bt
> #0  0x0b10dead in ?? ()
> #1  0xc0170e25 in bio_endio (bio=0xda37fae0, bytes_done=0, error=0)
>     at fs/bio.c:689
> #2  0xc02e8a0d in clone_endio (bio=0xda38c120, done=7168, error=0)
>     at drivers/md/dm.c:266
> #3  0xc02dd78c in handle_stripe (sh=0xdfc18de0) at drivers/md/raid5.c:1209
> #4  0xc02de38f in raid5d (mddev=0xdfd62200) at drivers/md/raid5.c:1437
> #5  0xc02e67da in md_thread (arg=0xdfd58260) at drivers/md/md.c:2692
> #6  0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
> (gdb) p *(struct bio *) 0xda37fae0
> $1 = {bi_sector = 42974, bi_next = 0x0, bi_bdev = 0xb10dead,
>   bi_flags = 1041, bi_rw = 1, bi_vcnt = 3, bi_idx = 0,
>   bi_phys_segments = 0, bi_hw_segments = 0, bi_size = 0,
>   bi_max_vecs = 0, bi_io_vec = 0xda357ce0, bi_end_io = 0xb10dead,
>   bi_cnt = {counter = 0}, bi_private = 0xb10dead,
>   bi_destructor = 0xc0170050 <bio_destructor>, free_eip = 0xc02e8a26}
> 
> EIP is bad, bio debug magic 0x0b10dead. bi_flags has the uptodate bit
> set, the clone bit, and the 10th bit (debug dead bit). The bio itself
> was freed by 0xc02e8a26, dm.c:clone_endio().

As far as I can see, dm bio handling looks perfectly fine (no bad usage
in there wrt references). So this looks more and more like a raid
problem. I tried looking at handle_stripe() and associated raid5
functions, but I think I'll leave that to Neil... It's not a
straightforward read.

-- 
Jens Axboe
