frequent slab corruption (since a long time)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* frequent slab corruption (since a long time)
@ 2006-08-02  2:16 Dave Jones
  2006-08-02  2:34 ` Roland Dreier
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dave Jones @ 2006-08-02  2:16 UTC (permalink / raw)
  To: Linux Kernel

Every so often, I see a slab corruption bug reported against
the Fedora kernels (going back as far as 2.6.11), and it's
still plagueing us.

It seems to have turned up in a number of different scenarios,
which makes it all the more complicated, but the footprint is
always the same. We write ffffffff00000000 to freed memory.
All the example cases seen so far have been on 32-bit x86.

Anyone have any clues where that value could be coming from?

There's a collection of corruption reports at
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160878

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  2:16 frequent slab corruption (since a long time) Dave Jones
@ 2006-08-02  2:34 ` Roland Dreier
  2006-08-02  3:37 ` Andi Kleen
  2006-08-02  5:05 ` David Miller
  2 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2006-08-02  2:34 UTC (permalink / raw)
  To: Dave Jones; +Cc: Linux Kernel

 > There's a collection of corruption reports at
 > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=160878

I notice that the first few reports (kernels <= 2.6.14) have len=4096,
while the later reports (kernels >= 2.6.16) have len=2048.  So
assuming it's the same bug, the use after free has moved from a
4096-byte slab to a 2048-byte slab.  Were there ny data structures we
shrank between 2.6.14 and 2.6.16?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  2:16 frequent slab corruption (since a long time) Dave Jones
  2006-08-02  2:34 ` Roland Dreier
@ 2006-08-02  3:37 ` Andi Kleen
  2006-08-02  4:22   ` Dave Jones
  2006-08-02  5:05 ` David Miller
  2 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2006-08-02  3:37 UTC (permalink / raw)
  To: Dave Jones; +Cc: linux-kernel

Dave Jones <davej@redhat.com> writes:

> Every so often, I see a slab corruption bug reported against
> the Fedora kernels (going back as far as 2.6.11), and it's
> still plagueing us.
> 
> It seems to have turned up in a number of different scenarios,
> which makes it all the more complicated, but the footprint is
> always the same. We write ffffffff00000000 to freed memory.

DEBUG_PAGEALLOC + a small slab patch to force the 2k slab to be
only a single object per page (so that a kfree() immediately
triggers an unmap) would catch it I guess.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  3:37 ` Andi Kleen
@ 2006-08-02  4:22   ` Dave Jones
  2006-08-02  4:35     ` Andi Kleen
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Jones @ 2006-08-02  4:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, Aug 02, 2006 at 05:37:28AM +0200, Andi Kleen wrote:
 > Dave Jones <davej@redhat.com> writes:
 > 
 > > Every so often, I see a slab corruption bug reported against
 > > the Fedora kernels (going back as far as 2.6.11), and it's
 > > still plagueing us.
 > > 
 > > It seems to have turned up in a number of different scenarios,
 > > which makes it all the more complicated, but the footprint is
 > > always the same. We write ffffffff00000000 to freed memory.
 > 
 > DEBUG_PAGEALLOC + a small slab patch to force the 2k slab to be
 > only a single object per page (so that a kfree() immediately
 > triggers an unmap) would catch it I guess.

Problem with that approach is that DEBUG_PAGEALLOC makes things
so damned slow that it's pretty much unusable, and this bug
doesn't seem to want to repeat itself to order, so I doubt
many people would put up with the slowdown long enough to chase it down.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  4:22   ` Dave Jones
@ 2006-08-02  4:35     ` Andi Kleen
  2006-08-02  4:46       ` Dave Jones
  0 siblings, 1 reply; 14+ messages in thread
From: Andi Kleen @ 2006-08-02  4:35 UTC (permalink / raw)
  To: Dave Jones; +Cc: linux-kernel

On Wednesday 02 August 2006 06:22, Dave Jones wrote:

> Problem with that approach is that DEBUG_PAGEALLOC makes things
> so damned slow that it's pretty much unusable, and this bug
> doesn't seem to want to repeat itself to order, so I doubt
> many people would put up with the slowdown long enough to chase it down.

Really?  It shouldn't be that much slower in theory. Do you
have numbers?

If it's a big problem it could probably be made faster by batching
the TLB flushes more.

-Andi

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  4:35     ` Andi Kleen
@ 2006-08-02  4:46       ` Dave Jones
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Jones @ 2006-08-02  4:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel

On Wed, Aug 02, 2006 at 06:35:00AM +0200, Andi Kleen wrote:
 > On Wednesday 02 August 2006 06:22, Dave Jones wrote:
 > 
 > > Problem with that approach is that DEBUG_PAGEALLOC makes things
 > > so damned slow that it's pretty much unusable, and this bug
 > > doesn't seem to want to repeat itself to order, so I doubt
 > > many people would put up with the slowdown long enough to chase it down.
 > 
 > Really?  It shouldn't be that much slower in theory. Do you
 > have numbers?

You need slower boxes :-)

Every time I enable it to try and diagnose a bug in the Fedora kernel
I get a flood of "hey what gives, everything got slow" emails.
That speaks louder than any numbers to me.

It could be less of an issue on modern CPUs than it used to be, but
it has been painful enough in the past that I've really only enabled
it when I've been desperately trying to chase something down.

 > If it's a big problem it could probably be made faster by batching
 > the TLB flushes more.

Maybe, though that gives me the creeps a little for some reason.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  2:16 frequent slab corruption (since a long time) Dave Jones
  2006-08-02  2:34 ` Roland Dreier
  2006-08-02  3:37 ` Andi Kleen
@ 2006-08-02  5:05 ` David Miller
  2006-08-02  5:31   ` David Miller
  2 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2006-08-02  5:05 UTC (permalink / raw)
  To: davej; +Cc: linux-kernel

From: Dave Jones <davej@redhat.com>
Date: Tue, 1 Aug 2006 22:16:17 -0400

> Anyone have any clues where that value could be coming from?

Some of the dumps in there looks like ethernet headers.  For example,
in comment #7:

000: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a
010: 5a 5a 00 11 85 6a 0f ef 00 e0 52 cf 6c 00 08 00
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That looks like an ethernet header for an IPv4 packet to ethernet
destination MAC 00:11:85:6a:0f:ef from ethernet source MAC
00:e0:52:cf:6c:00

But this chunk is OK since we are looking at a neighbouring
INUSE object.

The reocurring corruption:

0b0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ff ff ff ff
0c0: 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

is less identifyable.  Although tg3_alloc_rx_skb() shows up
consistently in the neighbouring objects, the actually
corrupted piece is marked by release_mem() which is the
TTY layer.

The corruption is always a 32-bit 0xffffffff followed by
a 32-bit 0x00000000, 12 bytes into the object.

If it's sitting next to RX ethernet packets, it's probably
something in the vacinity of 1500+ bytes in size as that's
how large the RX skb data areas will be that the tg3 driver
allocates unless it is in Jumbo MTU mode.

By default, do_tty_write() will use a chunk size of 2048 for the write
buffer, tty->write_buf, and this is freed up as part of release_mem()
processing.

Another possibility, is the struct tty_struct itself since that is a
sizable structure too.  And this theory is supported by
alloc_tty_struct() being in some of the triggering backtraces.

Perhaps a TTY refcounting problem or race condition of some sort.

What is 12-bytes into the tty_struct on x86?  The struct tty_ldisc,
"ldisc" is.  Oddly enough, this doesn't match up, since we'd expect
TTY_LDISC_MAGIC (0x5403) and instead we see 0xffffffff there.
Also, after the magic, we'd expect the address of the "n_tty"
string in tty_ldisc_N_TTY to show up in the next word, instead
we find NULL (0x00000000) there.

Something is clobbering the ldisc member of this free'd tty_struct is
seems.  Maybe there is some problem with ldisc refcounting.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  5:05 ` David Miller
@ 2006-08-02  5:31   ` David Miller
  2006-08-02 22:23     ` Dave Jones
  0 siblings, 1 reply; 14+ messages in thread
From: David Miller @ 2006-08-02  5:31 UTC (permalink / raw)
  To: davej; +Cc: linux-kernel

From: David Miller <davem@davemloft.net>
Date: Tue, 01 Aug 2006 22:05:38 -0700 (PDT)

> The corruption is always a 32-bit 0xffffffff followed by
> a 32-bit 0x00000000, 12 bytes into the object.

This analysis is wrong, it's "0xb0 + 12" bytes into the object
which is 188 bytes.  For x86, this lands us at the "count"
member of the tty_struct, and it shows that the tty count
has decremented to -1 (0xffffffff) which is a serious bug.

Note also that the ws_row and ws_col fields of the tty->winsize are
next and has been zero'd out (0x00000000) by the corruption.

There aren't too many places that write to tty->winsize,
the most notable is tiocswinsz(), called from tty_ioctl.

For a PTY with sub-type PTY_TYPE_MASTE, which is very likely
the case we have here, it assigns the user provided winsize
to both tty->winsize and tty->link->winsz (via the real_tty)
argument to tiocswinsz().

Actually, there is one other notable spot that writes to
tty->winsize, drivers/char/vt.c:vc_resize(), which copies
it to vc->vc_tty->winsize.

Perhaps this is a clue... but wait there is a better clue.

con_open() sets values there, and in particular it only
assigns the ws_row and ws_col members, which matches up
with the fact that we see only the first two members of
tty->winsize spammed to zero.

If the guilty code path involved vc_resize() or tiocswinsz()
we'd see 8 bytes, not 4 bytes, of the tty->winsize set to
something other than the SLAB free poison chars.

So it seems, from this perspective, that con_open()'s assignments
which are causing the corruption.  This is backed up by the trace
found in bugzilla #200928 being early during bootup.

But even this doesn't add up because con_open() has to be seeing
zero's there at the time it runs, not the poison values, since it's
assignments are guarded like this:

	if (!tty->winsize.ws_row && !tty->winsize.ws_col) {
		tty->winsize.ws_row = vc_cons[currcons].d->vc_rows;
		tty->winsize.ws_col = vc_cons[currcons].d->vc_cols;
	}

Hmmm...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02  5:31   ` David Miller
@ 2006-08-02 22:23     ` Dave Jones
  2006-08-02 22:49       ` David Miller
  2006-08-02 23:14       ` Alan Cox
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Jones @ 2006-08-02 22:23 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, jbaron

On Tue, Aug 01, 2006 at 10:31:10PM -0700, David Miller wrote:
 > From: David Miller <davem@davemloft.net>
 > Date: Tue, 01 Aug 2006 22:05:38 -0700 (PDT)
 > 
 > > The corruption is always a 32-bit 0xffffffff followed by
 > > a 32-bit 0x00000000, 12 bytes into the object.
 > 
 > This analysis is wrong, it's "0xb0 + 12" bytes into the object
 > which is 188 bytes.  For x86, this lands us at the "count"
 > member of the tty_struct, and it shows that the tty count
 > has decremented to -1 (0xffffffff) which is a serious bug.

<stabbing in the dark here>

None of the code manipulating tty->count seems to be under
the tty_mutex.  Should it be ?
Or is this protected through some other means?

Jason, ISTR you've done some digging around this area wrt races,
any insight ?

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02 22:23     ` Dave Jones
@ 2006-08-02 22:49       ` David Miller
  2006-08-03 17:40         ` Alan Cox
  2006-08-02 23:14       ` Alan Cox
  1 sibling, 1 reply; 14+ messages in thread
From: David Miller @ 2006-08-02 22:49 UTC (permalink / raw)
  To: davej; +Cc: linux-kernel, jbaron

From: Dave Jones <davej@redhat.com>
Date: Wed, 2 Aug 2006 18:23:21 -0400

> None of the code manipulating tty->count seems to be under
> the tty_mutex.  Should it be ?
> Or is this protected through some other means?

It is in the primary code paths at least, all callers of init_dev()
(which increments tty->count) grab the mutex and also release_dev()
grabs the mutex around tty->count manipulations.

I'm surprised that when this triggers we don't get one of these
two messages:

	if (pty_master) {
		if (--o_tty->count < 0) {
			printk(KERN_WARNING "release_dev: bad pty slave count "
					    "(%d) for %s\n",
			       o_tty->count, tty_name(o_tty, buf));
			o_tty->count = 0;
		}
	}
	if (--tty->count < 0) {
		printk(KERN_WARNING "release_dev: bad tty->count (%d) for %s\n",
		       tty->count, tty_name(tty, buf));
		tty->count = 0;
	}

However, there seems to be some kind of dependency of TTY opennings
holding the BKL, as least as far as this comment on con_close() is
concerned:

		/*
		 * tty_mutex is released, but we still hold BKL, so there is
		 * still exclusion against init_dev()
		 */

But it is not clear to me that tty_open() and ptmx_open() always run
with the BKL held.  chrdev_open() wraps the ->open call with the BKL
held, but then it plugs in the device's fops which should allow a
direct filp->fops->open() call from the VFS layer without the BKL
grabbing right?

chrdev_open() should catch /dev/foo char device opens, but what about
the sysfs instances?  They might bypass this path too somehow, thus
another case where the BKL won't be held on open().

Hmmm...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02 22:49       ` David Miller
@ 2006-08-03 17:40         ` Alan Cox
  2006-08-03 17:56           ` Dave Jones
  0 siblings, 1 reply; 14+ messages in thread
From: Alan Cox @ 2006-08-03 17:40 UTC (permalink / raw)
  To: David Miller; +Cc: davej, linux-kernel, jbaron

Ar Mer, 2006-08-02 am 15:49 -0700, ysgrifennodd David Miller:
> > None of the code manipulating tty->count seems to be under
> > the tty_mutex.  Should it be ?
> > Or is this protected through some other means?
> 
> It is in the primary code paths at least, all callers of init_dev()
> (which increments tty->count) grab the mutex and also release_dev()
> grabs the mutex around tty->count manipulations.

I've been auditing tty code and its joyously bad but only in harmless
places so far except for one.

init_dev (and caller) relies on tty_mutex to ensure that the
driver->ttys list is protected from things going away.

release_mem() removes stuff from the said list and frees memory. It is
not always called under tty_mutex and that appears very dubious to me at
the moment although tty->closing and the BKL *might* be sufficient.

Alan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-03 17:40         ` Alan Cox
@ 2006-08-03 17:56           ` Dave Jones
  2006-08-03 20:48             ` Alan Cox
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Jones @ 2006-08-03 17:56 UTC (permalink / raw)
  To: Alan Cox; +Cc: David Miller, linux-kernel, jbaron

On Thu, Aug 03, 2006 at 06:40:42PM +0100, Alan Cox wrote:
 > Ar Mer, 2006-08-02 am 15:49 -0700, ysgrifennodd David Miller:
 > > > None of the code manipulating tty->count seems to be under
 > > > the tty_mutex.  Should it be ?
 > > > Or is this protected through some other means?
 > > 
 > > It is in the primary code paths at least, all callers of init_dev()
 > > (which increments tty->count) grab the mutex and also release_dev()
 > > grabs the mutex around tty->count manipulations.
 > 
 > I've been auditing tty code and its joyously bad but only in harmless
 > places so far except for one.
 > 
 > init_dev (and caller) relies on tty_mutex to ensure that the
 > driver->ttys list is protected from things going away.
 > 
 > release_mem() removes stuff from the said list and frees memory. It is
 > not always called under tty_mutex and that appears very dubious to me at
 > the moment although tty->closing and the BKL *might* be sufficient.

Against my better judgment I was poring over that code until the wee
hours last night, and one thing crossed my mind re: the assumptions made
about the BKL in that subsystem.  Now that the BKL is preemtible, do
any of those assumptions break ?

		Dave


-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-03 17:56           ` Dave Jones
@ 2006-08-03 20:48             ` Alan Cox
  0 siblings, 0 replies; 14+ messages in thread
From: Alan Cox @ 2006-08-03 20:48 UTC (permalink / raw)
  To: Dave Jones; +Cc: David Miller, linux-kernel, jbaron

Ar Iau, 2006-08-03 am 13:56 -0400, ysgrifennodd Dave Jones:
> Against my better judgment I was poring over that code until the wee
> hours last night, and one thing crossed my mind re: the assumptions made
> about the BKL in that subsystem.  Now that the BKL is preemtible, do
> any of those assumptions break ?

>From the walking of the code so far I think quite a few of them were
already broken. I'm still annotating and I hope by the end of this
evening I'll have a patch to post that documents the current state of
affairs better, including some gems 8)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: frequent slab corruption (since a long time)
  2006-08-02 22:23     ` Dave Jones
  2006-08-02 22:49       ` David Miller
@ 2006-08-02 23:14       ` Alan Cox
  1 sibling, 0 replies; 14+ messages in thread
From: Alan Cox @ 2006-08-02 23:14 UTC (permalink / raw)
  To: Dave Jones; +Cc: David Miller, linux-kernel, jbaron

Ar Mer, 2006-08-02 am 18:23 -0400, ysgrifennodd Dave Jones:
> None of the code manipulating tty->count seems to be under
> the tty_mutex.  Should it be ?
> Or is this protected through some other means?

Its old lock_kernel() code so its a prime suspect for most offences.

Given the age of the reports it appears that the tty buffering changes
are too new. The ldisc locking changes are a candidate but shouldn't be
doing anything that breaks the kernel lock stuff. Will look tomorrow see
if anything strikes me about that lot.

The tty_mutex primarily protected current->signal.tty (except if you are
using SELinux which has some extremely dubious looking code for tty
handling). Someone ought to review flush_unauthorized_files() although
it doesn't fit this problem.

Alan

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2006-08-03 20:29 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-02  2:16 frequent slab corruption (since a long time) Dave Jones
2006-08-02  2:34 ` Roland Dreier
2006-08-02  3:37 ` Andi Kleen
2006-08-02  4:22   ` Dave Jones
2006-08-02  4:35     ` Andi Kleen
2006-08-02  4:46       ` Dave Jones
2006-08-02  5:05 ` David Miller
2006-08-02  5:31   ` David Miller
2006-08-02 22:23     ` Dave Jones
2006-08-02 22:49       ` David Miller
2006-08-03 17:40         ` Alan Cox
2006-08-03 17:56           ` Dave Jones
2006-08-03 20:48             ` Alan Cox
2006-08-02 23:14       ` Alan Cox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox