Re: Fw: Re: oops in choose

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Fw: Re: oops in choose_configuration()
@ 2006-03-08  3:29 Chuck Ebbert
  2006-03-08  3:48 ` Dmitry Torokhov
  2006-03-08  4:01 ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread
From: Chuck Ebbert @ 2006-03-08  3:29 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-kernel, Ingo Molnar, Andrew Morton, Linus Torvalds,
	Dmitry Torokhov

In-Reply-To: <20060308005455.GA23921@kroah.com>

On Tue, 7 Mar 2006 16:54:55 -0800, Greg KH wrote:

> On Tue, Mar 07, 2006 at 04:54:24PM -0500, Chuck Ebbert wrote:
> > At least one susbsystem rolls its own method of adding env vars to the
> > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> > Start at drivers/input/input.c::input_dev_uevent() and watch the fun.
> 
> All of the INPUT_ADD_HOTPLUG_VAR() calls do use add_uevent_var(), so we
> should be safe there.  The other calls also look safe, if not a bit
> wierd...  So I don't see how we could change this to be any safer, do
> you?

input.c line 747+ was recently added and caused the error message:

[1]=>   envp[i++] = buffer + len;
[2]=>   len += snprintf(buffer + len, buffer_size - len, "MODALIAS=");
[2]=>   len += print_modalias(buffer + len, buffer_size - len, dev) + 1;

[1]=>   envp[i] = NULL;
        return 0;

[1] What is checking for enough space here?  This didn't overflow AFAICT
but it could.

[2] The snprintf() and print_modalias() calls don't check for errors and
thus don't return -ENOMEM when the buffer does fill up.  Shouldn't they
do that instead of returning a truncated env string?  Only the final
sanity test in snprintf() keeps them from overrunning the buffer.
(It was print_modalias_bits() that actually caused the overflow.)

> Are you still seeing problems now?

Wasn't me, it was Horst Schirmeier <horst@schirmeier.com>

-- 
Chuck
"Penguins don't come from next door, they come from the Antarctic!"

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  3:29 Fw: Re: oops in choose_configuration() Chuck Ebbert
@ 2006-03-08  3:48 ` Dmitry Torokhov
  2006-03-08  4:01 ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Dmitry Torokhov @ 2006-03-08  3:48 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Greg KH, linux-kernel, Ingo Molnar, Andrew Morton, Linus Torvalds

On Tuesday 07 March 2006 22:29, Chuck Ebbert wrote:
> In-Reply-To: <20060308005455.GA23921@kroah.com>
> 
> On Tue, 7 Mar 2006 16:54:55 -0800, Greg KH wrote:
> 
> > On Tue, Mar 07, 2006 at 04:54:24PM -0500, Chuck Ebbert wrote:
> > > At least one susbsystem rolls its own method of adding env vars to the
> > > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> > > Start at drivers/input/input.c::input_dev_uevent() and watch the fun.
> > 
> > All of the INPUT_ADD_HOTPLUG_VAR() calls do use add_uevent_var(), so we
> > should be safe there.  The other calls also look safe, if not a bit
> > wierd...  So I don't see how we could change this to be any safer, do
> > you?
> 
> input.c line 747+ was recently added and caused the error message:
> 
> [1]=>   envp[i++] = buffer + len;
> [2]=>   len += snprintf(buffer + len, buffer_size - len, "MODALIAS=");
> [2]=>   len += print_modalias(buffer + len, buffer_size - len, dev) + 1;
> 
> [1]=>   envp[i] = NULL;
>         return 0;
> 
> [1] What is checking for enough space here?  This didn't overflow AFAICT
> but it could.
> 
> [2] The snprintf() and print_modalias() calls don't check for errors and
> thus don't return -ENOMEM when the buffer does fill up.  Shouldn't they
> do that instead of returning a truncated env string?  Only the final
> sanity test in snprintf() keeps them from overrunning the buffer.
> (It was print_modalias_bits() that actually caused the overflow.)
> 

I agree with all of the above, it will be fixed.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  3:29 Fw: Re: oops in choose_configuration() Chuck Ebbert
  2006-03-08  3:48 ` Dmitry Torokhov
@ 2006-03-08  4:01 ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Linus Torvalds @ 2006-03-08  4:01 UTC (permalink / raw)
  To: Chuck Ebbert
  Cc: Greg KH, linux-kernel, Ingo Molnar, Andrew Morton,
	Dmitry Torokhov

On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> 
> [2] The snprintf() and print_modalias() calls don't check for errors and
> thus don't return -ENOMEM when the buffer does fill up.  Shouldn't they
> do that instead of returning a truncated env string?

They try to act like the standard says "snprintf()" should act.

And yes, the standard says to return the number of bytes you _would_ have
written, had not the buffer been to small.

(Of course, giving a negative buffer length is not ok, and the kernel 
version checking for that is a kernel extension on the standard. In the 
standard, the buffer size is a "size_t", which doesn't have the notion of 
"negative", since it's an unsigned type. The kernel version is just being 
safe and nice).

		Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
@ 2006-03-07 21:54 Chuck Ebbert
  2006-03-08  0:54 ` Greg KH
  2006-03-08  0:57 ` Linus Torvalds
  0 siblings, 2 replies; 28+ messages in thread
From: Chuck Ebbert @ 2006-03-07 21:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andrew Morton, Greg KH, Ingo Molnar, linux-kernel

In-Reply-To: <Pine.LNX.4.64.0603051840280.13139@g5.osdl.org>

On Sun, 5 Mar 2006 19:27:53 -0800, Linus Torvalds wrote:

> So I'd be more inclined to blame a buffer overflow on a kmalloc, and the 
> obvious target is the "add_uevent_var()" thing, since all/many of the 
> corruptions seem to come from uevent environment variable strings.

At least one susbsystem rolls its own method of adding env vars to the
uevent buffer, and it's so broken it triggers the WARN_ON() in
lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
Start at drivers/input/input.c::input_dev_uevent() and watch the fun.

I reported this to linux-kernel, the input maintainer and the author
of that code on Feb. 26:

        http://lkml.org/lkml/2006/2/26/39

-- 
Chuck
"Penguins don't come from next door, they come from the Antarctic!"

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-07 21:54 Chuck Ebbert
@ 2006-03-08  0:54 ` Greg KH
  2006-03-08  0:57 ` Linus Torvalds
  1 sibling, 0 replies; 28+ messages in thread
From: Greg KH @ 2006-03-08  0:54 UTC (permalink / raw)
  To: Chuck Ebbert; +Cc: Linus Torvalds, Andrew Morton, Ingo Molnar, linux-kernel

On Tue, Mar 07, 2006 at 04:54:24PM -0500, Chuck Ebbert wrote:
> In-Reply-To: <Pine.LNX.4.64.0603051840280.13139@g5.osdl.org>
> 
> On Sun, 5 Mar 2006 19:27:53 -0800, Linus Torvalds wrote:
> 
> > So I'd be more inclined to blame a buffer overflow on a kmalloc, and the 
> > obvious target is the "add_uevent_var()" thing, since all/many of the 
> > corruptions seem to come from uevent environment variable strings.
> 
> At least one susbsystem rolls its own method of adding env vars to the
> uevent buffer, and it's so broken it triggers the WARN_ON() in
> lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> Start at drivers/input/input.c::input_dev_uevent() and watch the fun.

All of the INPUT_ADD_HOTPLUG_VAR() calls do use add_uevent_var(), so we
should be safe there.  The other calls also look safe, if not a bit
wierd...  So I don't see how we could change this to be any safer, do
you?

> I reported this to linux-kernel, the input maintainer and the author
> of that code on Feb. 26:
> 
>         http://lkml.org/lkml/2006/2/26/39

We should have fixed that already by increasing the size of the buffer,
but yes, we should catch errors in the MODALIAS function, that would
have stopped that previous overflow.  Are you still seeing problems now?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-07 21:54 Chuck Ebbert
  2006-03-08  0:54 ` Greg KH
@ 2006-03-08  0:57 ` Linus Torvalds
  2006-03-08  1:13   ` Dmitry Torokhov
  2006-03-08  4:28   ` Joe Korty
  1 sibling, 2 replies; 28+ messages in thread
From: Linus Torvalds @ 2006-03-08  0:57 UTC (permalink / raw)
  To: Chuck Ebbert, Dmitry Torokhov
  Cc: Andrew Morton, Greg KH, Ingo Molnar, linux-kernel

On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> 
> At least one susbsystem rolls its own method of adding env vars to the
> uevent buffer, and it's so broken it triggers the WARN_ON() in
> lib/vsprintf.c::vsnprintf() by passing a negative length to that function.

Well, snprintf() should be safe, though. It will warn if the caller is 
lazy, but these days, the thing does

	max(buf_size - len, 0)

which should mean that the input layer passes in 0 instead of a negative 
number. And snprintf() will then _not_ print anything. 

So I think input_add_uevent_bm_var() is safe, even if it's not pretty.

However, input_devices_read() doesn't do any sanity checking at all, and 
if that ever ends up printing more than a page, that would be bad. I 
didn't look very closely, but it looks worrisome.

Dmitry?

		Linus

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  0:57 ` Linus Torvalds
@ 2006-03-08  1:13   ` Dmitry Torokhov
  2006-03-08  1:27     ` Greg KH
  2006-03-08  4:28   ` Joe Korty
  1 sibling, 1 reply; 28+ messages in thread
From: Dmitry Torokhov @ 2006-03-08  1:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Andrew Morton, Greg KH, Ingo Molnar, linux-kernel

On Tuesday 07 March 2006 19:57, Linus Torvalds wrote:
> 
> On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> > 
> > At least one susbsystem rolls its own method of adding env vars to the
> > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> 
> Well, snprintf() should be safe, though. It will warn if the caller is 
> lazy, but these days, the thing does
> 
> 	max(buf_size - len, 0)
> 
> which should mean that the input layer passes in 0 instead of a negative 
> number. And snprintf() will then _not_ print anything. 
> 
> So I think input_add_uevent_bm_var() is safe, even if it's not pretty.
> 
> However, input_devices_read() doesn't do any sanity checking at all, and 
> if that ever ends up printing more than a page, that would be bad. I 
> didn't look very closely, but it looks worrisome.
> 
> Dmitry?

I had all this code converted to seq_file, but it depends on converting
input handlers to class interfaces and it is not possible nowadays
because with latest Greg's changes to class code we would try to
register class devices while registering class devices/interfaces
(psmouse creates input_dev which binds to mousedev interface which in
turn tries to create mouseX which is also belongs to input class) and
deadlocking. Greg promised current implementation is only a temporary
solution.

I suppose I could separate those changes...

BTW, what is ETA for 2.6.16 - I could not affort enought time getting
psmouse resync logic working in all cases so I'd like to have it disabled
for 2.6.16 final.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  1:13   ` Dmitry Torokhov
@ 2006-03-08  1:27     ` Greg KH
  2006-03-08  3:22       ` Dmitry Torokhov
  0 siblings, 1 reply; 28+ messages in thread
From: Greg KH @ 2006-03-08  1:27 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Ingo Molnar,
	linux-kernel

On Tue, Mar 07, 2006 at 08:13:28PM -0500, Dmitry Torokhov wrote:
> On Tuesday 07 March 2006 19:57, Linus Torvalds wrote:
> > 
> > On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> > > 
> > > At least one susbsystem rolls its own method of adding env vars to the
> > > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> > 
> > Well, snprintf() should be safe, though. It will warn if the caller is 
> > lazy, but these days, the thing does
> > 
> > 	max(buf_size - len, 0)
> > 
> > which should mean that the input layer passes in 0 instead of a negative 
> > number. And snprintf() will then _not_ print anything. 
> > 
> > So I think input_add_uevent_bm_var() is safe, even if it's not pretty.
> > 
> > However, input_devices_read() doesn't do any sanity checking at all, and 
> > if that ever ends up printing more than a page, that would be bad. I 
> > didn't look very closely, but it looks worrisome.
> > 
> > Dmitry?
> 
> I had all this code converted to seq_file, but it depends on converting
> input handlers to class interfaces and it is not possible nowadays
> because with latest Greg's changes to class code we would try to
> register class devices while registering class devices/interfaces
> (psmouse creates input_dev which binds to mousedev interface which in
> turn tries to create mouseX which is also belongs to input class) and
> deadlocking. Greg promised current implementation is only a temporary
> solution.
> 
> I suppose I could separate those changes...

That would probably be a good idea :)

Oh, and I have been working on makeing the correct fix, you can see the
initial try of it at:
	http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/device-class.patch
if you are really brave :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  1:27     ` Greg KH
@ 2006-03-08  3:22       ` Dmitry Torokhov
  2006-03-08  5:23         ` Greg KH
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Torokhov @ 2006-03-08  3:22 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Ingo Molnar,
	linux-kernel

On Tuesday 07 March 2006 20:27, Greg KH wrote:
> On Tue, Mar 07, 2006 at 08:13:28PM -0500, Dmitry Torokhov wrote:
> > On Tuesday 07 March 2006 19:57, Linus Torvalds wrote:
> > > 
> > > On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> > > > 
> > > > At least one susbsystem rolls its own method of adding env vars to the
> > > > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > > > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> > > 
> > > Well, snprintf() should be safe, though. It will warn if the caller is 
> > > lazy, but these days, the thing does
> > > 
> > > 	max(buf_size - len, 0)
> > > 
> > > which should mean that the input layer passes in 0 instead of a negative 
> > > number. And snprintf() will then _not_ print anything. 
> > > 
> > > So I think input_add_uevent_bm_var() is safe, even if it's not pretty.
> > > 
> > > However, input_devices_read() doesn't do any sanity checking at all, and 
> > > if that ever ends up printing more than a page, that would be bad. I 
> > > didn't look very closely, but it looks worrisome.
> > > 
> > > Dmitry?
> > 
> > I had all this code converted to seq_file, but it depends on converting
> > input handlers to class interfaces and it is not possible nowadays
> > because with latest Greg's changes to class code we would try to
> > register class devices while registering class devices/interfaces
> > (psmouse creates input_dev which binds to mousedev interface which in
> > turn tries to create mouseX which is also belongs to input class) and
> > deadlocking. Greg promised current implementation is only a temporary
> > solution.
> > 
> > I suppose I could separate those changes...
> 
> That would probably be a good idea :)
> 

Hmm, what is the policy for attr->show()? With hotplug variables we
return -ENOMEM if there is not enough memory to store all data, but
what about attributes? Should we also return error (and which one,
-ENOMEM, -ENOBUFS?) or fill as much as we can and return up to
PAGE_SIZE? With sysfs not kernel nor application can really recover
if attribute needs buffer larger than a page. Or just rely on BUG_ON
in fs/sysfs/file.c::fill_read_buffer()?

-- 
Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  3:22       ` Dmitry Torokhov
@ 2006-03-08  5:23         ` Greg KH
  2006-03-08  6:06           ` Dmitry Torokhov
  0 siblings, 1 reply; 28+ messages in thread
From: Greg KH @ 2006-03-08  5:23 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Ingo Molnar,
	linux-kernel

On Tue, Mar 07, 2006 at 10:22:10PM -0500, Dmitry Torokhov wrote:
> On Tuesday 07 March 2006 20:27, Greg KH wrote:
> > On Tue, Mar 07, 2006 at 08:13:28PM -0500, Dmitry Torokhov wrote:
> > > On Tuesday 07 March 2006 19:57, Linus Torvalds wrote:
> > > > 
> > > > On Tue, 7 Mar 2006, Chuck Ebbert wrote:
> > > > > 
> > > > > At least one susbsystem rolls its own method of adding env vars to the
> > > > > uevent buffer, and it's so broken it triggers the WARN_ON() in
> > > > > lib/vsprintf.c::vsnprintf() by passing a negative length to that function.
> > > > 
> > > > Well, snprintf() should be safe, though. It will warn if the caller is 
> > > > lazy, but these days, the thing does
> > > > 
> > > > 	max(buf_size - len, 0)
> > > > 
> > > > which should mean that the input layer passes in 0 instead of a negative 
> > > > number. And snprintf() will then _not_ print anything. 
> > > > 
> > > > So I think input_add_uevent_bm_var() is safe, even if it's not pretty.
> > > > 
> > > > However, input_devices_read() doesn't do any sanity checking at all, and 
> > > > if that ever ends up printing more than a page, that would be bad. I 
> > > > didn't look very closely, but it looks worrisome.
> > > > 
> > > > Dmitry?
> > > 
> > > I had all this code converted to seq_file, but it depends on converting
> > > input handlers to class interfaces and it is not possible nowadays
> > > because with latest Greg's changes to class code we would try to
> > > register class devices while registering class devices/interfaces
> > > (psmouse creates input_dev which binds to mousedev interface which in
> > > turn tries to create mouseX which is also belongs to input class) and
> > > deadlocking. Greg promised current implementation is only a temporary
> > > solution.
> > > 
> > > I suppose I could separate those changes...
> > 
> > That would probably be a good idea :)
> > 
> 
> Hmm, what is the policy for attr->show()? With hotplug variables we
> return -ENOMEM if there is not enough memory to store all data, but
> what about attributes? Should we also return error (and which one,
> -ENOMEM, -ENOBUFS?) or fill as much as we can and return up to
> PAGE_SIZE?

Remember, sysfs files are supposed to be small, you are an "oddity" in
that you have a much larger buffer that you can return due to the wierd
aliases you have.

Truncating the buffer is probably good as we want userspace to get some
information, right?

> With sysfs not kernel nor application can really recover
> if attribute needs buffer larger than a page. Or just rely on BUG_ON
> in fs/sysfs/file.c::fill_read_buffer()?

How about just making this a binary attribute, then you can handle an
arbitrary size buffer and don't have to worry about the PAGE_SIZE stuff
(but it makes it more code that you have to write to handle it all,
there are tradeoffs...)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  5:23         ` Greg KH
@ 2006-03-08  6:06           ` Dmitry Torokhov
  2006-03-08  6:15             ` Greg KH
  0 siblings, 1 reply; 28+ messages in thread
From: Dmitry Torokhov @ 2006-03-08  6:06 UTC (permalink / raw)
  To: Greg KH
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Ingo Molnar,
	linux-kernel

On Wednesday 08 March 2006 00:23, Greg KH wrote:
> > 
> > Hmm, what is the policy for attr->show()? With hotplug variables we
> > return -ENOMEM if there is not enough memory to store all data, but
> > what about attributes? Should we also return error (and which one,
> > -ENOMEM, -ENOBUFS?) or fill as much as we can and return up to
> > PAGE_SIZE?
> 
> Remember, sysfs files are supposed to be small, you are an "oddity" in
> that you have a much larger buffer that you can return due to the wierd
> aliases you have.
> 
> Truncating the buffer is probably good as we want userspace to get some
> information, right?
> 
> > With sysfs not kernel nor application can really recover
> > if attribute needs buffer larger than a page. Or just rely on BUG_ON
> > in fs/sysfs/file.c::fill_read_buffer()?
> 
> How about just making this a binary attribute, then you can handle an
> arbitrary size buffer and don't have to worry about the PAGE_SIZE stuff
> (but it makes it more code that you have to write to handle it all,
> there are tradeoffs...)
> 

I really don't believe that we ever going to cross 4096 boundary for any
single input attribute, but just to code defensively we need to decide
what to do if we ever encounter a crazy device. Just truncating to
PAGE_SIZE is easiest so that's what I am going to do.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  6:06           ` Dmitry Torokhov
@ 2006-03-08  6:15             ` Greg KH
  0 siblings, 0 replies; 28+ messages in thread
From: Greg KH @ 2006-03-08  6:15 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Ingo Molnar,
	linux-kernel

On Wed, Mar 08, 2006 at 01:06:59AM -0500, Dmitry Torokhov wrote:
> On Wednesday 08 March 2006 00:23, Greg KH wrote:
> > > 
> > > Hmm, what is the policy for attr->show()? With hotplug variables we
> > > return -ENOMEM if there is not enough memory to store all data, but
> > > what about attributes? Should we also return error (and which one,
> > > -ENOMEM, -ENOBUFS?) or fill as much as we can and return up to
> > > PAGE_SIZE?
> > 
> > Remember, sysfs files are supposed to be small, you are an "oddity" in
> > that you have a much larger buffer that you can return due to the wierd
> > aliases you have.
> > 
> > Truncating the buffer is probably good as we want userspace to get some
> > information, right?
> > 
> > > With sysfs not kernel nor application can really recover
> > > if attribute needs buffer larger than a page. Or just rely on BUG_ON
> > > in fs/sysfs/file.c::fill_read_buffer()?
> > 
> > How about just making this a binary attribute, then you can handle an
> > arbitrary size buffer and don't have to worry about the PAGE_SIZE stuff
> > (but it makes it more code that you have to write to handle it all,
> > there are tradeoffs...)
> > 
> 
> I really don't believe that we ever going to cross 4096 boundary for any
> single input attribute, but just to code defensively we need to decide
> what to do if we ever encounter a crazy device. Just truncating to
> PAGE_SIZE is easiest so that's what I am going to do.

Ok, that's fine with me.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  0:57 ` Linus Torvalds
  2006-03-08  1:13   ` Dmitry Torokhov
@ 2006-03-08  4:28   ` Joe Korty
  2006-03-08  4:44     ` Dmitry Torokhov
  1 sibling, 1 reply; 28+ messages in thread
From: Joe Korty @ 2006-03-08  4:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Chuck Ebbert, Dmitry Torokhov, Andrew Morton, Greg KH,
	Ingo Molnar, linux-kernel

On Tue, Mar 07, 2006 at 04:57:39PM -0800, Linus Torvalds wrote:

> Well, snprintf() should be safe, though. It will warn if the caller is 
> lazy, but these days, the thing does
> 
> 	max(buf_size - len, 0)
> 
> which should mean that the input layer passes in 0 instead of a negative 
> number. And snprintf() will then _not_ print anything. 

I assume this is a typo, and you meant scnprintf?  AFAIK, snprintf has
the same ol' bad behavior when #bytes-to-be-written > #bytes-in-buffer.

Joe

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  4:28   ` Joe Korty
@ 2006-03-08  4:44     ` Dmitry Torokhov
  0 siblings, 0 replies; 28+ messages in thread
From: Dmitry Torokhov @ 2006-03-08  4:44 UTC (permalink / raw)
  To: joe.korty
  Cc: Linus Torvalds, Chuck Ebbert, Andrew Morton, Greg KH, Ingo Molnar,
	linux-kernel

On Tuesday 07 March 2006 23:28, Joe Korty wrote:
> On Tue, Mar 07, 2006 at 04:57:39PM -0800, Linus Torvalds wrote:
> 
> > Well, snprintf() should be safe, though. It will warn if the caller is 
> > lazy, but these days, the thing does
> > 
> > 	max(buf_size - len, 0)
> > 
> > which should mean that the input layer passes in 0 instead of a negative 
> > number. And snprintf() will then _not_ print anything. 
> 
> I assume this is a typo, and you meant scnprintf?  AFAIK, snprintf has
> the same ol' bad behavior when #bytes-to-be-written > #bytes-in-buffer.
> 

No, we do want to know if output was truncated or not so snprintf is used.

-- 
Dmitry

^ permalink raw reply	[flat|nested] 28+ messages in thread

[parent not found: <20060304121723.19fe9b4b.akpm@osdl.org>]

[parent not found: <Pine.LNX.4.64.0603041235110.22647@g5.osdl.org>]

[parent not found: <20060304213447.GA4445@kroah.com>]

[parent not found: <20060304135138.613021bd.akpm@osdl.org>]

[parent not found: <20060304221810.GA20011@kroah.com>]

* Re: Fw: Re: oops in choose_configuration()
       [not found]       ` <20060304221810.GA20011@kroah.com>
@ 2006-03-05 23:48         ` Andrew Morton
  2006-03-06  3:27           ` Linus Torvalds
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Andrew Morton @ 2006-03-05 23:48 UTC (permalink / raw)
  To: Greg KH; +Cc: torvalds, Ingo Molnar, linux-kernel


Might as well cc the list on this, so there's a record...

For several days I've been getting repeatable oopses in the -mm kernel. 
They occur once per ~30 boots, during initscripts.

Two days of bisection searching shows that when this scheduler patch

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.16-rc5/2.6.16-rc5-mm2/broken-out/sched-alter_uninterruptible_sleep_interactivity.patch

is included, the oopses start occurring.

I don't think the scheduler patch is the bug - I guess it just alters
timing.  The pipe code and the exec code are repeatedly implicated in the
oopses, and it's noteworthy that
sched-make_task_noninteractive_use_sleep_type.patch, which affects the pipe
wakeup handling does _not_ need to be present for this bug to manifest.

I tested current -linus with just these patches applied:

sched-fix-task-interactivity-calculation
sched-implement-smpnice
sched-smpnice-apply-review-suggestions
sched-smpnice-fix-average-load-per-run-queue-calculations
sched-cleanup_task_activated
sched-alter_uninterruptible_sleep_interactivity
sched-make_task_noninteractive_use_sleep_type
sched-dont_decrease_idle_sleep_avg
sched-include_noninteractive_sleep_in_idle_detect
sched-remove-on-runqueue-requeueing

and mainline oopses too.  So either there's a bug in mainline, or there's a
bug in one of these:

sched-fix-task-interactivity-calculation
sched-implement-smpnice
sched-smpnice-apply-review-suggestions
sched-smpnice-fix-average-load-per-run-queue-calculations
sched-cleanup_task_activated
sched-alter_uninterruptible_sleep_interactivity


The usual pattern of the oopses is that something has been overwritten with
some data which has been used as an environment string by the uevent code.

Example:

SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
Unable to handle kernel paging request at virtual address 44535955
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
Modules linked in: nvram ipw2200 ohci1394 ide_cd cdrom ieee1394 uhci_hcd sg ehci_hcd ieee80211 ieee80211_crypt joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event i2c_i801 snd_seq i2c_core snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore piix hw_random snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c0242dd3>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-rc5-mm2 #298) 
EIP is at choose_configuration+0x58/0x13a
eax: 000000d2   ebx: f7724c00   ecx: 44535950   edx: 000001f4
esi: f77e3600   edi: 00000001   ebp: 00000000   esp: c1d7def8
ds: 007b   es: 007b   ss: 0068
Process khubd (pid: 114, threadinfo=c1d7d000 task=c1d7ca90)
Stack: <0>00000000 0000d000 f7724c00 f7724c00 00000000 00000001 c0242f68 00000000 
       f7724c00 c1d410c0 c02440b8 00000000 0000000a f7c71214 f7055000 00000010 
       c1d410c0 00000001 f7055000 00000001 c024450f 00000001 00000000 f7c71200 
Call Trace:
 [<c0242f68>] usb_new_device+0xb3/0x129
 [<c02440b8>] hub_port_connect_change+0x1ad/0x343
 [<c024450f>] hub_events+0x2c1/0x3f1
 [<c024463f>] hub_thread+0x0/0xe6
 [<c0244649>] hub_thread+0xa/0xe6
 [<c0129bb3>] autoremove_wake_function+0x0/0x2d
 [<c0129bb3>] autoremove_wake_function+0x0/0x2d
 [<c01298a8>] kthread+0x9c/0xa1
 [<c012980c>] kthread+0x0/0xa1
 [<c01012e1>] kernel_thread_helper+0x5/0xb
Code: 00 58 39 fd c7 04 24 00 00 00 00 7d 53 0f b6 46 08 8b 8e 90 00 00 00 0f b7 93 18 02 00 00 01 c0 83 c1 08 39 d0 7f 2e 85 ed 75 0a <80> 79 05 02 0f 84 bf 00 00 00 80 bb 7c 01 00 00 ff 74 0a 80 79 
 <7>ieee1394: Host added: ID:BUS[0-00:1023]  GUID[0800460300efc55d]

Here, the faulting address is ascii "UYSD".  Ignore the "D" because it was
offset a bit by pointer indexing.  That YSD appears in
drivers/base/core.c:dev_uevent().


Previously I've seen corruption of the local linux_binprm storage in
do_execve().  Often I've seen oopses like this:


audit(1141477276.672:3): policy loaded auid=4294967295
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
 printing eip:
*pde = 00000000
Oops: 0002 [#1]
last sysfs file: /block/hda/removable
Modules linked in: ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd uhci_hcd ieee80211 sg ieee80211_crypt joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer snd i2c_core piix soundcore hw_random snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c0129011>]    Not tainted VLI
EFLAGS: 00010046   (2.6.16-rc5-mm2 #287) 
EIP is at add_wait_queue+0xd/0x1d
eax: f76dc000   ebx: 00000000   ecx: f768a018   edx: f768a00c
esi: 00000246   edi: f7f6b200   ebp: f7106764   esp: f7763ea0
ds: 007b   es: 007b   ss: 0068
Process udevd (pid: 1005, threadinfo=f7763000 task=f7729ab0)
Stack: <0>f76dc000 f7763f0c c015e6a7 f7f6b200 00000000 00000008 00000145 c0163c6e 
       00000000 00000000 00000000 00000003 00000008 00000000 00000000 00000008 
       f705bf6c f705bf68 f705bf64 f705bf74 f705bf70 f705bf6c 00000003 f7763f0c 
Call Trace:
 <c015e6a7> pipe_poll+0x24/0x94   <c0163c6e> do_select+0x2d1/0x2d5
 <c016385f> __pollwait+0x0/0x94   <c0163e49> core_sys_select+0x1c4/0x2f8
 <c0164007> sys_select+0x8a/0x18f   <c0102a9b> sysenter_past_esp+0x54/0x75
Code: 11 c0 e8 af e1 ff ff eb c2 89 d8 e8 28 dd fe ff eb af 8d 06 e8 55 72 19 00 eb a6 90 90 90 56 53 83 22 fe 9c 5e fa 8b 18 8d 4a 0c <89> 4b 04 89 5a 0c 89 41 04 89 08 56 9d 5b 5e c3 56 53 83 0a 01 
 <1>BUG: unable to handle kernel paging request at virtual address 464c457f
 printing eip:
c0117418
*pde = 3e427067
Oops: 0000 [#2]
last sysfs file: /class/firmware/0000:06:0b.0/loading
Modules linked in: ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd uhci_hcd ieee80211 sg ieee80211_crypt joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer snd i2c_core piix soundcore hw_random snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c0117418>]    Not tainted VLI
EFLAGS: 00010092   (2.6.16-rc5-mm2 #287) 
EIP is at __wake_up_common+0x12/0x57
eax: 464c457f   ebx: 00000202   ecx: 00000001   edx: 00000001
esi: 00000001   edi: f76dc000   ebp: f7eb1f1c   esp: f7eb1f04
ds: 007b   es: 007b   ss: 0068
Process cat (pid: 1013, threadinfo=f7eb1000 task=f7ebe570)
Stack: <0>00000006 c0339118 00000001 00000202 c17e9f40 00000006 f7eb1f30 c011746e 
       00000000 00000000 f76dc000 00000001 c015e4e8 00000000 f76dc000 f76dc010 
       00000000 00000001 00000006 f76dc000 f7106764 f7eb1f78 f7f6b740 00000006 
Call Trace:
 <c011746e> __wake_up+0x11/0x1a   <c015e4e8> pipe_writev+0x273/0x38a
 <c015e61b> pipe_write+0x1c/0x20   <c0154386> vfs_write+0xb1/0x151
 <c01544c6> sys_write+0x3c/0x64   <c0102a9b> sysenter_past_esp+0x54/0x75
Code: c0 89 f0 e8 df f6 ff ff e9 f4 fd ff ff 55 89 e5 8b 40 04 5d e9 95 f8 ff ff 55 89 e5 57 89 c7 56 89 ce 53 83 ec 0c 89 55 f0 8b 00 <8b> 10 39 f8 89 55 ec 74 34 8d 58 f4 8b 55 f0 8b 40 f4 ff 75 0c 
 <1>BUG: unable to handle kernel paging request at virtual address 6f732e6f
 printing eip:
c0163033
*pde = 00000000
Oops: 0000 [#3]


Note that in oops #2, the faulting address is ascii "\7fELF".  We think
that kernel_read() has read an elf file on top of *(inode->i_pipe) in
pipe_writev.  Note that i_pipe and linux_binprm both use the size-512 slab,
and the offsets into that memory is 0x00 in both cases.

But the management of do_execve():bprm is super-simple.


Here's another pipe-got-overwritten-by-ELF oops:

audit(1141524105.698:3): policy loaded auid=4294967295
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
BUG: unable to handle kernel paging request at virtual address 464c457f
 printing eip:
*pde = 3eeb0067
Oops: 0000 [#1]
last sysfs file: /devices/pci0000:00/0000:00:1d.7/usb5/5-0:1.0/modalias
Modules linked in: ipw2200 ohci1394 ieee1394 ehci_hcd uhci_hcd sg ieee80211 ieee80211_crypt joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 snd_timer piix snd i2c_core soundcore hw_random snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c011741a>]    Not tainted VLI
EFLAGS: 00010092   (2.6.16-rc5-mm2 #293) 
EIP is at __wake_up_common+0x12/0x57
eax: 464c457f   ebx: 00000202   ecx: 00000001   edx: 00000001
esi: 00000001   edi: f76fde00   ebp: f7f7af1c   esp: f7f7af04
ds: 007b   es: 007b   ss: 0068
Process sh (pid: 994, threadinfo=f7f7a000 task=f776a030)
Stack: <0>0000000f c0338128 00000001 00000202 c17d5b60 0000000f f7f7af30 c0117470 
       00000000 00000000 f76fde00 00000001 c015e00c 00000000 f76fde00 f76fde10 
       00000000 00000001 0000000f f76fde00 f706b320 f7f7af78 f7702900 0000000f 
Call Trace:
 <c0117470> __wake_up+0x11/0x1a   <c015e00c> pipe_writev+0x273/0x38a
 <c015e13f> pipe_write+0x1c/0x20   <c0153eae> vfs_write+0xb1/0x151
 <c0153fee> sys_write+0x3c/0x64   <c0102a9b> sysenter_past_esp+0x54/0x75
Code: c0 89 f0 e8 d9 f6 ff ff e9 f4 fd ff ff 55 89 e5 8b 40 04 5d e9 95 f8 ff ff 55 89 e5 57 89 c7 56 89 ce 53 83 ec 0c 89 55 f0 8b 00 <8b> 10 39 f8 89 55 ec 74 34 8d 58 f4 8b 55 f0 8b 40 f4 ff 75 0c 
 <1>BUG: unable to handle kernel paging request at virtual address 0064777f


It takes maybe an hour of rebooting to reproduce this.  Kernel is uniproc,
voluntary preempt
(http://www.zip.com.au/~akpm/linux/patches/stuff/config-sony).

The next stage in brute-forcing is to start bisecting Linus's tree,
applying the sched patches at each stage.  At a few hours per step I don't
relish that.

Ingo, could you please take a real close look at the above six sched
patches, see if you can spot anything in there which might cause this?

I have a feeling that this bug has been there for a while - I have seen a
couple of funny oopses booting this machine over the past couple of months.
Those sched patches have been in -mm for a couple of months too.

Enabling CONFIG_SLAB_DEBUG or CONFIG_DEBUG_PAGEALLOC seems to make the bug
go away (naturally).   I'll retest that now.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-05 23:48         ` Andrew Morton
@ 2006-03-06  3:27           ` Linus Torvalds
  2006-03-06  8:48             ` Andrew Morton
  2006-03-06  5:00           ` Linus Torvalds
  2006-03-06 11:04           ` Jesper Juhl
  2 siblings, 1 reply; 28+ messages in thread
From: Linus Torvalds @ 2006-03-06  3:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg KH, Ingo Molnar, linux-kernel

Andrew,
 I'm worried about the fact that kmalloc() doesn't get redlining.

The lifetime rules for the pipe_info thing and the bprm seems totally 
obvious, and we'd get slab errors if somebody was doing something strange 
there anyway.

So I'd be more inclined to blame a buffer overflow on a kmalloc, and the 
obvious target is the "add_uevent_var()" thing, since all/many of the 
corruptions seem to come from uevent environment variable strings.

Because kmalloc() doesn't do redlining, we'd never see an overflow as an 
error, and it would just overwrite the next block. Now, they are in 
different slab blocks (the uevent strign allocation is a 2048-byte 
allocation), but maybe it flows over into the next page entirely..

Now, it all uses "vnsprintf()" which _should_ be safe, but that in turn 
uses pointer comparisons, so maybe gcc screws that up. Who knows. Gcc has 
been known to use signed comparisons on pointers and other brokenness. And 
we could just have screwed something up (not updating "len" when we update 
the buffer start etc etc)

Anyway, this trivial patch will check for buffer length consistency and 
overflow by just putting a magic value at the end of the buffer, and 
checking it. Maybe.

I don't see anything wrong there, and booting with this patch doesn't 
trigger anything for me, but it's simple enough to be worth checking out.

			Linus
---
diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 086a0c6..366214a 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -51,6 +51,9 @@ static char *action_to_string(enum kobje
 	}
 }

+#define UEV_MAGIC (0xc0edbabeu)
+#define uev_magic(buffer, len) ((unsigned int *) ((len) + (void *)(buffer)))[-1]
+
 /**
  * kobject_uevent - notify userspace by ending an uevent
  *
@@ -107,6 +110,8 @@ void kobject_uevent(struct kobject *kobj
 	if (!buffer)
 		goto exit;

+	uev_magic(buffer, BUFFER_SIZE) = UEV_MAGIC;
+
 	/* complete object path */
 	devpath = kobject_get_path(kobj, GFP_KERNEL);
 	if (!devpath)
@@ -223,6 +228,10 @@ int add_uevent_var(char **envp, int num_
 {
 	va_list args;

+	BUG_ON(buffer_size < 4);
+	BUG_ON(buffer_size > 2048);
+	BUG_ON(uev_magic(buffer, buffer_size) != UEV_MAGIC);
+
 	/*
 	 * We check against num_envp - 1 to make sure there is at
 	 * least one slot left after we return, since kobject_uevent()

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  3:27           ` Linus Torvalds
@ 2006-03-06  8:48             ` Andrew Morton
  2006-03-08  1:31               ` Greg KH
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-03-06  8:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: greg, mingo, linux-kernel

Linus Torvalds <torvalds@osdl.org> wrote:
>
> 
> Andrew,
>  I'm worried about the fact that kmalloc() doesn't get redlining.
> 
> The lifetime rules for the pipe_info thing and the bprm seems totally 
> obvious, and we'd get slab errors if somebody was doing something strange 
> there anyway.
> 
> So I'd be more inclined to blame a buffer overflow on a kmalloc, and the 
> obvious target is the "add_uevent_var()" thing, since all/many of the 
> corruptions seem to come from uevent environment variable strings.
> 
> Because kmalloc() doesn't do redlining, we'd never see an overflow as an 
> error, and it would just overwrite the next block. Now, they are in 
> different slab blocks (the uevent strign allocation is a 2048-byte 
> allocation), but maybe it flows over into the next page entirely..
> 
> Now, it all uses "vnsprintf()" which _should_ be safe, but that in turn 
> uses pointer comparisons, so maybe gcc screws that up. Who knows. Gcc has 
> been known to use signed comparisons on pointers and other brokenness. And 
> we could just have screwed something up (not updating "len" when we update 
> the buffer start etc etc)

Maybe.  Something which is as deterministic as that would trip
CONFIG_DEBUG_PAGEALLOC though.

> Anyway, this trivial patch will check for buffer length consistency and 
> overflow by just putting a magic value at the end of the buffer, and 
> checking it. Maybe.
> 
> I don't see anything wrong there, and booting with this patch doesn't 
> trigger anything for me, but it's simple enough to be worth checking out.

Yup.  The wind changed direction and I lost the ability to reproduce it for
a few hours.  Going back to exactly the rc5-mm2 lineup did the trick.  Your
debug patch didn't trigger.



SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1141604160.744:3): policy loaded auid=4294967295
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
BUG: unable to handle kernel paging request at virtual address 44535955
 printing eip:
*pde = 00000000
Oops: 0000 [#1]
last sysfs file: /class/firmware/0000:06:0b.0/loading
Modules linked in: ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd ieee80211 uhci_hcd sg joydev ieee80211_crypt snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm hw_random i2c_i801 snd_timer piix snd i2c_core soundcore snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c02437ed>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-rc5-mm2 #302) 
EIP is at choose_configuration+0x58/0x13a
eax: 000000d2   ebx: f70aac00   ecx: 44535950   edx: 000001f4
esi: f76eca00   edi: 00000001   ebp: 00000000   esp: f7f3def8
ds: 007b   es: 007b   ss: 0068
Process khubd (pid: 114, threadinfo=f7f3d000 task=f7f3cab0)
Stack: <0>00000000 0000d000 f70aac00 f70aac00 00000000 00000001 c0243a5a 00000000 
       f70aac00 c1caa3c0 c0244ba1 00000000 0000000a f76ec814 f7029000 00000011 
       c1caa3c0 00000001 f7029000 00000001 c0244ff8 00000003 00000000 f76ec800 
Call Trace:
 <c0243a5a> usb_new_device+0x15a/0x1d0   <c0244ba1> hub_port_connect_change+0x1ad/0x343
 <c0244ff8> hub_events+0x2c1/0x3f1   <c0245128> hub_thread+0x0/0xe6
 <c0245132> hub_thread+0xa/0xe6   <c0128dfb> autoremove_wake_function+0x0/0x2d
 <c0128dfb> autoremove_wake_function+0x0/0x2d   <c0128b0c> kthread+0x9c/0xa1
 <c0128a70> kthread+0x0/0xa1   <c01012ed> kernel_thread_helper+0x5/0xb
Code: 00 58 39 fd c7 04 24 00 00 00 00 7d 53 0f b6 46 08 8b 8e 90 00 00 00 0f b7 93 18 02 00 00 01 c0 83 c1 08 39 d0 7f 2e 85 ed 75 0a <80> 79 05 02 0f 84 bf 00 00 00 80 bb 7c 01 00 00 ff 74 0a 80 79 
 <6>Non-volatile memory driver v1.2
BUG: unable to handle kernel paging request at virtual address 6963702f
 printing eip:
c01bc9b0
*pde = 00000000
Oops: 0000 [#2]
last sysfs file: /class/firmware/0000:06:0b.0/loading
Modules linked in: nvram ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd ieee80211 uhci_hcd sg joydev ieee80211_crypt snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm hw_random i2c_i801 snd_timer piix snd i2c_core soundcore snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c01bc9b0>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-rc5-mm2 #302) 
EIP is at kobject_uevent+0x28d/0x37c
eax: 00000000   ebx: f76f76c0   ecx: ffffffff   edx: f7176efc
esi: 0000002a   edi: 6963702f   ebp: f77542c0   esp: f7176f04
ds: 007b   es: 007b   ss: 0068
Process knodemgrd_0 (pid: 1025, threadinfo=f7176000 task=f7788ab0)
Stack: <0>f70f1000 c034da80 00000002 f70f1051 f8cf6e4d f7097640 c02e93d8 f76e9568 
       00000246 00000000 f7758a94 f76e9568 f76e9568 f76e9560 f76e9568 f8d28c40 
       c021e752 00000000 f70473a0 00000000 f76e9400 f76e9440 f76e9560 f763c000 
Call Trace:
 <c021e752> class_device_add+0x14e/0x201   <f8cf1efe> nodemgr_create_node+0x104/0x17b [ieee1394]
 <f8cf2b1a> nodemgr_node_scan_one+0x160/0x16e [ieee1394]   <f8cf2b76> nodemgr_node_scan+0x4e/0x50 [ieee1394]
 <f8cf3277> nodemgr_host_thread+0xdc/0x14c [ieee1394]   <f8cf319b> nodemgr_host_thread+0x0/0x14c [ieee1394]
 <c01012ed> kernel_thread_helper+0x5/0xb  
Code: 24 1c 68 01 1d 2f c0 57 e8 6e 25 00 00 c7 44 24 18 02 00 00 00 83 c4 10 83 7d 08 00 74 4f 8b 44 24 08 83 c9 ff 8b 7c 85 00 31 c0 <f2> ae f7 d1 49 83 7b 64 00 8d 71 01 8b bb 9c 00 00 00 75 61 01 
 <6>ACPI: AC Adapter [ACAD] (on-line)
ACPI: Battery Slot [BAT1] (battery present)
ACPI: Lid Switch [LID0]


As is usual with this particular manifestation, we died here:

		if (i == 0 && desc->bInterfaceClass == USB_CLASS_COMM


(gdb) p sizeof(struct usb_interface_descriptor)
$3 = 9
(gdb) p sizeof(struct usb_host_config)
$4 = 280
(gdb) p sizeof(struct usb_interface_cache)
$5 = 8
(gdb) p sizeof(struct usb_host_interface)
$6 = 28
(gdb) offsetof usb_interface_descriptor bInterfaceClass
5 0x5

And that's quite different.  `struct usb_interface_cache' is only 8 bytes,
and the scribble is (as far as we know), not at offset zero.  Unless there
are a large number of `usb_host_interface's in `struct
usb_interface_cache's variable-sized array, we're not using the size-512
slab here.

<looks at choose_configuration()>

		struct usb_interface_descriptor	*desc =
				&c->intf_cache[0]->altsetting->desc;

Seems funny.

struct usb_interface_cache {
	unsigned num_altsetting;	/* number of alternate settings */
	struct kref ref;		/* reference counter */

	/* variable-length array of alternate settings for this interface,
	 * stored in no particular order */
	struct usb_host_interface altsetting[0];
};

A clearer way of coding that assignment would have been

		struct usb_interface_descriptor	*desc =
				&c->intf_cache[0]->altsetting[0].desc;

a) How come we're only considering the zeroth slot in that array in here?

b) How do we know that there's actually anything _there_?  The length of
   that variable-sized array doesn't seem to have been stored anywhere
   obvious by usb_parse_configuration() and choose_configuration() doesn't
   check.  What happens if the length was zero?

drivers/usb/core/config.c hasn't changed since October.  If it was this
easy, we'd have hit it before now.

Greg, who knows this code?  Can we triple-check it please?



Oh dear, there it goes.


SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1141605301.698:3): policy loaded auid=4294967295
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
------------[ cut here ]------------
kernel BUG at kernel/timer.c:213!
invalid opcode: 0000 [#1]
last sysfs file: /class/firmware/0000:06:0b.0/loading
Modules linked in: ide_cd cdrom ipw2200 ohci1394 ieee1394 ieee80211 ehci_hcd uhci_hcd ieee80211_crypt sg joydev snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm i2c_i801 hw_random piix snd_timer i2c_core snd soundcore snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c0120d68>]    Not tainted VLI
EFLAGS: 00010246   (2.6.16-rc5-mm2 #302) 
EIP is at __mod_timer+0x5e/0x68
eax: f70a0898   ebx: f70a0898   ecx: 00000000   edx: fffef455
esi: fffef455   edi: 00000000   ebp: f7fadc48   esp: f77e4e10
ds: 007b   es: 007b   ss: 0068
Process modprobe (pid: 613, threadinfo=f77e4000 task=f77d8570)
Stack: <0>00000000 f70a0840 f8d616e4 00000000 c022125f f706e240 f70faa0c f8d54328 
       f8d616e4 f8d54328 c0221323 00000001 f8d426d8 f70faa0c 00000000 00000003 
       f8d42a75 f8d54313 f70faa0c f70faa10 00000000 f7fadcb0 f8d4df95 f70faa0c 
Call Trace:
 <c022125f> _request_firmware+0x8f/0x14c   <c0221323> request_firmware+0x7/0x9
 <f8d426d8> ipw_get_fw+0x16/0xa0 [ipw2200]   <f8d42a75> ipw_load+0x232/0x799 [ipw2200]
 <f8d4df95> ipw_up+0x48/0x358 [ipw2200]   <f8d4e550> ipw_net_init+0x1c/0x3a [ipw2200]
 <c0271593> register_netdevice+0x4c/0x2da   <c0187433> create_files+0x23/0x44
 <c0271869> register_netdev+0x48/0x68   <f8d4e8b7> ipw_pci_probe+0x349/0x401 [ipw2200]
 <c0117484> __wake_up_locked+0x11/0x13   <c021db21> __driver_attach+0x0/0x6b
 <c01c693c> pci_call_probe+0xa/0xc   <c01c6972> __pci_device_probe+0x34/0x41
 <c01c699d> pci_device_probe+0x1e/0x32   <c021da87> driver_probe_device+0x49/0x8c
 <c021db77> __driver_attach+0x56/0x6b   <c021d2e2> bus_for_each_dev+0x44/0x59
 <c021db9d> driver_attach+0x11/0x13   <c021db21> __driver_attach+0x0/0x6b
 <c021d68e> bus_add_driver+0x57/0x81   <c021df2c> driver_register+0x70/0x75
 <c012f3bb> sys_init_module+0x66/0x15f   <c01c6b28> __pci_register_driver+0x37/0x49
 <c011a909> printk+0xe/0x11   <f8c93015> ipw_init+0x15/0x6a [ipw2200]
 <f8c93026> ipw_init+0x26/0x6a [ipw2200]   <c012f49c> sys_init_module+0x147/0x15f
 <c0102a9b> sysenter_past_esp+0x54/0x75  
Code: 20 00 a1 84 08 49 c0 39 c1 74 07 39 19 74 18 89 43 14 89 73 08 89 da e8 fb fe ff ff ff 34 24 9d 5a 89 f8 5b 5e 5f c3 89 c8 eb e7 <0f> 0b d5 00 c4 44 2e c0 eb a8 53 31 d2 8b 0d 84 08 49 c0 83 38 
 BUG: modprobe/613, lock held at task exit time!
 [c0357220] {rtnl_mutex}
.. held by:          modprobe:  613 [f77d8570, 113]
... acquired at:               register_netdev+0xb/0x68


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  8:48             ` Andrew Morton
@ 2006-03-08  1:31               ` Greg KH
  2006-03-08  1:49                 ` Andrew Morton
  0 siblings, 1 reply; 28+ messages in thread
From: Greg KH @ 2006-03-08  1:31 UTC (permalink / raw)
  To: Andrew Morton, linux-usb-devel; +Cc: Linus Torvalds, mingo, linux-kernel

On Mon, Mar 06, 2006 at 12:48:36AM -0800, Andrew Morton wrote:
> Linus Torvalds <torvalds@osdl.org> wrote:
> >
> > 
> > Andrew,
> >  I'm worried about the fact that kmalloc() doesn't get redlining.
> > 
> > The lifetime rules for the pipe_info thing and the bprm seems totally 
> > obvious, and we'd get slab errors if somebody was doing something strange 
> > there anyway.
> > 
> > So I'd be more inclined to blame a buffer overflow on a kmalloc, and the 
> > obvious target is the "add_uevent_var()" thing, since all/many of the 
> > corruptions seem to come from uevent environment variable strings.
> > 
> > Because kmalloc() doesn't do redlining, we'd never see an overflow as an 
> > error, and it would just overwrite the next block. Now, they are in 
> > different slab blocks (the uevent strign allocation is a 2048-byte 
> > allocation), but maybe it flows over into the next page entirely..
> > 
> > Now, it all uses "vnsprintf()" which _should_ be safe, but that in turn 
> > uses pointer comparisons, so maybe gcc screws that up. Who knows. Gcc has 
> > been known to use signed comparisons on pointers and other brokenness. And 
> > we could just have screwed something up (not updating "len" when we update 
> > the buffer start etc etc)
> 
> Maybe.  Something which is as deterministic as that would trip
> CONFIG_DEBUG_PAGEALLOC though.
> 
> > Anyway, this trivial patch will check for buffer length consistency and 
> > overflow by just putting a magic value at the end of the buffer, and 
> > checking it. Maybe.
> > 
> > I don't see anything wrong there, and booting with this patch doesn't 
> > trigger anything for me, but it's simple enough to be worth checking out.
> 
> Yup.  The wind changed direction and I lost the ability to reproduce it for
> a few hours.  Going back to exactly the rc5-mm2 lineup did the trick.  Your
> debug patch didn't trigger.
> 
> 
> 
> SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
> SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
> audit(1141604160.744:3): policy loaded auid=4294967295
> SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
> BUG: unable to handle kernel paging request at virtual address 44535955
>  printing eip:
> *pde = 00000000
> Oops: 0000 [#1]
> last sysfs file: /class/firmware/0000:06:0b.0/loading
> Modules linked in: ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd ieee80211 uhci_hcd sg joydev ieee80211_crypt snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm hw_random i2c_i801 snd_timer piix snd i2c_core soundcore snd_page_alloc generic ext3 jbd ide_disk ide_core
> CPU:    0
> EIP:    0060:[<c02437ed>]    Not tainted VLI
> EFLAGS: 00010246   (2.6.16-rc5-mm2 #302) 
> EIP is at choose_configuration+0x58/0x13a
> eax: 000000d2   ebx: f70aac00   ecx: 44535950   edx: 000001f4
> esi: f76eca00   edi: 00000001   ebp: 00000000   esp: f7f3def8
> ds: 007b   es: 007b   ss: 0068
> Process khubd (pid: 114, threadinfo=f7f3d000 task=f7f3cab0)
> Stack: <0>00000000 0000d000 f70aac00 f70aac00 00000000 00000001 c0243a5a 00000000 
>        f70aac00 c1caa3c0 c0244ba1 00000000 0000000a f76ec814 f7029000 00000011 
>        c1caa3c0 00000001 f7029000 00000001 c0244ff8 00000003 00000000 f76ec800 
> Call Trace:
>  <c0243a5a> usb_new_device+0x15a/0x1d0   <c0244ba1> hub_port_connect_change+0x1ad/0x343
>  <c0244ff8> hub_events+0x2c1/0x3f1   <c0245128> hub_thread+0x0/0xe6
>  <c0245132> hub_thread+0xa/0xe6   <c0128dfb> autoremove_wake_function+0x0/0x2d
>  <c0128dfb> autoremove_wake_function+0x0/0x2d   <c0128b0c> kthread+0x9c/0xa1
>  <c0128a70> kthread+0x0/0xa1   <c01012ed> kernel_thread_helper+0x5/0xb
> Code: 00 58 39 fd c7 04 24 00 00 00 00 7d 53 0f b6 46 08 8b 8e 90 00 00 00 0f b7 93 18 02 00 00 01 c0 83 c1 08 39 d0 7f 2e 85 ed 75 0a <80> 79 05 02 0f 84 bf 00 00 00 80 bb 7c 01 00 00 ff 74 0a 80 79 
>  <6>Non-volatile memory driver v1.2
> BUG: unable to handle kernel paging request at virtual address 6963702f
>  printing eip:
> c01bc9b0
> *pde = 00000000
> Oops: 0000 [#2]
> last sysfs file: /class/firmware/0000:06:0b.0/loading
> Modules linked in: nvram ide_cd cdrom ipw2200 ohci1394 ieee1394 ehci_hcd ieee80211 uhci_hcd sg joydev ieee80211_crypt snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm hw_random i2c_i801 snd_timer piix snd i2c_core soundcore snd_page_alloc generic ext3 jbd ide_disk ide_core
> CPU:    0
> EIP:    0060:[<c01bc9b0>]    Not tainted VLI
> EFLAGS: 00010246   (2.6.16-rc5-mm2 #302) 
> EIP is at kobject_uevent+0x28d/0x37c
> eax: 00000000   ebx: f76f76c0   ecx: ffffffff   edx: f7176efc
> esi: 0000002a   edi: 6963702f   ebp: f77542c0   esp: f7176f04
> ds: 007b   es: 007b   ss: 0068
> Process knodemgrd_0 (pid: 1025, threadinfo=f7176000 task=f7788ab0)
> Stack: <0>f70f1000 c034da80 00000002 f70f1051 f8cf6e4d f7097640 c02e93d8 f76e9568 
>        00000246 00000000 f7758a94 f76e9568 f76e9568 f76e9560 f76e9568 f8d28c40 
>        c021e752 00000000 f70473a0 00000000 f76e9400 f76e9440 f76e9560 f763c000 
> Call Trace:
>  <c021e752> class_device_add+0x14e/0x201   <f8cf1efe> nodemgr_create_node+0x104/0x17b [ieee1394]
>  <f8cf2b1a> nodemgr_node_scan_one+0x160/0x16e [ieee1394]   <f8cf2b76> nodemgr_node_scan+0x4e/0x50 [ieee1394]
>  <f8cf3277> nodemgr_host_thread+0xdc/0x14c [ieee1394]   <f8cf319b> nodemgr_host_thread+0x0/0x14c [ieee1394]
>  <c01012ed> kernel_thread_helper+0x5/0xb  
> Code: 24 1c 68 01 1d 2f c0 57 e8 6e 25 00 00 c7 44 24 18 02 00 00 00 83 c4 10 83 7d 08 00 74 4f 8b 44 24 08 83 c9 ff 8b 7c 85 00 31 c0 <f2> ae f7 d1 49 83 7b 64 00 8d 71 01 8b bb 9c 00 00 00 75 61 01 
>  <6>ACPI: AC Adapter [ACAD] (on-line)
> ACPI: Battery Slot [BAT1] (battery present)
> ACPI: Lid Switch [LID0]
> 
> 
> As is usual with this particular manifestation, we died here:
> 
> 		if (i == 0 && desc->bInterfaceClass == USB_CLASS_COMM
> 
> 
> (gdb) p sizeof(struct usb_interface_descriptor)
> $3 = 9
> (gdb) p sizeof(struct usb_host_config)
> $4 = 280
> (gdb) p sizeof(struct usb_interface_cache)
> $5 = 8
> (gdb) p sizeof(struct usb_host_interface)
> $6 = 28
> (gdb) offsetof usb_interface_descriptor bInterfaceClass
> 5 0x5
> 
> And that's quite different.  `struct usb_interface_cache' is only 8 bytes,
> and the scribble is (as far as we know), not at offset zero.  Unless there
> are a large number of `usb_host_interface's in `struct
> usb_interface_cache's variable-sized array, we're not using the size-512
> slab here.
> 
> <looks at choose_configuration()>
> 
> 		struct usb_interface_descriptor	*desc =
> 				&c->intf_cache[0]->altsetting->desc;
> 
> Seems funny.
> 
> struct usb_interface_cache {
> 	unsigned num_altsetting;	/* number of alternate settings */
> 	struct kref ref;		/* reference counter */
> 
> 	/* variable-length array of alternate settings for this interface,
> 	 * stored in no particular order */
> 	struct usb_host_interface altsetting[0];
> };
> 
> A clearer way of coding that assignment would have been
> 
> 		struct usb_interface_descriptor	*desc =
> 				&c->intf_cache[0]->altsetting[0].desc;

Agreed.

> a) How come we're only considering the zeroth slot in that array in here?

We start out with the first interface setting, as we always know we have
one of them as per the USB spec (I think, anyone from linux-usb-devel
want to verify this?)

> b) How do we know that there's actually anything _there_?  The length of
>    that variable-sized array doesn't seem to have been stored anywhere
>    obvious by usb_parse_configuration() and choose_configuration() doesn't
>    check.  What happens if the length was zero?

I don't think it is allowed to be, as all USB devices have to have at
least 1 interface.

> drivers/usb/core/config.c hasn't changed since October.  If it was this
> easy, we'd have hit it before now.
> 
> Greg, who knows this code?  Can we triple-check it please?

I'll walk through it again, it's some nasty crap, I agree.  Time to dig
out the USB spec...

Anyone else on linux-usb-devel want to verify it too?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-08  1:31               ` Greg KH
@ 2006-03-08  1:49                 ` Andrew Morton
  0 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2006-03-08  1:49 UTC (permalink / raw)
  To: Greg KH; +Cc: linux-usb-devel, torvalds, mingo, linux-kernel

Greg KH <greg@kroah.com> wrote:
>
> > a) How come we're only considering the zeroth slot in that array in here?
> 
>  We start out with the first interface setting, as we always know we have
>  one of them as per the USB spec (I think, anyone from linux-usb-devel
>  want to verify this?)

I think that code has to be OK - it's longstanding stuff and slab
debugging would have caught these sorts of things in a jiffy.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-05 23:48         ` Andrew Morton
  2006-03-06  3:27           ` Linus Torvalds
@ 2006-03-06  5:00           ` Linus Torvalds
  2006-03-06  7:47             ` Mike Galbraith
                               ` (2 more replies)
  2006-03-06 11:04           ` Jesper Juhl
  2 siblings, 3 replies; 28+ messages in thread
From: Linus Torvalds @ 2006-03-06  5:00 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg KH, Ingo Molnar, Linux Kernel Mailing List, Dave Jones

On Sun, 5 Mar 2006, Andrew Morton wrote:
> 
> For several days I've been getting repeatable oopses in the -mm kernel. 
> They occur once per ~30 boots, during initscripts.

Actually, having thought about this some more, I wonder if the bug isn't a 
hell of a lot simpler than we've given it credit for.

I think you're running with CONFIG_PREEMPT_VOLUNTARY, right?

And looking more closely, that thing is BROKEN. DaveJ - do Fedora kernels 
also enable that thing?

Ingo: as far as I can see, CONFIG_PREEMPT_VOLUNTARY is totally and utterly 
broken during bootup. It does:

	# define might_resched() cond_resched()

and then we have

	# define might_sleep() do { might_resched(); } while (0)

and but the fact is, we _know_ that "might_sleep()" is broken during early 
bootup. We know this, because when we ahev __might_sleep() enabled to 
warn about cases where we must not sleep, we've had those tests disabled 
during early boot for a long time, in order to avoid irritating and nasty 
known "sleeping function called from invalid context" messages:

	...
        if ((in_atomic() || irqs_disabled()) &&
            system_state == SYSTEM_RUNNING && !oops_in_progress) {
                if (time_before(jiffies, prev_jiffy + HZ) && prev_jiffy)
	...

Note in particular the "system_state == SYSTEM_RUNNING". It's there for a 
reason. Namely that we know that we do things that aren't valid during 
early bootup, and that we call functions that might sleep while we have 
interrupts disabled, for example.

HOWEVER, the "cond_resched()" does not take that into account at all, and 
will happily conditionally reschedule things at early bootup before we 
have set system_state to SYSTEM_RUNNING.

In other words, unless I've totally lost it, I think that 
CONFIG_PREEMPT_VOLUNTARY currently makes us re-schedule at points in the 
early boot that we _know_ are unsafe. We happen to not hit it very often, 
because (a) some of the time it doesn't matter and (b) when it matters, we 
seldom have "need_resched()" returning true, but I would not be at all 
surprised if Andrew's problems are because the scheduler heuristics make 
it happen when it shouldn't.

And the end result? I don't know. But we've traditionally run _all_ of the 
early boot ignoring the "might_sleep()" warnings, up until the point where 
we unlock the kernel lock, long after things like kmem_cache_init().

So I would not be surprised, for example, if we had kmem_cache_init() 
doing bad things because it got interrupts enabled at a point where it 
shouldn't, because it went through the scheduler. 

I dunno. I can't actually see what would corrupt anything, but the point 
is that we definitely do scheduling in places that have gotten absolutely 
_zero_ coverage, because we turned off the checks on purpose during early 
boot because the checks gave false positives.

And CONFIG_PREEMPT_VOLUNTARY turns those false positives into potential 
rescheduling events.

Maybe I'm crazy. But it looks really really broken to me.

Andrew, if I'm right, then this ugly patch should make a difference.

Is there something else I've missed?

			Linus

----
diff --git a/kernel/sched.c b/kernel/sched.c
index 12d291b..3454bb8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4028,6 +4028,8 @@ static inline void __cond_resched(void)
 	 */
 	if (unlikely(preempt_count()))
 		return;
+	if (unlikely(system_state != SYSTEM_RUNNING))
+		return;
 	do {
 		add_preempt_count(PREEMPT_ACTIVE);
 		schedule();

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  5:00           ` Linus Torvalds
@ 2006-03-06  7:47             ` Mike Galbraith
  2006-03-07  5:51               ` Mike Galbraith
  2006-03-06  9:14             ` Ingo Molnar
  2006-03-06 10:31             ` Andrew Morton
  2 siblings, 1 reply; 28+ messages in thread
From: Mike Galbraith @ 2006-03-06  7:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Greg KH, Ingo Molnar, Linux Kernel Mailing List,
	Dave Jones

On Sun, 2006-03-05 at 21:00 -0800, Linus Torvalds wrote:

> 
> Is there something else I've missed?

Maybe.  Does this add anything to the picture?  During boot,
recalc_task_prio() is called with now < p->timestamp.  This causes quite
a stir.  If you WARN_ON(now < p->timestamp) or printk, you'll have a
dead box due to hundreds of gripes as things churn.  Adding...

if (unlikely(now < p->timestamp))
	__sleep_time = 0ULL;

...turns it into exactly one gripe.

	-Mike


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  7:47             ` Mike Galbraith
@ 2006-03-07  5:51               ` Mike Galbraith
  2006-03-07  5:56                 ` Nick Piggin
  0 siblings, 1 reply; 28+ messages in thread
From: Mike Galbraith @ 2006-03-07  5:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Greg KH, Ingo Molnar, Linux Kernel Mailing List,
	Dave Jones

On Mon, 2006-03-06 at 08:47 +0100, Mike Galbraith wrote:
> On Sun, 2006-03-05 at 21:00 -0800, Linus Torvalds wrote:
> 
> > 
> > Is there something else I've missed?
> 
> Maybe.  Does this add anything to the picture?  During boot,
> recalc_task_prio() is called with now < p->timestamp.  This causes quite
> a stir.  If you WARN_ON(now < p->timestamp) or printk, you'll have a
> dead box due to hundreds of gripes as things churn.  Adding...
> 
> if (unlikely(now < p->timestamp))
> 	__sleep_time = 0ULL;
> 
> ...turns it into exactly one gripe.

Nope.  Further research shows that this is just a speed-step problem.
Still, the scheduler needs to protect itself, because even with all of
the speed-step stuff enabled, these continue to occurr even after
boot-up if you switch to a low power setting, which screws up the
scheduler.

	-Mike


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-07  5:51               ` Mike Galbraith
@ 2006-03-07  5:56                 ` Nick Piggin
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Piggin @ 2006-03-07  5:56 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Andrew Morton, Greg KH, Ingo Molnar,
	Linux Kernel Mailing List, Dave Jones

Mike Galbraith wrote:
> On Mon, 2006-03-06 at 08:47 +0100, Mike Galbraith wrote:
> 
>>On Sun, 2006-03-05 at 21:00 -0800, Linus Torvalds wrote:
>>
>>
>>>Is there something else I've missed?
>>
>>Maybe.  Does this add anything to the picture?  During boot,
>>recalc_task_prio() is called with now < p->timestamp.  This causes quite
>>a stir.  If you WARN_ON(now < p->timestamp) or printk, you'll have a
>>dead box due to hundreds of gripes as things churn.  Adding...
>>
>>if (unlikely(now < p->timestamp))
>>	__sleep_time = 0ULL;
>>
>>...turns it into exactly one gripe.
> 
> 
> Nope.  Further research shows that this is just a speed-step problem.
> Still, the scheduler needs to protect itself, because even with all of
> the speed-step stuff enabled, these continue to occurr even after
> boot-up if you switch to a low power setting, which screws up the
> scheduler.
> 

sched_clock needs to be fixed, rather than scheduler.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  5:00           ` Linus Torvalds
  2006-03-06  7:47             ` Mike Galbraith
@ 2006-03-06  9:14             ` Ingo Molnar
  2006-03-06 10:31             ` Andrew Morton
  2 siblings, 0 replies; 28+ messages in thread
From: Ingo Molnar @ 2006-03-06  9:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Greg KH, Linux Kernel Mailing List, Dave Jones


* Linus Torvalds <torvalds@osdl.org> wrote:

> @@ -4028,6 +4028,8 @@ static inline void __cond_resched(void)
>  	 */
>  	if (unlikely(preempt_count()))
>  		return;
> +	if (unlikely(system_state != SYSTEM_RUNNING))
> +		return;

we used to have this, but it was frowned upon during the initial lkml 
review of -VP so i fixed all the known 'early bootup scheduling 
assumptions' and removed this condition .. but that was a losing battle 
i suspect. In any case, i fully agree with your fix. We also need this 
fix for v2.6.16.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06  5:00           ` Linus Torvalds
  2006-03-06  7:47             ` Mike Galbraith
  2006-03-06  9:14             ` Ingo Molnar
@ 2006-03-06 10:31             ` Andrew Morton
  2 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2006-03-06 10:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: greg, mingo, linux-kernel, davej

Linus Torvalds <torvalds@osdl.org> wrote:
>
> 
> 
> On Sun, 5 Mar 2006, Andrew Morton wrote:
> > 
> > For several days I've been getting repeatable oopses in the -mm kernel. 
> > They occur once per ~30 boots, during initscripts.
> 
> Actually, having thought about this some more, I wonder if the bug isn't a 
> hell of a lot simpler than we've given it credit for.
> 
> I think you're running with CONFIG_PREEMPT_VOLUNTARY, right?

Yes.

> ...
> And CONFIG_PREEMPT_VOLUNTARY turns those false positives into potential 
> rescheduling events.

Yes.  A similar problem exists with CONFIG_PREEMPT_BKL - we're doing down()
inside lock_kenrel() in situations where we shouldn't be sleeping yet. 
That's one of the reasons why might_sleep() is suppressed early in boot. 
We assume that the down() won't be contented early in boot.  We were
looking at fixing all that up a month or so back, but the patches started
getting pretty ugly.

Still.  If either lock_kernel() or cond_resched() _do_ sleep, we'll
hopefully get scheduling-while-atomic warnings.

> index 12d291b..3454bb8 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -4028,6 +4028,8 @@ static inline void __cond_resched(void)
>  	 */
>  	if (unlikely(preempt_count()))
>  		return;
> +	if (unlikely(system_state != SYSTEM_RUNNING))
> +		return;
>  	do {
>  		add_preempt_count(PREEMPT_ACTIVE);
>  		schedule();

It died after 35 reboots.

SELinux: initialized (dev rootfs, type rootfs), uses genfs_contexts
SELinux: initialized (dev sysfs, type sysfs), uses genfs_contexts
audit(1141611448.702:3): policy loaded auid=4294967295
SELinux: initialized (dev usbfs, type usbfs), uses genfs_contexts
BUG: unable to handle kernel paging request at virtual address 464c4583
 printing eip:
*pde = 3f1ea067
Oops: 0002 [#1]
last sysfs file: /devices/pci0000:00/0000:00:1d.2/usb3/idVendor
Modules linked in: nvram snd_hda_intel snd_hda_codec snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq ide_cd cdrom snd_seq_device snd_pcm_oss ipw2200 snd_mixer_oss ieee80211 ieee80211_crypt ohci1394 snd_pcm ieee1394 uhci_hcd sg ehci_hcd joydev snd_timer snd soundcore i2c_i801 piix i2c_core hw_random snd_page_alloc generic ext3 jbd ide_disk ide_core
CPU:    0
EIP:    0060:[<c0128d66>]    Not tainted VLI
EFLAGS: 00010046   (2.6.16-rc5-mm2 #303) 
EIP is at prepare_to_wait+0x19/0x3b
eax: 464c457f   ebx: f776cf08   ecx: 00000246   edx: f776cf14
esi: f76fde00   edi: 00000041   ebp: f71378d0   esp: f776ced0
ds: 007b   es: 007b   ss: 0068
Process loadkeys (pid: 1119, threadinfo=f776c000 task=f7eaa030)
Stack: <0>f7137940 f776cefc f776cf1c c015db02 f7137940 f776cee8 00000000 f7eaa030 
       c0128e0b f776cf14 f776cf14 00000246 c02c0f02 c015dc4b 00000000 f7eaa030 
       c0128e0b f776cf14 f776cf14 c01a0a63 f776cf34 00000000 f76fde00 f7cbb240 
Call Trace:
 <c015db02> pipe_wait+0x66/0x90   <c0128e0b> autoremove_wake_function+0x0/0x2d
 <c02c0f02> __mutex_lock_slowpath+0x1e1/0x339   <c015dc4b> pipe_readv+0x58/0x25a
 <c0128e0b> autoremove_wake_function+0x0/0x2d   <c01a0a63> file_has_perm+0x87/0x8a
 <c015dd36> pipe_readv+0x143/0x25a   <c015de69> pipe_read+0x1c/0x20
 <c0153d36> vfs_read+0xae/0x14e   <c015405e> sys_read+0x3c/0x64
 <c0102a9b> sysenter_past_esp+0x54/0x75  
Code: 43 04 00 02 20 00 c7 42 0c 00 01 10 00 56 9d 5b 5e c3 57 89 cf 56 89 c6 53 89 d3 83 22 fe 9c 59 fa 8d 52 0c 39 53 0c 75 0d 8b 00 <89> 50 04 89 43 0c 89 72 04 89 16 83 7b 04 00 74 0b b8 00 f0 ff 


That's "\7fELF" again, plus 4 bytes.

Note that this time it's the sleeper that went splat - usually it's the
waker.


I don't think that a bisection search of mainline (adding the sched patches
each time) will be fruitful really.  Just breathing on this thing makes the
crashes go away.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-05 23:48         ` Andrew Morton
  2006-03-06  3:27           ` Linus Torvalds
  2006-03-06  5:00           ` Linus Torvalds
@ 2006-03-06 11:04           ` Jesper Juhl
  2006-03-06 11:15             ` Andrew Morton
  2 siblings, 1 reply; 28+ messages in thread
From: Jesper Juhl @ 2006-03-06 11:04 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Greg KH, torvalds, Ingo Molnar, linux-kernel

On 3/6/06, Andrew Morton <akpm@osdl.org> wrote:
>
> Might as well cc the list on this, so there's a record...
>
> For several days I've been getting repeatable oopses in the -mm kernel.
> They occur once per ~30 boots, during initscripts.
>
[snip]

Hi Andrew,

I send a mail to the list yesterday (with you and a few other people
on cc) about Slab corruption in 2.6.16-5c5-mm2.
You have a problem with memory corruption and I'm seeing slab
corruption messages in dmesg...

Just wanted to point it out in this thread in case they are related.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please      http://www.expita.com/nomime.html

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06 11:04           ` Jesper Juhl
@ 2006-03-06 11:15             ` Andrew Morton
  2006-03-06 15:59               ` Dave Jones
  0 siblings, 1 reply; 28+ messages in thread
From: Andrew Morton @ 2006-03-06 11:15 UTC (permalink / raw)
  To: Jesper Juhl; +Cc: greg, torvalds, mingo, linux-kernel

"Jesper Juhl" <jesper.juhl@gmail.com> wrote:
>
> On 3/6/06, Andrew Morton <akpm@osdl.org> wrote:
> >
> > Might as well cc the list on this, so there's a record...
> >
> > For several days I've been getting repeatable oopses in the -mm kernel.
> > They occur once per ~30 boots, during initscripts.
> >
> [snip]
> 
> Hi Andrew,
> 
> I send a mail to the list yesterday (with you and a few other people
> on cc) about Slab corruption in 2.6.16-5c5-mm2.
> You have a problem with memory corruption and I'm seeing slab
> corruption messages in dmesg...
> 
> Just wanted to point it out in this thread in case they are related.
> 

Yes, there have been several memory-scribblish reports.  Until this one's
found I'm not sure there's much we can do with them.

I'll do an overnight run with CONFIG_DEBUG_SLAB, see if that catches
anything.  Although AFAICT it just fixes the bug.

Beyond that, I'm rather stuck.  I'd probably have to drop a pile of sched
patches, see if that improves things for people.  But if we cannot spot a
bug in those patches then it's probably in mainline and it will bite
people in generally random and rare ways.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Fw: Re: oops in choose_configuration()
  2006-03-06 11:15             ` Andrew Morton
@ 2006-03-06 15:59               ` Dave Jones
  0 siblings, 0 replies; 28+ messages in thread
From: Dave Jones @ 2006-03-06 15:59 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Jesper Juhl, greg, torvalds, mingo, linux-kernel

On Mon, Mar 06, 2006 at 03:15:50AM -0800, Andrew Morton wrote:

 > Yes, there have been several memory-scribblish reports.  Until this one's
 > found I'm not sure there's much we can do with them.
 > 
 > I'll do an overnight run with CONFIG_DEBUG_SLAB, see if that catches
 > anything.  Although AFAICT it just fixes the bug.
 > 
 > Beyond that, I'm rather stuck.  I'd probably have to drop a pile of sched
 > patches, see if that improves things for people.  But if we cannot spot a
 > bug in those patches then it's probably in mainline and it will bite
 > people in generally random and rare ways.

Not sure if it was you or Linus I private-mailed about this, but
to reiterate - circa rc4-git2, the Fedora rawhide kernel stopped booting
for a bunch of people, all with 686-SMP boxes. I saw it myself too,
it hung just after the 'write protecting kernel rodata'.

It totally puzzled me for a day.  The following day, I rebased to rc4-git4,
and the problem "went away".  Nothing in the changesets merged could
explain the hangs I saw.

A few days ago, the exact same bug resurfaced, and like before, the
following day, it went into hiding again.

I'm terrified the release kernel I build for FC5 is going to be a
stinker because of this, but every time I tried to instrument this,
the bugger disappeared.

I'm not sure this is even remotely the same bug you're chasing,
but the unpredictability of really nasty bugs like these give
me the creeps.

		Dave

-- 
http://www.codemonkey.org.uk

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2006-03-08  6:15 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-08  3:29 Fw: Re: oops in choose_configuration() Chuck Ebbert
2006-03-08  3:48 ` Dmitry Torokhov
2006-03-08  4:01 ` Linus Torvalds
  -- strict thread matches above, loose matches on Subject: below --
2006-03-07 21:54 Chuck Ebbert
2006-03-08  0:54 ` Greg KH
2006-03-08  0:57 ` Linus Torvalds
2006-03-08  1:13   ` Dmitry Torokhov
2006-03-08  1:27     ` Greg KH
2006-03-08  3:22       ` Dmitry Torokhov
2006-03-08  5:23         ` Greg KH
2006-03-08  6:06           ` Dmitry Torokhov
2006-03-08  6:15             ` Greg KH
2006-03-08  4:28   ` Joe Korty
2006-03-08  4:44     ` Dmitry Torokhov
     [not found] <20060304121723.19fe9b4b.akpm@osdl.org>
     [not found] ` <Pine.LNX.4.64.0603041235110.22647@g5.osdl.org>
     [not found]   ` <20060304213447.GA4445@kroah.com>
     [not found]     ` <20060304135138.613021bd.akpm@osdl.org>
     [not found]       ` <20060304221810.GA20011@kroah.com>
2006-03-05 23:48         ` Andrew Morton
2006-03-06  3:27           ` Linus Torvalds
2006-03-06  8:48             ` Andrew Morton
2006-03-08  1:31               ` Greg KH
2006-03-08  1:49                 ` Andrew Morton
2006-03-06  5:00           ` Linus Torvalds
2006-03-06  7:47             ` Mike Galbraith
2006-03-07  5:51               ` Mike Galbraith
2006-03-07  5:56                 ` Nick Piggin
2006-03-06  9:14             ` Ingo Molnar
2006-03-06 10:31             ` Andrew Morton
2006-03-06 11:04           ` Jesper Juhl
2006-03-06 11:15             ` Andrew Morton
2006-03-06 15:59               ` Dave Jones

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox