Re: [BUG] test9 ACPI bad: scheduling while atomic!

public inbox for linux-acpi@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
@ 2003-10-27  8:22 Noah J. Misch
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Noah J. Misch @ 2003-10-27  8:22 UTC (permalink / raw)
  To: alex.williamson-VXdhtT5mjnY
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

Hi,

>    On an Omnibook 500 running test9, removing AC power causes an
> immediate hang.  This laptop is getting a little old and I have to force
> on ACPI support, but this did not happen with test8.  The bug and panic

I have this problem as well, on a Sony Vaio, model PCG-571L.

> are shown below.  It looks like the AML associated with the AC event is
> trying to do an AML_SLEEP_OP.  Since this is called while in the
> interrupt handler, and the eventual call to acpi_os_sleep() sets the
> current state to interruptible... boom.  One simple, but terribly ugly,
> workaround is to make acpi_os_sleep() call acpi_os_stall() if
> in_atomic() is true (patch below).  Hopefully there's a better way to
> fix this.  Somehow the interpreter really needs to drop interrupt
> context before it starts making calls like this.  Thanks,

This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
They come from a patch Shaohua Li submitted for kernel bug 1171 at
bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
context, whereas before it always ran from a workqueue.  It does non-interrupt
like things, like sleeping and kmalloc'ing with GFP_KERNEL.

This was obvious on my system because it has no ECDT table, and as such
acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
ECDT it would only do so for a brief time during boot, and the problem would be
much more subtle.  That's probably why nobody noticed this in earlier tests.

I reversed cset 1.1337.43.3 as follows, and that fixed the problem:

bk export -tpatch -r1.1337.43.3 | patch -p1 -R

I can't figure out why that patch fixed the oops in bug 1171.  It was a hook
into the ec address space handler, not the gpe handler, that led to the oops,
yet the patch seems to only modify gpe-related code.  Perhaps you could explain,
Shaohua?

I'd guess the T40 oops results from the ACPI_MEM_FREE on line 305 of
drivers/acpi/events/evregion.c freeing already-freed memory.  I'm actually not
sure why that free is even there.  I also can't figure why only SMP-configured
kernels exhibited the problem.  If someone has the problem hardware, I am
willing to debug it, however.

The errant patch does address what seems to be a race condition that could play
out as follows:

1) The early ECDT probe locates an ECDT and registers a handler for the relevant
GPE and address space.

2) An IRQ triggers acpi_ec_gpe_handler, which schedules acpi_ec_gpe_query.

3) ACPI scans for devices and adds the "real" embedded controller device,
freeing the (temporary) context of the old GPE query handler.

4) Queue runs acpi_ec_gpe_query with a context that has already been kfree'd,
causing it to fail.

It seems rather theoretical, but perhaps we could fix it with a patch like the
following.  I tested it for kicks and didn't hit any problems, but I'm afraid it
risks more problems than it solves.  Thoughts?

--- 1.27/drivers/acpi/ec.c	Mon Oct 27 03:50:57 2003
+++ edited/drivers/acpi/ec.c	Mon Oct 27 03:51:57 2003
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/delay.h>
+#include <linux/workqueue.h>
 #include <linux/proc_fs.h>
 #include <asm/io.h>
 #include <acpi/acpi_bus.h>
@@ -593,6 +594,10 @@
 			ACPI_ADR_SPACE_EC, &acpi_ec_space_handler);

 		acpi_remove_gpe_handler(NULL, ec_ecdt->gpe_bit, &acpi_ec_gpe_handler);
+
+		/* Clear any pending GPE queries before freeing the context for
+		   their handlers */
+		flush_scheduled_work();

 		kfree(ec_ecdt);
 	}

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
@ 2003-10-27 16:47   ` Alex Williamson
       [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
  2003-10-27 20:24   ` [ACPI] " Nate Lawson
  1 sibling, 1 reply; 6+ messages in thread
From: Alex Williamson @ 2003-10-27 16:47 UTC (permalink / raw)
  To: Noah J. Misch
  Cc: linux-kernel, acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

On Mon, 2003-10-27 at 01:22, Noah J. Misch wrote:

> This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
> They come from a patch Shaohua Li submitted for kernel bug 1171 at
> bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
> context, whereas before it always ran from a workqueue.  It does non-interrupt
> like things, like sleeping and kmalloc'ing with GFP_KERNEL.
> 
> This was obvious on my system because it has no ECDT table, and as such
> acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> ECDT it would only do so for a brief time during boot, and the problem would be
> much more subtle.  That's probably why nobody noticed this in earlier tests.
> 

  I don't have an ECDT either.  Is it possible that the setting of
ec_device_init = 1 is simply misplaced?  I can see why we wouldn't want
to call acpi_os_queue_for_execution() early in bootup, but there ought
to be a fixed point after which it's ok, regardless of whether the
system has the ECDT table.  Would it be sufficient to set ec_device_init
to 1 at the beginning of acpi_ec_add(), with no dependency on the ECDT
table?

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab



-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
@ 2003-10-27 18:02       ` Noah J. Misch
  0 siblings, 0 replies; 6+ messages in thread
From: Noah J. Misch @ 2003-10-27 18:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-kernel, acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

On Mon, 27 Oct 2003, Alex Williamson wrote:

> > This was obvious on my system because it has no ECDT table, and as such
> > acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> > ECDT it would only do so for a brief time during boot, and the problem would be
> > much more subtle.  That's probably why nobody noticed this in earlier tests.
> >
>
>   I don't have an ECDT either.  Is it possible that the setting of
> ec_device_init = 1 is simply misplaced?

It is misplaced.  If revision 1.26 of ec.c were otherwise sound, I would place
ec_device_init = 1 right before the call to acpi_install_gpe_handler in
acpi_ec_start.  Anywhere outside that if and between where _add removes the
handlers and _start installs them would work.  This would fix your crash, but
it's not the right fix.

> I can see why we wouldn't want to call acpi_os_queue_for_execution() early in
> bootup, but there ought to be a fixed point after which it's ok, regardless of
> whether the system has the ECDT table.

I don't think early calls to schedule_work (via acpi_os_queue_for_execution) are
a problem.  The call to init_workqueues is just before do_initcalls in
do_basic_setup, so it happens earlier than all this stuff.

The more general problem is that acpi_ec_gpe_query cannot run in an interrupt
handler as written.  It used to always run from a queue.  We can either fix it
so it can run from an interrupt handler or change it back to never doing so.  I
favor the latter, especially because I don't see how the recent change fixed the
problem T40 users were experiencing.

> Would it be sufficient to set ec_device_init to 1 at the beginning of
> acpi_ec_add(), with no dependency on the ECDT table?

That particular placement looks racy.  I would do it after removing the
handlers, as explained above.

Thanks,
Noah

-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [ACPI] Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
  2003-10-27 16:47   ` Alex Williamson
@ 2003-10-27 20:24   ` Nate Lawson
  1 sibling, 0 replies; 6+ messages in thread
From: Nate Lawson @ 2003-10-27 20:24 UTC (permalink / raw)
  To: Noah J. Misch
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 27 Oct 2003, Noah J. Misch wrote:
> > are shown below.  It looks like the AML associated with the AC event is
> > trying to do an AML_SLEEP_OP.  Since this is called while in the
> > interrupt handler, and the eventual call to acpi_os_sleep() sets the
> > current state to interruptible... boom.  One simple, but terribly ugly,
> > workaround is to make acpi_os_sleep() call acpi_os_stall() if
> > in_atomic() is true (patch below).  Hopefully there's a better way to
> > fix this.  Somehow the interpreter really needs to drop interrupt
> > context before it starts making calls like this.  Thanks,

I thought a change was committed to address this, calling Stall for up to
255 us and Sleep for more than that.

-Nate

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [BUG] test9 ACPI bad: scheduling while atomic!
@ 2003-10-28  2:52 Li, Shaohua
       [not found] ` <571ACEFD467F7749BC50E0A98C17CDD8D5FDBB-4yWAQGcml64gGBtAFL8yw7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Li, Shaohua @ 2003-10-28  2:52 UTC (permalink / raw)
  To: Noah J. Misch; +Cc: acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

Hi all,

Pl. sees comments below.

> I can't figure out why that patch fixed the oops in bug 1171.  It was
a
> hook
> into the ec address space handler, not the gpe handler, that led to
the
> oops,
> yet the patch seems to only modify gpe-related code.  Perhaps you
could
> explain,

> The errant patch does address what seems to be a race condition that
could
> play
> out as follows:
> 
> 1) The early ECDT probe locates an ECDT and registers a handler for
the
> relevant
> GPE and address space.
> 
> 2) An IRQ triggers acpi_ec_gpe_handler, which schedules
acpi_ec_gpe_query.
> 
> 3) ACPI scans for devices and adds the "real" embedded controller
device,
> freeing the (temporary) context of the old GPE query handler.
> 
> 4) Queue runs acpi_ec_gpe_query with a context that has already been
> kfree'd,
> causing it to fail.

The reason is just what you said. It's a race condition.


> I'd guess the T40 oops results from the ACPI_MEM_FREE on line 305 of
> drivers/acpi/events/evregion.c freeing already-freed memory.  I'm
actually
> not
> sure why that free is even there.  I also can't figure why only SMP-
> configured
> kernels exhibited the problem.  If someone has the problem hardware, I
am
> willing to debug it, however.

In UP, the problem occurs also. But it can't cause anything, because
spinlock does nothing in UP.
In some machines which have the problem in UP, you will see some info
like this:
Hwregs-0760[35] hw_low_level_read: Unsupported address space:C8

> 
> It seems rather theoretical, but perhaps we could fix it with a patch
like
> the
> following.  I tested it for kicks and didn't hit any problems, but I'm
> afraid it
> risks more problems than it solves.  Thoughts?

Did just discarding the events cause problem?


-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* RE: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] ` <571ACEFD467F7749BC50E0A98C17CDD8D5FDBB-4yWAQGcml64gGBtAFL8yw7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2003-10-28 23:56   ` Noah J. Misch
  0 siblings, 0 replies; 6+ messages in thread
From: Noah J. Misch @ 2003-10-28 23:56 UTC (permalink / raw)
  To: Li, Shaohua
  Cc: acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

On Tue, 28 Oct 2003, Li, Shaohua wrote:

> The reason is just what you said. It's a race condition.

> In UP, the problem occurs also. But it can't cause anything, because spinlock
> does nothing in UP. In some machines which have the problem in UP, you will
> see some info like this:
> Hwregs-0760[35] hw_low_level_read: Unsupported address space:C8

It's clearly a spinlock that leads up to Alex's oops, but the T40 oops?

> > It seems rather theoretical, but perhaps we could fix it with a patch like
> > the following.  I tested it for kicks and didn't hit any problems, but I'm
> > afraid it risks more problems than it solves.  Thoughts?
>
> Did just discarding the events cause problem?

Hmmm.  I don't imagine it would.  Indeed, registering the address space handler
alone may be good enough for ECDT use.  I'd favor that.

Regardless, we still have the Thinkpad T40 problem.

On Tue, 28 Oct 2003, Li, Shaohua wrote:

> one solution is using bh to handle such events before EC device added.
> After EC is loaded, we still use keventd to handle the events. Any idea?

Aren't BHs obsolete?

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2003-10-28 23:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-10-27  8:22 [BUG] test9 ACPI bad: scheduling while atomic! Noah J. Misch
     [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
2003-10-27 16:47   ` Alex Williamson
     [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
2003-10-27 18:02       ` Noah J. Misch
2003-10-27 20:24   ` [ACPI] " Nate Lawson
  -- strict thread matches above, loose matches on Subject: below --
2003-10-28  2:52 Li, Shaohua
     [not found] ` <571ACEFD467F7749BC50E0A98C17CDD8D5FDBB-4yWAQGcml64gGBtAFL8yw7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2003-10-28 23:56   ` Noah J. Misch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox