Re: [BUG] test9 ACPI bad: scheduling while atomic!

public inbox for linux-acpi@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
@ 2003-10-27  8:22 Noah J. Misch
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Noah J. Misch @ 2003-10-27  8:22 UTC (permalink / raw)
  To: alex.williamson-VXdhtT5mjnY
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

Hi,

>    On an Omnibook 500 running test9, removing AC power causes an
> immediate hang.  This laptop is getting a little old and I have to force
> on ACPI support, but this did not happen with test8.  The bug and panic

I have this problem as well, on a Sony Vaio, model PCG-571L.

> are shown below.  It looks like the AML associated with the AC event is
> trying to do an AML_SLEEP_OP.  Since this is called while in the
> interrupt handler, and the eventual call to acpi_os_sleep() sets the
> current state to interruptible... boom.  One simple, but terribly ugly,
> workaround is to make acpi_os_sleep() call acpi_os_stall() if
> in_atomic() is true (patch below).  Hopefully there's a better way to
> fix this.  Somehow the interpreter really needs to drop interrupt
> context before it starts making calls like this.  Thanks,

This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
They come from a patch Shaohua Li submitted for kernel bug 1171 at
bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
context, whereas before it always ran from a workqueue.  It does non-interrupt
like things, like sleeping and kmalloc'ing with GFP_KERNEL.

This was obvious on my system because it has no ECDT table, and as such
acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
ECDT it would only do so for a brief time during boot, and the problem would be
much more subtle.  That's probably why nobody noticed this in earlier tests.

I reversed cset 1.1337.43.3 as follows, and that fixed the problem:

bk export -tpatch -r1.1337.43.3 | patch -p1 -R

I can't figure out why that patch fixed the oops in bug 1171.  It was a hook
into the ec address space handler, not the gpe handler, that led to the oops,
yet the patch seems to only modify gpe-related code.  Perhaps you could explain,
Shaohua?

I'd guess the T40 oops results from the ACPI_MEM_FREE on line 305 of
drivers/acpi/events/evregion.c freeing already-freed memory.  I'm actually not
sure why that free is even there.  I also can't figure why only SMP-configured
kernels exhibited the problem.  If someone has the problem hardware, I am
willing to debug it, however.

The errant patch does address what seems to be a race condition that could play
out as follows:

1) The early ECDT probe locates an ECDT and registers a handler for the relevant
GPE and address space.

2) An IRQ triggers acpi_ec_gpe_handler, which schedules acpi_ec_gpe_query.

3) ACPI scans for devices and adds the "real" embedded controller device,
freeing the (temporary) context of the old GPE query handler.

4) Queue runs acpi_ec_gpe_query with a context that has already been kfree'd,
causing it to fail.

It seems rather theoretical, but perhaps we could fix it with a patch like the
following.  I tested it for kicks and didn't hit any problems, but I'm afraid it
risks more problems than it solves.  Thoughts?

--- 1.27/drivers/acpi/ec.c	Mon Oct 27 03:50:57 2003
+++ edited/drivers/acpi/ec.c	Mon Oct 27 03:51:57 2003
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/types.h>
 #include <linux/delay.h>
+#include <linux/workqueue.h>
 #include <linux/proc_fs.h>
 #include <asm/io.h>
 #include <acpi/acpi_bus.h>
@@ -593,6 +594,10 @@
 			ACPI_ADR_SPACE_EC, &acpi_ec_space_handler);

 		acpi_remove_gpe_handler(NULL, ec_ecdt->gpe_bit, &acpi_ec_gpe_handler);
+
+		/* Clear any pending GPE queries before freeing the context for
+		   their handlers */
+		flush_scheduled_work();

 		kfree(ec_ecdt);
 	}

^ permalink raw reply	[flat|nested] 4+ messages in thread

[parent not found: <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>]

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
@ 2003-10-27 16:47   ` Alex Williamson
       [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
  2003-10-27 20:24   ` [ACPI] " Nate Lawson
  1 sibling, 1 reply; 4+ messages in thread
From: Alex Williamson @ 2003-10-27 16:47 UTC (permalink / raw)
  To: Noah J. Misch
  Cc: linux-kernel, acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

On Mon, 2003-10-27 at 01:22, Noah J. Misch wrote:

> This problem stems from the changes in revision 1.26 of drivers/acpi/ec.c.
> They come from a patch Shaohua Li submitted for kernel bug 1171 at
> bugme.osdl.org.  That patch can cause acpi_ec_gpe_query to run in interrupt
> context, whereas before it always ran from a workqueue.  It does non-interrupt
> like things, like sleeping and kmalloc'ing with GFP_KERNEL.
> 
> This was obvious on my system because it has no ECDT table, and as such
> acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> ECDT it would only do so for a brief time during boot, and the problem would be
> much more subtle.  That's probably why nobody noticed this in earlier tests.
> 

  I don't have an ECDT either.  Is it possible that the setting of
ec_device_init = 1 is simply misplaced?  I can see why we wouldn't want
to call acpi_os_queue_for_execution() early in bootup, but there ought
to be a fixed point after which it's ok, regardless of whether the
system has the ECDT table.  Would it be sufficient to set ec_device_init
to 1 at the beginning of acpi_ec_add(), with no dependency on the ECDT
table?

	Alex

-- 
Alex Williamson                             HP Linux & Open Source Lab



-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/

^ permalink raw reply	[flat|nested] 4+ messages in thread

[parent not found: <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>]

* Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
@ 2003-10-27 18:02       ` Noah J. Misch
  0 siblings, 0 replies; 4+ messages in thread
From: Noah J. Misch @ 2003-10-27 18:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: linux-kernel, acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	shaohua.li-ral2JQCrhuEAvxtiuMwx3w,
	len.brown-ral2JQCrhuEAvxtiuMwx3w, jon-fdRWMHV75ajk1uMJSBkQmQ

On Mon, 27 Oct 2003, Alex Williamson wrote:

> > This was obvious on my system because it has no ECDT table, and as such
> > acpi_ec_gpe_query was _always_ running in interrupt context, whereas with an
> > ECDT it would only do so for a brief time during boot, and the problem would be
> > much more subtle.  That's probably why nobody noticed this in earlier tests.
> >
>
>   I don't have an ECDT either.  Is it possible that the setting of
> ec_device_init = 1 is simply misplaced?

It is misplaced.  If revision 1.26 of ec.c were otherwise sound, I would place
ec_device_init = 1 right before the call to acpi_install_gpe_handler in
acpi_ec_start.  Anywhere outside that if and between where _add removes the
handlers and _start installs them would work.  This would fix your crash, but
it's not the right fix.

> I can see why we wouldn't want to call acpi_os_queue_for_execution() early in
> bootup, but there ought to be a fixed point after which it's ok, regardless of
> whether the system has the ECDT table.

I don't think early calls to schedule_work (via acpi_os_queue_for_execution) are
a problem.  The call to init_workqueues is just before do_initcalls in
do_basic_setup, so it happens earlier than all this stuff.

The more general problem is that acpi_ec_gpe_query cannot run in an interrupt
handler as written.  It used to always run from a queue.  We can either fix it
so it can run from an interrupt handler or change it back to never doing so.  I
favor the latter, especially because I don't see how the recent change fixed the
problem T40 users were experiencing.

> Would it be sufficient to set ec_device_init to 1 at the beginning of
> acpi_ec_add(), with no dependency on the ECDT table?

That particular placement looks racy.  I would do it after removing the
handlers, as explained above.

Thanks,
Noah

-------------------------------------------------------
This SF.net email is sponsored by: The SF.net Donation Program.
Do you like what SourceForge.net is doing for the Open
Source Community?  Make a contribution, and help us add new
features and functionality. Click here: http://sourceforge.net/donate/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [ACPI] Re: [BUG] test9 ACPI bad: scheduling while atomic!
       [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
  2003-10-27 16:47   ` Alex Williamson
@ 2003-10-27 20:24   ` Nate Lawson
  1 sibling, 0 replies; 4+ messages in thread
From: Nate Lawson @ 2003-10-27 20:24 UTC (permalink / raw)
  To: Noah J. Misch
  Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	acpi-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f

On Mon, 27 Oct 2003, Noah J. Misch wrote:
> > are shown below.  It looks like the AML associated with the AC event is
> > trying to do an AML_SLEEP_OP.  Since this is called while in the
> > interrupt handler, and the eventual call to acpi_os_sleep() sets the
> > current state to interruptible... boom.  One simple, but terribly ugly,
> > workaround is to make acpi_os_sleep() call acpi_os_stall() if
> > in_atomic() is true (patch below).  Hopefully there's a better way to
> > fix this.  Somehow the interpreter really needs to drop interrupt
> > context before it starts making calls like this.  Thanks,

I thought a change was committed to address this, calling Stall for up to
255 us and Sleep for more than that.

-Nate

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2003-10-27 20:24 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-10-27  8:22 [BUG] test9 ACPI bad: scheduling while atomic! Noah J. Misch
     [not found] ` <Pine.GSO.4.58.0310262327040.19469-8Uwgm7wxmYs@public.gmane.org>
2003-10-27 16:47   ` Alex Williamson
     [not found]     ` <1067273229.7497.30.camel-Wmjt7DDUnIVxnVILBQAtiA@public.gmane.org>
2003-10-27 18:02       ` Noah J. Misch
2003-10-27 20:24   ` [ACPI] " Nate Lawson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox