* RE: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
@ 2006-11-18 19:01 Starikovskiy, Alexey Y
2006-11-18 19:05 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Starikovskiy, Alexey Y @ 2006-11-18 19:01 UTC (permalink / raw)
To: Linus Torvalds
Cc: Brown, Len, Adrian Bunk, Andrew Morton, David Brownell,
linux-acpi
>Feel free to send me test patches when working on these things, because I
>have no trouble at all to test my particular machine.
I've sent you a test patch back in July, but did not get a reply. Maybe
due to OLS?
Thanks,
Alex.
* RE: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 19:01 ACPI breakage (Re: 2.6.19-rc6: known regressions (v2)) Starikovskiy, Alexey Y
@ 2006-11-18 19:05 ` Linus Torvalds
[not found] ` <455FB44C.8050103@linux.intel.com>
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-11-18 19:05 UTC (permalink / raw)
To: Starikovskiy, Alexey Y
Cc: Brown, Len, Adrian Bunk, Andrew Morton, David Brownell,
linux-acpi
On Sat, 18 Nov 2006, Starikovskiy, Alexey Y wrote:
>
> I've sent you a test patch back in July, but did not get a reply. Maybe
> due to OLS?
Heh. Whenever you send me something like that, and I don't answer within a
few days, you can pretty much depend on me not answering - my mailqueue
just fills up too fast. And yeah, it might have been during OLS. Just
re-send when it happens.
Linus
* RE: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
@ 2006-11-18 16:23 Starikovskiy, Alexey Y
2006-11-18 17:12 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Starikovskiy, Alexey Y @ 2006-11-18 16:23 UTC (permalink / raw)
To: Linus Torvalds, Brown, Len, Adrian Bunk, Andrew Morton
Cc: David Brownell, linux-acpi
Maybe because it does not have a single common line with the previous
patch?
Or maybe because it fixes all the current AMD-HP notebooks?
Or maybe because it did not fail while being in -mm?
I will not "sneak it in" again, I promise.
Regards,
Alex.
-----Original Message-----
From: Linus Torvalds [mailto:torvalds@osdl.org]
Sent: Saturday, November 18, 2006 4:25 AM
To: Brown, Len; Starikovskiy, Alexey Y; Adrian Bunk; Andrew Morton
Cc: David Brownell; linux-acpi@vger.kernel.org
Subject: Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
On Fri, 17 Nov 2006, Linus Torvalds wrote:
>
> Total lockup - no sysrq, no messages, no nothing.
Dammit.
It looks like 37605a6900f6b4d886d995751fcfeef88c4e462c, and I should have
realized that immediately.
That commit re-introduces the bug that we already reverted once.
Why the hell did that idiotic thing go in, when we had to revert it once
already (see commit 72945b2b90a5554975b8f72673ab7139d232a121 for the
earlier revert)?
It was broken then, it is broken now. Nothing has changed.
Why did you guys try to sneak it in again? Last time this same "use a
second workqueue" patch went in (in a different form), we had _exactly_
the same problems, with total lockups, and way too high CPU usage.
The bugzilla entry that you refer to in that commit is even the same one
that discussed why the _original_ patch was totally broken.
It's even the same AUTHOR who wrote the original buggy patch, that pushed
through the same buggy patch AGAIN.
Dammit, this is frustrating.
Why did people expect it to suddenly not be buggy?
Linus
* RE: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 16:23 Starikovskiy, Alexey Y
@ 2006-11-18 17:12 ` Linus Torvalds
2006-11-18 19:05 ` David Brownell
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-11-18 17:12 UTC (permalink / raw)
To: Starikovskiy, Alexey Y
Cc: Brown, Len, Adrian Bunk, Andrew Morton, David Brownell,
linux-acpi
On Sat, 18 Nov 2006, Starikovskiy, Alexey Y wrote:
>
> Maybe because it does not have a single common line with the previous
> patch?
Yeah, I do agree that it _looks_ very different as a patch, but it ends up
having all the same execution profiles..
It's been too long since I debugged the previous problem, so I don't
remember the exact details any more (back then I enabled ACPI debugging
and watched the messages scroll by etc - this time I initially thought it
was interrupt-related due to the other irq problems we've had, so I
started bisecting immediately _without_ doing any ACPI debugging stuff,
and by the time I actually bisected down enough, I recognized the problem,
so I didn't do all the same "enable ACPI messages and look deeply into
what is going on" thing).
But if I remember correctly, what happens is _roughly_ something like
this:
- thermal event happens - the CPU is getting warm, and the fan needs to
start up. Quite often, this happened early during boot (which is quite
busy - some init scripts are disgustingly CPU-intensive mainly due to
using inefficient scripting languages), but if it didn't happen there,
it's easy enough to force to happen other ways.
- part of the handling is "acpi_os_execute()" for something (don't ask me
what), but the interesting thing is how that "acpi_os_execute()" then
ends up causing a _recursive_ event.
- we handle the original event in kacpid, and hand over the new one as a
notification event. But the event keeps on happening, and kacpid keeps
on running, and the other thread doesn't actually ever _run_ because
kacpid holds the ACPI lock and is constantly busy.
- we not only are constantly running in kernel space, we also end up
eventually running out of memory for allocating all the work queue
entries.
So the reason the old code works is because everything is done in a single
thread, and yes, we end up getting multiple events, but because the queue
is all done onto the same queue that is _handling_ the events in the first
place, and because it's a FIFO queue, the notification events get handled
_before_ the later events.
So with the single-threaded situation, you basically end up always doing
the events in the same order they came in. In the "two separate threads"
case, you don't, and one thread will end up generating events forever,
waiting for them to happen, but they never _do_ happen, so you have a
lockup _and_ eventually an infinite event queue for the other thread.
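As a rough illustration of that ordering argument, here is a toy model in
Python. It is only a sketch, not the kernel code: the names single_queue()
and two_queues() are made up, and it assumes that handling the notification
is what makes the hardware stop re-raising the event, and that in the
two-thread case the event thread is always picked first because it never
goes idle.

```python
from collections import deque


def single_queue(max_steps=1000):
    """One FIFO queue shared by events and the notifications they spawn."""
    queue = deque(["event"])
    condition_pending = True  # assumed: hardware re-raises until notified
    steps = 0
    while queue and steps < max_steps:
        item = queue.popleft()
        steps += 1
        if item == "event":
            queue.append("notify")      # handler defers a notification
            if condition_pending:
                queue.append("event")   # the event fires again meanwhile
        else:
            condition_pending = False   # notification clears the condition
    return steps, len(queue)


def two_queues(max_steps=1000):
    """Separate queues; the busy event thread starves the notify thread."""
    events = deque(["event"])
    notifications = deque()
    condition_pending = True
    steps = 0
    while (events or notifications) and steps < max_steps:
        steps += 1
        if events:
            # The event thread always has work, so it always runs first.
            events.popleft()
            notifications.append("notify")
            if condition_pending:
                events.append("event")
        else:
            # Only reached if the event thread ever runs out of work.
            notifications.popleft()
            condition_pending = False
    return steps, len(notifications)
```

In this model single_queue() drains after a handful of steps, because the
first notification sits ahead of the repeated event in the FIFO. two_queues()
only stops at max_steps, with one unhandled notification piled up per step,
which loosely mirrors the lockup plus the ever-growing work queue.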
> Or maybe because it fixes all the current AMD-HP notebooks?
> Or maybe because it did not fail while being in -mm?
I'm afraid that -mm doesn't get as much testing as it used to get.
Also, I do realize that the patch fixes other problems, but we have long
had a very strict policy that we do NOT accept regressions. Immediately
when you start accepting regressions, you will never know whether you're
going forward or backwards. It's better to have a known _old_ bug than to
introduce a new one.
So the "no regressions!" rule ends up trumping pretty much every single
other issue. It's unacceptable to have machines that used to work,
suddenly stop working. Even if it fixes another machine.
ACPI didn't use to have that rule, and it was wild and crazy. Maybe more
bugs got fixed, but the problem with accepting regressions is that nobody
can _ever_ trust that system. You do not want to have people _afraid_ of
upgrading - they should feel confident that upgrading never introduces any
new problems.
(Of course, that can never be reached 100%, but it's very much part of the
goal. It kind of falls into the same "backwards compatibility on
interfaces" absolute goal: it's ok to do new things, but you can never
allow them to break old programs)
> I will not "sneak it in" again, I promise.
Feel free to send me test patches when working on these things, because I
have no trouble at all to test my particular machine.
I think you'll find the ACPI dumps etc for that machine in your archives,
because I've sent them to Len and the acpi lists several times, but if you
want to get AML disassemblies etc, just tell me how. I've done them
before, but I work on this seldom enough that I always forget what the
magic incantations are, and where to get the tools etc.
Linus
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 17:12 ` Linus Torvalds
@ 2006-11-18 19:05 ` David Brownell
2006-11-18 22:09 ` Linus Torvalds
2006-11-19 4:33 ` David Brownell
0 siblings, 2 replies; 15+ messages in thread
From: David Brownell @ 2006-11-18 19:05 UTC (permalink / raw)
To: Alexey Starikovskiy, Linus Torvalds
Cc: Adrian Bunk, Andrew Morton, Brown, Len, linux-acpi
> On Sat, 18 Nov 2006, Starikovskiy, Alexey Y wrote:
>
> > Or maybe because it fixes all the current AMD-HP notebooks?
Whatever "it" is sure broke mine though... the one that's
currently on my lap! :)
Running right now with a patch reverting the update which
made trouble on Linus' machine, but without Alexey's two
tweaks to the EC interrupt handler. So far so good, even
after doing things which had previously caused AE_TIME
errors pretty quickly. But then, the errors weren't what
I'd call reproducible either.
Linus' explanation of what went wrong looks compatible with
the symptoms I've seen, FWIW.
- Dave
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 19:05 ` David Brownell
@ 2006-11-18 22:09 ` Linus Torvalds
2006-11-18 22:16 ` Adrian Bunk
2006-11-19 4:33 ` David Brownell
1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-11-18 22:09 UTC (permalink / raw)
To: David Brownell
Cc: Alexey Starikovskiy, Adrian Bunk, Andrew Morton, Brown, Len,
linux-acpi
On Sat, 18 Nov 2006, David Brownell wrote:
>
> Running right now with a patch reverting the update which
> made trouble on Linus' machine, but without Alexey's two
> tweaks to the EC interrupt handler. So far so good, even
> after doing things which had previously caused AE_TIME
> errors pretty quickly. But then, the errors weren't what
> I'd call reproducible either.
Ok, goodie.
Adrian, that means that there's one less regression on your list, unless
David reports that he can reproduce it again (I don't think he will be
able to: all the other ACPI changes looked relatively harmless, at least
in the particular area of ACPI changes I looked at)
Linus
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 22:09 ` Linus Torvalds
@ 2006-11-18 22:16 ` Adrian Bunk
0 siblings, 0 replies; 15+ messages in thread
From: Adrian Bunk @ 2006-11-18 22:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Brownell, Alexey Starikovskiy, Andrew Morton, Brown, Len,
linux-acpi
On Sat, Nov 18, 2006 at 02:09:56PM -0800, Linus Torvalds wrote:
>
>
> On Sat, 18 Nov 2006, David Brownell wrote:
> >
> > Running right now with a patch reverting the update which
> > made trouble on Linus' machine, but without Alexey's two
> > tweaks to the EC interrupt handler. So far so good, even
> > after doing things which had previously caused AE_TIME
> > errors pretty quickly. But then, the errors weren't what
> > I'd call reproducible either.
>
> Ok, goodie.
>
> Adrian, that means that there's one less regression on your list, unless
> David reports that he can reproduce it again (I don't think he will be
> able to: all the other ACPI changes looked relatively harmless, at least
> in the particular area of ACPI changes I looked at)
I had already removed it from my list based on David's email.
> Linus
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-18 19:05 ` David Brownell
2006-11-18 22:09 ` Linus Torvalds
@ 2006-11-19 4:33 ` David Brownell
2006-11-20 18:46 ` David Brownell
1 sibling, 1 reply; 15+ messages in thread
From: David Brownell @ 2006-11-19 4:33 UTC (permalink / raw)
To: Alexey Starikovskiy
Cc: Linus Torvalds, Adrian Bunk, Andrew Morton, Brown, Len,
linux-acpi
On Saturday 18 November 2006 11:05 am, David Brownell wrote:
>
> Running right now with a patch reverting the update which
> made trouble on Linus' machine, but without Alexey's two
> tweaks to the EC interrupt handler. So far so good, even
> after doing things which had previously caused AE_TIME
> errors pretty quickly. But then, the errors weren't what
> I'd call reproducible either.
Hmm, well after a reboot to sort out some other patches,
and at uptime of ~2 hours, I noticed confusion about
whether AC or battery power was active, then the old:
ACPI Exception (evregion-0424): AE_TIME, Returned by Handler for [EmbeddedControl] [20060707]
ACPI Exception (dswexec-0458): AE_TIME, While resolving operands for [OpcodeName unavailable] [20060707]
ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THRM._TMP] (Node ffff810002032d10), AE_TIME
So maybe that's not the entire story; sigh.
- Dave
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-19 4:33 ` David Brownell
@ 2006-11-20 18:46 ` David Brownell
0 siblings, 0 replies; 15+ messages in thread
From: David Brownell @ 2006-11-20 18:46 UTC (permalink / raw)
To: Alexey Starikovskiy
Cc: Linus Torvalds, Adrian Bunk, Andrew Morton, Brown, Len,
linux-acpi
On Saturday 18 November 2006 8:33 pm, David Brownell wrote:
> On Saturday 18 November 2006 11:05 am, David Brownell wrote:
> >
> > Running right now with a patch reverting the update which
> > made trouble on Linus' machine, but without Alexey's two
> > tweaks to the EC interrupt handler. So far so good, even
> > after doing things which had previously caused AE_TIME
> > errors pretty quickly. But then, the errors weren't what
> > I'd call reproducible either.
>
> Hmm, well after a reboot to sort out some other patches,
> and at uptime of ~2 hours, I noticed confusion about
> whether AC or battery power was active, then the old:
>
> ACPI Exception (evregion-0424): AE_TIME, Returned by Handler for [EmbeddedControl] [20060707]
> ACPI Exception (dswexec-0458): AE_TIME, While resolving operands for [OpcodeName unavailable] [20060707]
> ACPI Error (psparse-0537): Method parse/execution failed [\_TZ_.THRM._TMP] (Node ffff810002032d10), AE_TIME
>
> So maybe that's not the entire story; sigh.
Whatever it is, it hasn't shown its ugly little face since then.
So while it doesn't seem completely fixed ... it's nowhere near
as broken as it was previously.
- Dave
[parent not found: <Pine.LNX.4.64.0611152008450.3349@woody.osdl.org>]
* 2.6.19-rc6: known regressions (v2)
[not found] <Pine.LNX.4.64.0611152008450.3349@woody.osdl.org>
@ 2006-11-17 20:40 ` Adrian Bunk
2006-11-17 23:58 ` ACPI breakage (Re: 2.6.19-rc6: known regressions (v2)) Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Adrian Bunk @ 2006-11-17 20:40 UTC (permalink / raw)
To: Linus Torvalds, Andrew Morton
Cc: Linux Kernel Mailing List, Thomas Gleixner, Alan Stern,
Ingo Molnar, davej, cpufreq, Alexey Starikovskiy, Mattia Dongili,
Andre Noll, Andi Kleen, discuss, Prakash Punnoor, phil.el,
oprofile-list, Ray Lee, Michael Buesch, Larry Finger, st3,
linville, netdev, David Brownell, Len Brown, linux-acpi,
Ernst Herzberg
This email lists some known regressions in 2.6.19-rc6 compared to 2.6.18
that are not yet fixed in Linus' tree.
If you find your name in the Cc header, you are either the submitter of one
of the bugs, the maintainer of an affected subsystem or driver, a patch
of yours caused a breakage, or I'm considering you in some other way possibly
involved with one or more of these issues.
Due to the large number of recipients, please trim the Cc when answering.
Subject    : cpufreq notification broken
References : http://lkml.org/lkml/2006/11/16/177
Submitter  : Thomas Gleixner <tglx@timesys.com>
Caused-By  : Alan Stern <stern@rowland.harvard.edu>
             commit b4dfdbb3c707474a2254c5b4d7e62be31a4b7da9
Handled-By : Ingo Molnar <mingo@elte.hu>
             Linus Torvalds <torvalds@osdl.org>
Status     : patches are being discussed

Subject    : CPU_FREQ_GOV_ONDEMAND=y compile error
References : http://lkml.org/lkml/2006/11/17/198
Submitter  : alex1000@comcast.net
Caused-By  : Alexey Starikovskiy <alexey_y_starikovskiy@linux.intel.com>
             commit 05ca0350e8caa91a5ec9961c585c98005b6934ea
Handled-By : Mattia Dongili <malattia@linux.it>
Patch      : http://lkml.org/lkml/2006/11/17/236
Status     : patch available

Subject    : x86_64: Bad page state in process 'swapper'
References : http://lkml.org/lkml/2006/11/10/135
             http://lkml.org/lkml/2006/11/10/208
Submitter  : Andre Noll <maan@systemlinux.org>
Handled-By : Andi Kleen <ak@suse.de>
Status     : Andi is investigating

Subject    : x86_64: oprofile doesn't work
References : http://lkml.org/lkml/2006/10/27/3
             http://lkml.org/lkml/2006/11/15/92
Submitter  : Prakash Punnoor <prakash@punnoor.de>
Status     : problem is being discussed

Subject    : bcm43xx: serious problems
References : http://lkml.org/lkml/2006/11/15/296
Submitter  : Ray Lee <ray-lk@madrabbit.org>
Handled-By : Michael Buesch <mb@bu3sch.de>
             Larry Finger <Larry.Finger@lwfinger.net>
Status     : problem is being debugged

Subject    : nasty ACPI regression, AE_TIME errors
References : http://lkml.org/lkml/2006/11/15/12
Submitter  : David Brownell <david-b@pacbell.net>
Handled-By : Len Brown <len.brown@intel.com>
             Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
Status     : problem is being debugged

Subject    : ThinkPad R50p: boot fail with (lapic && on_battery)
References : http://lkml.org/lkml/2006/10/31/333
Submitter  : Ernst Herzberg <earny@net4u.de>
Handled-By : Len Brown <len.brown@intel.com>
Status     : problem is being debugged
* ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-17 20:40 ` 2.6.19-rc6: known regressions (v2) Adrian Bunk
@ 2006-11-17 23:58 ` Linus Torvalds
2006-11-18 1:25 ` Linus Torvalds
0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2006-11-17 23:58 UTC (permalink / raw)
To: Len Brown, Adrian Bunk, Andrew Morton; +Cc: David Brownell, linux-acpi
On Fri, 17 Nov 2006, Adrian Bunk wrote:
>
> Subject    : nasty ACPI regression, AE_TIME errors
> References : http://lkml.org/lkml/2006/11/15/12
> Submitter  : David Brownell <david-b@pacbell.net>
> Handled-By : Len Brown <len.brown@intel.com>
>              Alexey Starikovskiy <alexey.y.starikovskiy@linux.intel.com>
> Status     : problem is being debugged
I do not know if this is related, but testing one of my laptops (always a
good idea to check the week before release) shows that my trusty old
Compaq N620c locks up rather quickly at boot with the current -git tree.
Total lockup - no sysrq, no messages, no nothing.
I've mostly bisected it (what the _hell_ did we do before "git bisect"?),
and right now I know:
commit 9aaed2b42d00d4abb2748d72d599a8033600e2bf is bad (that's Len's "pull
trivial into test branch" commit).
v2.6.19-rc2 seems all good.
Which leaves a chunk of just a few ACPI commits left to bisect.
I'll do five or so more reboots, and I should be able to tell exactly
which commit breaks. It almost always locks up very early during boot
(generally during the "initializing udev" phase), although sometimes it
survives a bit further..
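For readers wondering why a handful of reboots is enough: bisection is a
binary search over the commits between a known-good and a known-bad point,
so each test roughly halves what is left. A minimal sketch in Python,
assuming a linear history; first_bad() and the commit names are
hypothetical and this is not how "git bisect" is implemented.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    commits[0] must be known good and commits[-1] known bad.
    Returns (first bad commit, number of tests that were run).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                # one build-and-reboot cycle per test
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests


# Hypothetical history: c0 is good, c32 is bad, the breakage starts at c21.
history = [f"c{i}" for i in range(33)]
culprit, reboots = first_bad(history, lambda c: int(c[1:]) >= 21)
```

With these made-up numbers, 32 suspect commits are narrowed down to one in
5 tests. The real tool also copes with merges and with commits that cannot
be tested at all, which this sketch ignores.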
Linus
* Re: ACPI breakage (Re: 2.6.19-rc6: known regressions (v2))
2006-11-17 23:58 ` ACPI breakage (Re: 2.6.19-rc6: known regressions (v2)) Linus Torvalds
@ 2006-11-18 1:25 ` Linus Torvalds
0 siblings, 0 replies; 15+ messages in thread
From: Linus Torvalds @ 2006-11-18 1:25 UTC (permalink / raw)
To: Len Brown, Alexey Starikovskiy, Adrian Bunk, Andrew Morton
Cc: David Brownell, linux-acpi
On Fri, 17 Nov 2006, Linus Torvalds wrote:
>
> Total lockup - no sysrq, no messages, no nothing.
Dammit.
It looks like 37605a6900f6b4d886d995751fcfeef88c4e462c, and I should have
realized that immediately.
That commit re-introduces the bug that we already reverted once.
Why the hell did that idiotic thing go in, when we had to revert it once
already (see commit 72945b2b90a5554975b8f72673ab7139d232a121 for the
earlier revert)?
It was broken then, it is broken now. Nothing has changed.
Why did you guys try to sneak it in again? Last time this same "use a
second workqueue" patch went in (in a different form), we had _exactly_
the same problems, with total lockups, and way too high CPU usage.
The bugzilla entry that you refer to in that commit is even the same one
that discussed why the _original_ patch was totally broken.
It's even the same AUTHOR who wrote the original buggy patch, that pushed
through the same buggy patch AGAIN.
Dammit, this is frustrating.
Why did people expect it to suddenly not be buggy?
Linus