From: Ingo Molnar <mingo@elte.hu>
To: Linus Torvalds <torvalds@linux-foundation.org>,
	Yinghai Lu <yinghai@kernel.org>,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	"Maciej W. Rozycki" <macro@linux-mips.org>,
	"Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Frans Pop <elendil@planet.nl>,
	lenb@kernel.org, "Rafael J. Wysocki" <rjw@sisk.pl>,
	Greg KH <greg@kroah.com>,
	jbarnes@virtuousgeek.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	tiwai@suse.de, Andrew Morton <akpm@linux-foundation.org>
Subject: Re: "APIC error on CPU1: 00(40)" during resume (was: Regression from 2.6.26: Hibernation (possibly suspend) broken on Toshiba R500)
Date: Wed, 10 Dec 2008 18:33:43 +0100
Message-ID: <20081210173343.GA1120@elte.hu>
In-Reply-To: <alpine.LFD.2.00.0812100819570.3340@localhost.localdomain>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Ingo - who's the main apic person these days?

When it comes to blaming someone for bugs, it's me :-)

When it comes to code details, it's multiple people: Yinghai, Suresh,
Venki and Maciej, plus Thomas, Peter and me as x86 maintainers. I tried
to Cc: everyone.

> On Wed, 10 Dec 2008, Frans Pop wrote:
> 
> > On Wednesday 10 December 2008, Linus Torvalds wrote:
> > > On Wed, 10 Dec 2008, Frans Pop wrote:
> > > > > Anybody interested in pursuing this issue?
> > > >
> > > > > The third thing that worries me is the _very_ early occurrence of
> > > > >
> > > > > 	ACPI: Waking up from system sleep state S3
> > > > > 	APIC error on CPU1: 00(40)
> > > > > 	ACPI: EC: non-query interrupt received, switching to interrupt
> > > > > mode
> > >
> > > Well, the "too early" part is fixed with the PCI resume changes in
> > > -next, and googling for "APIC error on CPU1: 00(40)" shows that it's
> > > actually pretty common. Which is sad, but makes it somewhat less scary.
> > >
> > > The fact that it happens at resume for you (and not randomly) does
> > > imply that we perhaps don't have a wonderful APIC wakeup sequence and
> > > are doing something slightly wrong. But it likely isn't a big deal.
> > >
> > > Is that message new? If it is, maybe you can pinpoint roughly when it
> > > started happening, and we could try guess which change triggered it.
> > 
> > It's been there since 2.6.26.3, which was the first kernel I've run on 
> > this notebook.
> 
> Hmm. Our IO-APIC reprogramming looks pretty simple, and may well be 
> correct.
> 
> However, it looks like our _local_ APIC suspend/resume is a total piece 
> of sh*t.  It's set up as a "system device" and has a single 
> suspend/resume buffer, but the local APIC is a per-CPU thing. We even 
> have a comment there (written by yours truly back in 2003!) that says:
> 
>          * FIXME! This will be wrong if we ever support suspend on
>          * SMP! We'll need to do this as part of the CPU restore!
> 
> and back then suspend/resume on SMP was just a crazy notion, but now 
> it's obviously every-day reality.
> 
> So it looks like we don't reprogram the APIC -at-all- on secondary 
> CPU's.
> 
> What am I missing?

We do reset it. Otherwise local APIC timer IRQs would not be working, the
NMI watchdog wouldn't be working, and we wouldn't be able to do cross-CPU
IPIs or TLB flushes etc. - so a non-working lapic is a sure way to a
system lockup. But the resume/hotplug path is still a maze, agreed.

Regarding those APIC error messages:

> > > > >       ACPI: Waking up from system sleep state S3
> > > > >       APIC error on CPU1: 00(40)
> > > > >       ACPI: EC: non-query interrupt received, switching to interrupt

That does suggest that the APIC was re-enabled (we don't get any APIC
error exceptions otherwise!), and that its LVT was programmed as well,
but somehow we received an APIC message with an illegal vector.
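
For reference, the numbers in that message can be decoded mechanically. The
following is a minimal sketch, assuming the two hex values are the local
APIC Error Status Register (ESR) as read before and after the error handler
rewrites it, and assuming the ESR bit layout documented in the Intel SDM.
The helper names are hypothetical and are not kernel functions:

```c
#include <stdint.h>

/* ESR bit names, indexed by bit number (layout assumed from the Intel SDM). */
static const char *const esr_bit_names[8] = {
	"send checksum error",
	"receive checksum error",
	"send accept error",
	"receive accept error",
	"reserved / redirectable IPI",
	"send illegal vector",
	"received illegal vector",
	"illegal register address",
};

#define ESR_RECV_ILLEGAL_VECTOR (1u << 6)

/* Hypothetical helper: name of one ESR bit, or "unknown" if out of range. */
const char *esr_bit_name(unsigned int bit)
{
	return bit < 8 ? esr_bit_names[bit] : "unknown";
}

/* Hypothetical helper: nonzero if the ESR reports a received illegal vector. */
int esr_recv_illegal_vector(uint32_t esr)
{
	return (esr & ESR_RECV_ILLEGAL_VECTOR) != 0;
}
```

Under those assumptions, the "(40)" in "00(40)" has only bit 6 set, which
is the "received illegal vector" condition discussed here.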

Illegal APIC message vectors can have two sources in practice:

 1) the system bus being thermally unstable and corrupting APIC messages 
    that would randomly contain the wrong vector (zero for example).
    I had one (old) testbox that would do this. (Maybe other hw 
    conditions can animate the southbridge to do this to us too.)

 2) _another CPU_ sending an IPI with an illegal vector field. An APIC 
    vector is 'illegal' if it is below 16 (architecturally protected 
    exception entries), or if it points to an IDT entry that is not
    present.

#2 would in this case mean that another CPU has set a target vector
smaller than 16, since we don't have any IDT entry that is explicitly
non-existent (we have a dummy entry mapped for everything). That seems
unlikely - but we could stick a WARN_ON_ONCE() into the IPI send methods
to catch this.
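
A sketch of what such a check could look like, written as a userspace
stand-in rather than real kernel code. The function name, the counter and
the printed message are all hypothetical. In the kernel the body would
reduce to a WARN_ON_ONCE() on the same condition inside the IPI send
routines. Treating vectors above 255 as illegal is an extra assumption of
this sketch:

```c
#include <stdio.h>

/* Vectors 0..15 are treated as illegal by the local APIC. */
#define FIRST_LEGAL_APIC_VECTOR 16
#define LAST_APIC_VECTOR        255

/* Number of warnings actually printed; stays at 0 or 1. */
int ipi_warnings_printed;

/*
 * Hypothetical check, standing in for a WARN_ON_ONCE() in an IPI send
 * method. Returns 1 if the vector is illegal and 0 otherwise, and prints
 * a warning only for the first illegal vector it sees.
 */
int ipi_vector_is_illegal(int vector)
{
	int bad = vector < FIRST_LEGAL_APIC_VECTOR || vector > LAST_APIC_VECTOR;

	if (bad && !ipi_warnings_printed) {
		ipi_warnings_printed = 1;
		fprintf(stderr, "WARNING: IPI with illegal vector %d\n", vector);
	}
	return bad;
}
```

A send path would call this before writing the interrupt command register
and could refuse to send when it returns nonzero.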

[ sidenote: as weird as it might seem it is valid IPI use to trigger
  architectural exception vectors between 16 and 31. ]

#1 seems like too easy a way out, blaming the hw for it :-/ By all means
this has the appearance of kernel-induced damage to me, and it seems to
occur when we fiddle with the hw and are in a sensitive path of
resume-wakeup. I'd blame the kernel for 9 out of 10 bugs that trigger at
this stage.

Still, I have no idea how this APIC message could be kernel-inflicted -
even assuming buggy resume-time lapic setup. The lapic timer cannot
inject 'bad' vectors into itself AFAIK. It's pretty hard to do it even
intentionally from another CPU, and when we do it we kill the whole
system by flooding it with bad IPIs.

Maybe we have a window during setup where one of the LVT entries has zero
in the vector field but is still enabled, and the hw condition (lapic
timer tick) that triggers the IRQ injection happens - but the lapic
cannot do it due to the zero vector? I do not see such a window in the
APIC re-setup codepath though.
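
To make the hypothesized window concrete, here is a minimal sketch of a
predicate that flags an LVT register value as risky: unmasked, fixed
delivery mode, and a vector below 16. The bit positions (vector in bits
0-7, delivery mode in bits 8-10, mask in bit 16) are assumed from the
Intel SDM, and the function name is hypothetical:

```c
#include <stdint.h>

#define LVT_VECTOR_MASK     0x000000ffu /* bits 0-7: vector */
#define LVT_DELIVERY_MASK   0x00000700u /* bits 8-10: delivery mode */
#define LVT_DELIVERY_FIXED  0x00000000u /* fixed mode uses the vector field */
#define LVT_MASKED          0x00010000u /* bit 16: entry is masked */

/*
 * Hypothetical helper: nonzero if this LVT value could deliver an
 * interrupt with an illegal vector (0..15) when its source fires.
 */
int lvt_entry_is_risky(uint32_t lvt)
{
	if (lvt & LVT_MASKED)
		return 0;	/* masked entries deliver nothing */
	if ((lvt & LVT_DELIVERY_MASK) != LVT_DELIVERY_FIXED)
		return 0;	/* NMI, SMI etc.: vector not used for delivery */
	return (lvt & LVT_VECTOR_MASK) < 16;
}
```

A debugging patch could run such a check on every LVT write in the resume
path. An all-zero value (unmasked, vector 0) is the case described above.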

Maybe it's the reschedule or TLB flush IPI from another CPU somehow
hitting the lapic at the wrong moment?

Dunno. Does anyone else on the Cc: list have another theory that matches 
up with some detail of the code?

	Ingo
