[patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L

kernel-testers.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
@ 2008-12-19 23:15 Dmitry Adamushko
  2008-12-19 23:30 ` Ingo Molnar
  2008-12-20 13:19 ` Andreas Mohr
  0 siblings, 2 replies; 7+ messages in thread
From: Dmitry Adamushko @ 2008-12-19 23:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Ingo Molnar, Andreas Mohr, Linux Kernel Mailing List,
	Kernel Testers List


Hi,


This is in response to the following bug report:

Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12100
Subject         : resume (S2R) broken by Intel microcode module, on A110L
Submitter       : Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>
Date            : 2008-11-25 08:48 (19 days old)
Handled-By      : Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>


---

From: Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Subject: [microcode] fix a deadlock in microcode_resume_cpu()

[ The deadlock scenario has been discovered by Andreas Mohr ]

I think I might have a logical explanation why the system 
(http://bugzilla.kernel.org/show_bug.cgi?id=12100)
might hang upon resuming, OTOH it should have likely hanged each and every time.

(1) possible deadlock in microcode_resume_cpu() if either 'if' section is
taken;

(2) now, I don't see it in spec. and can't experimentally verify it (newer
ucodes don't seem to be available for my Core2duo)... but logically-wise, I'd
think that when read upon resuming, the 'microcode revision' (MSR 0x8B) should
be back to its original one (we need to reload ucode anyway so it doesn't seem
logical if a cpu doesn't drop the version)... if so, the comparison with
memcmp() for the full 'struct cpu_signature' is wrong... and that's how one of
the aforementioned 'if' sections might have been triggered - leading to a
deadlock.

Obviously, in my tests I simulated loading/resuming with the ucode of the same
version (just to see that the file is loaded/re-loaded upon resuming) so this
issue has never popped up.

I'd appreciate if someone with an appropriate system might give a try to the
2nd patch (titled "fix a comparison && deadlock...").


In any case, the deadlock situation is a must-have fix.


Signed-off-by: Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
CC: Info Molnar <mingo-X9Un+BFzKDI@public.gmane.org>
CC: Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>


---

diff --git a/arch/x86/kernel/microcode_core.c b/arch/x86/kernel/microcode_core.c
index 82fb280..c4b5b24 100644
--- a/arch/x86/kernel/microcode_core.c
+++ b/arch/x86/kernel/microcode_core.c
@@ -272,13 +272,18 @@ static struct attribute_group mc_attr_group = {
 	.name = "microcode",
 };
 
-static void microcode_fini_cpu(int cpu)
+static void __microcode_fini_cpu(int cpu)
 {
 	struct ucode_cpu_info *uci = ucode_cpu_info + cpu;
 
-	mutex_lock(&microcode_mutex);
 	microcode_ops->microcode_fini_cpu(cpu);
 	uci->valid = 0;
+}
+
+static void microcode_fini_cpu(int cpu)
+{
+	mutex_lock(&microcode_mutex);
+	__microcode_fini_cpu(cpu);
 	mutex_unlock(&microcode_mutex);
 }
 
@@ -306,12 +311,16 @@ static int microcode_resume_cpu(int cpu)
 	 * to this cpu (a bit of paranoia):
 	 */
 	if (microcode_ops->collect_cpu_info(cpu, &nsig)) {
-		microcode_fini_cpu(cpu);
+		__microcode_fini_cpu(cpu);
+		printk(KERN_ERR "failed to collect_cpu_info for resuming cpu #%d\n",
+				cpu);
 		return -1;
 	}
 
-	if (memcmp(&nsig, &uci->cpu_sig, sizeof(nsig))) {
-		microcode_fini_cpu(cpu);
+	if ((nsig.sig != uci->cpu_sig.sig) || (nsig.pf != uci->cpu_sig.pf)) {
+		__microcode_fini_cpu(cpu);
+		printk(KERN_ERR "cached ucode doesn't match the resuming cpu #%d\n",
+				cpu);
 		/* Should we look for a new ucode here? */
 		return 1;
 	}
diff --git a/arch/x86/kernel/microcode_intel.c b/arch/x86/kernel/microcode_intel.c
index 622dc4a..a8e6279 100644
--- a/arch/x86/kernel/microcode_intel.c
+++ b/arch/x86/kernel/microcode_intel.c
@@ -155,6 +155,7 @@ static DEFINE_SPINLOCK(microcode_update_lock);
 static int collect_cpu_info(int cpu_num, struct cpu_signature *csig)
 {
 	struct cpuinfo_x86 *c = &cpu_data(cpu_num);
+	unsigned long flags;
 	unsigned int val[2];
 
 	memset(csig, 0, sizeof(*csig));
@@ -174,11 +175,16 @@ static int collect_cpu_info(int cpu_num, struct cpu_signature *csig)
 		csig->pf = 1 << ((val[1] >> 18) & 7);
 	}
 
+	/* serialize access to the physical write to MSR 0x79 */
+	spin_lock_irqsave(&microcode_update_lock, flags);
+
 	wrmsr(MSR_IA32_UCODE_REV, 0, 0);
 	/* see notes above for revision 1.07.  Apparent chip bug */
 	sync_core();
 	/* get the current revision from MSR 0x8B */
 	rdmsr(MSR_IA32_UCODE_REV, val[0], csig->rev);
+	spin_unlock_irqrestore(&microcode_update_lock, flags);
+
 	pr_debug("microcode: collect_cpu_info : sig=0x%x, pf=0x%x, rev=0x%x\n",
 			csig->sig, csig->pf, csig->rev);
 


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
  2008-12-19 23:15 [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L Dmitry Adamushko
@ 2008-12-19 23:30 ` Ingo Molnar
       [not found]   ` <20081219233006.GA17984-X9Un+BFzKDI@public.gmane.org>
  2008-12-20 13:19 ` Andreas Mohr
  1 sibling, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2008-12-19 23:30 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Rafael J. Wysocki, Andreas Mohr, Linux Kernel Mailing List,
	Kernel Testers List


* Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> Hi,
> 
> 
> This is in response to the following bug report:
> 
> Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12100
> Subject         : resume (S2R) broken by Intel microcode module, on A110L
> Submitter       : Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>
> Date            : 2008-11-25 08:48 (19 days old)
> Handled-By      : Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

applied to tip/x86/microcode, thanks Dmitry!

The fix looks right but somewhat intrusive in scope, so i'm a bit 
reluctant to push it towards .28 straight away - without having feedback 
in the bugzilla. If feedback is positive (the bug reported there goes away 
completely) we can cherry-pick it over into x86/urgent, ok? And in any 
case i've marked it as a -stable backport for .28.1.

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <20081219233006.GA17984-X9Un+BFzKDI@public.gmane.org>]

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
       [not found]   ` <20081219233006.GA17984-X9Un+BFzKDI@public.gmane.org>
@ 2009-01-11 14:56     ` Ingo Molnar
       [not found]       ` <20090111145615.GA26173-X9Un+BFzKDI@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Ingo Molnar @ 2009-01-11 14:56 UTC (permalink / raw)
  To: Dmitry Adamushko, andeas.herrmann3-5C7GfCeVMHo, Peter Zijlstra
  Cc: Rafael J. Wysocki, Andreas Mohr, Linux Kernel Mailing List,
	Kernel Testers List


* Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:

> * Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> > Hi,
> > 
> > 
> > This is in response to the following bug report:
> > 
> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12100
> > Subject         : resume (S2R) broken by Intel microcode module, on A110L
> > Submitter       : Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>
> > Date            : 2008-11-25 08:48 (19 days old)
> > Handled-By      : Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> 
> applied to tip/x86/microcode, thanks Dmitry!
> 
> The fix looks right but somewhat intrusive in scope, so i'm a bit 
> reluctant to push it towards .28 straight away - without having feedback 
> in the bugzilla. If feedback is positive (the bug reported there goes 
> away completely) we can cherry-pick it over into x86/urgent, ok? And in 
> any case i've marked it as a -stable backport for .28.1.

hm, -tip testing just found this microcode locking lockdep splat:

[   48.004158] SMP alternatives: switching to UP code
[   48.342853] CPU0 attaching NULL sched-domain.
[   48.344288] CPU1 attaching NULL sched-domain.
[   48.354696] CPU0 attaching NULL sched-domain.
[   48.361215] device: 'cpu1': device_unregister
[   48.364231] device: 'cpu1': device_create_release
[   48.368138] 
[   48.368139] =======================================================
[   48.372039] [ INFO: possible circular locking dependency detected ]
[   48.372039] 2.6.29-rc1-tip-00901-g9699183-dirty #15577
[   48.372039] -------------------------------------------------------
[   48.372039] S99local/3496 is trying to acquire lock:
[   48.372039]  (microcode_mutex){--..}, at: [<c0118489>] microcode_fini_cpu+0x17/0x2b
[   48.372039] 
[   48.372039] but task is already holding lock:
[   48.372039]  (&cpu_hotplug.lock){--..}, at: [<c012f508>] cpu_hotplug_begin+0x1f/0x47
[   48.372039] 
[   48.372039] which lock already depends on the new lock.
[   48.372039] 
[   48.372039] 
[   48.372039] the existing dependency chain (in reverse order) is:
[   48.372039] 
[   48.372039] -> #1 (&cpu_hotplug.lock){--..}:
[   48.372039]        [<c014d3f1>] validate_chain+0x8e9/0xb94
[   48.372039]        [<c014dd03>] __lock_acquire+0x667/0x6e1
[   48.372039]        [<c014ddda>] lock_acquire+0x5d/0x7a
[   48.372039]        [<c0a6fac3>] mutex_lock_nested+0xdc/0x170
[   48.372039]        [<c012f552>] get_online_cpus+0x22/0x34
[   48.372039]        [<c013ce08>] work_on_cpu+0x50/0x8a
[   48.372039]        [<c0118465>] microcode_init_cpu+0x25/0x32
[   48.372039]        [<c0118699>] mc_sysdev_add+0x91/0x9b
[   48.372039]        [<c04cbd09>] sysdev_driver_register+0x9b/0xea
[   48.372039]        [<c0fb68e0>] microcode_init+0x8a/0xe4
[   48.372039]        [<c010106a>] do_one_initcall+0x6a/0x16e
[   48.372039]        [<c0fa952d>] kernel_init+0x115/0x166
[   48.372039]        [<c0103737>] kernel_thread_helper+0x7/0x10
[   48.372039]        [<ffffffff>] 0xffffffff
[   48.372039] 
[   48.372039] -> #0 (microcode_mutex){--..}:
[   48.372039]        [<c014d0fc>] validate_chain+0x5f4/0xb94
[   48.372039]        [<c014dd03>] __lock_acquire+0x667/0x6e1
[   48.372039]        [<c014ddda>] lock_acquire+0x5d/0x7a
[   48.372039]        [<c0a6fac3>] mutex_lock_nested+0xdc/0x170
[   48.372039]        [<c0118489>] microcode_fini_cpu+0x17/0x2b
[   48.372039]        [<c0a6ce38>] mc_cpu_callback+0xed/0xfa
[   48.372039]        [<c0142a48>] notifier_call_chain+0x2b/0x4a
[   48.372039]        [<c0142a98>] __raw_notifier_call_chain+0x13/0x15
[   48.372039]        [<c0142aab>] raw_notifier_call_chain+0x11/0x13
[   48.372039]        [<c0a2925e>] _cpu_down+0x171/0x22a
[   48.372039]        [<c0a2935a>] cpu_down+0x43/0x68
[   48.372039]        [<c0a2a27f>] store_online+0x2a/0x5e
[   48.372039]        [<c04cba85>] sysdev_store+0x20/0x28
[   48.372039]        [<c01d07d0>] sysfs_write_file+0xbd/0xe8
[   48.372039]        [<c01912c2>] vfs_write+0x91/0x138
[   48.372039]        [<c01917c8>] sys_write+0x40/0x65
[   48.372039]        [<c0102e55>] sysenter_do_call+0x12/0x35
[   48.372039]        [<ffffffff>] 0xffffffff
[   48.372039] 
[   48.372039] other info that might help us debug this:
[   48.372039] 
[   48.372039] 3 locks held by S99local/3496:
[   48.372039]  #0:  (&buffer->mutex){--..}, at: [<c01d073d>] sysfs_write_file+0x2a/0xe8
[   48.372039]  #1:  (cpu_add_remove_lock){--..}, at: [<c012f4b5>] cpu_maps_update_begin+0x14/0x16
[   48.372039]  #2:  (&cpu_hotplug.lock){--..}, at: [<c012f508>] cpu_hotplug_begin+0x1f/0x47
[   48.372039] 
[   48.372039] stack backtrace:
[   48.372039] Pid: 3496, comm: S99local Not tainted 2.6.29-rc1-tip-00901-g9699183-dirty #15577
[   48.372039] Call Trace:
[   48.372039]  [<c014cafd>] print_circular_bug_tail+0xab/0xb6
[   48.372039]  [<c014d0fc>] validate_chain+0x5f4/0xb94
[   48.372039]  [<c0a710b4>] ? _spin_unlock_irqrestore+0x34/0x41
[   48.372039]  [<c014dd03>] __lock_acquire+0x667/0x6e1
[   48.372039]  [<c014c508>] ? trace_hardirqs_on_caller+0x120/0x15f
[   48.372039]  [<c014ddda>] lock_acquire+0x5d/0x7a
[   48.372039]  [<c0118489>] ? microcode_fini_cpu+0x17/0x2b
[   48.372039]  [<c0a6fac3>] mutex_lock_nested+0xdc/0x170
[   48.372039]  [<c0118489>] ? microcode_fini_cpu+0x17/0x2b
[   48.372039]  [<c0118489>] ? microcode_fini_cpu+0x17/0x2b
[   48.372039]  [<c0118489>] microcode_fini_cpu+0x17/0x2b
[   48.372039]  [<c0a6ce38>] mc_cpu_callback+0xed/0xfa
[   48.372039]  [<c0142a48>] notifier_call_chain+0x2b/0x4a
[   48.372039]  [<c0142a98>] __raw_notifier_call_chain+0x13/0x15
[   48.372039]  [<c0142aab>] raw_notifier_call_chain+0x11/0x13
[   48.372039]  [<c0a2925e>] _cpu_down+0x171/0x22a
[   48.372039]  [<c0a2935a>] cpu_down+0x43/0x68
[   48.372039]  [<c0a2a27f>] store_online+0x2a/0x5e
[   48.372039]  [<c0a2a255>] ? store_online+0x0/0x5e
[   48.372039]  [<c04cba85>] sysdev_store+0x20/0x28
[   48.372039]  [<c01d07d0>] sysfs_write_file+0xbd/0xe8
[   48.372039]  [<c01d0713>] ? sysfs_write_file+0x0/0xe8
[   48.372039]  [<c01912c2>] vfs_write+0x91/0x138
[   48.372039]  [<c01917c8>] sys_write+0x40/0x65
[   48.372039]  [<c0102e55>] sysenter_do_call+0x12/0x35
[   49.380693] device: 'cpu1': device_add
[   49.384346] lockdep: fixing up alternatives.
[   49.388142] SMP alternatives: switching to SMP code

config/full bootlog on request. Andreas, Dmitry, any ideas?

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <20090111145615.GA26173-X9Un+BFzKDI@public.gmane.org>]

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
       [not found]       ` <20090111145615.GA26173-X9Un+BFzKDI@public.gmane.org>
@ 2009-01-11 19:58         ` Dmitry Adamushko
       [not found]           ` <b647ffbd0901111158r467f3358wa50f6a6e92d8129f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Dmitry Adamushko @ 2009-01-11 19:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: andeas.herrmann3-5C7GfCeVMHo, Peter Zijlstra, Rafael J. Wysocki,
	Andreas Mohr, Linux Kernel Mailing List, Kernel Testers List

2009/1/11 Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>:
>
> * Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
>
>> * Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> > Hi,
>> >
>> >
>> > This is in response to the following bug report:
>> >
>> > Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12100
>> > Subject         : resume (S2R) broken by Intel microcode module, on A110L
>> > Submitter       : Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>
>> > Date            : 2008-11-25 08:48 (19 days old)
>> > Handled-By      : Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>
>> applied to tip/x86/microcode, thanks Dmitry!
>>
>> The fix looks right but somewhat intrusive in scope, so i'm a bit
>> reluctant to push it towards .28 straight away - without having feedback
>> in the bugzilla. If feedback is positive (the bug reported there goes
>> away completely) we can cherry-pick it over into x86/urgent, ok? And in
>> any case i've marked it as a -stable backport for .28.1.
>
> hm, -tip testing just found this microcode locking lockdep splat:
>
> [   48.004158] SMP alternatives: switching to UP code
> [   48.342853] CPU0 attaching NULL sched-domain.
> [   48.344288] CPU1 attaching NULL sched-domain.
> [   48.354696] CPU0 attaching NULL sched-domain.
> [   48.361215] device: 'cpu1': device_unregister
> [   48.364231] device: 'cpu1': device_create_release
> [   48.368138]
> [   48.368139] =======================================================
> [   48.372039] [ INFO: possible circular locking dependency detected ]
> [   48.372039] 2.6.29-rc1-tip-00901-g9699183-dirty #15577
> [   48.372039] -------------------------------------------------------
> [   48.372039] S99local/3496 is trying to acquire lock:
> [   48.372039]  (microcode_mutex){--..}, at: [<c0118489>] microcode_fini_cpu+0x17/0x2b
> [   48.372039]
> [   48.372039] but task is already holding lock:
> [   48.372039]  (&cpu_hotplug.lock){--..}, at: [<c012f508>] cpu_hotplug_begin+0x1f/0x47
> [   48.372039]
> [   48.372039] which lock already depends on the new lock.
> [   48.372039]
> [   48.372039]
> [   48.372039] the existing dependency chain (in reverse order) is:
> [   48.372039]
> [   48.372039] -> #1 (&cpu_hotplug.lock){--..}:
> [   48.372039]        [<c014d3f1>] validate_chain+0x8e9/0xb94
> [   48.372039]        [<c014dd03>] __lock_acquire+0x667/0x6e1
> [   48.372039]        [<c014ddda>] lock_acquire+0x5d/0x7a
> [   48.372039]        [<c0a6fac3>] mutex_lock_nested+0xdc/0x170
> [   48.372039]        [<c012f552>] get_online_cpus+0x22/0x34
> [   48.372039]        [<c013ce08>] work_on_cpu+0x50/0x8a
> [   48.372039]        [<c0118465>] microcode_init_cpu+0x25/0x32
> [   48.372039]        [<c0118699>] mc_sysdev_add+0x91/0x9b
> [   48.372039]        [<c04cbd09>] sysdev_driver_register+0x9b/0xea

I'll check more carefully... At the first glance, the presence of
work_on_cpu() looks strange.

My first idea was that it's used somewhere by request_firmware() but
even assuming some functions might have been inlined (and a call via a
function pointer is not shown either), I don't immediately see how we
might end up with microcode_init_cpu() -> ... -> work_on_cpu().

I've locked up all the use cases of work_on_cpu() in the current -tip
(about 20), and none of them seem to explain its appearance in the
trace. weird...


>
>        Ingo
>

-- 
Best regards,
Dmitry Adamushko

^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <b647ffbd0901111158r467f3358wa50f6a6e92d8129f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
       [not found]           ` <b647ffbd0901111158r467f3358wa50f6a6e92d8129f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-01-11 20:11             ` Mike Travis
  0 siblings, 0 replies; 7+ messages in thread
From: Mike Travis @ 2009-01-11 20:11 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Ingo Molnar, andeas.herrmann3-5C7GfCeVMHo, Peter Zijlstra,
	Rafael J. Wysocki, Andreas Mohr, Linux Kernel Mailing List,
	Kernel Testers List, Rusty Russell

Dmitry Adamushko wrote:
> 2009/1/11 Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>:
>> * Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org> wrote:
>>
>>> * Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> This is in response to the following bug report:
>>>>
>>>> Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12100
>>>> Subject         : resume (S2R) broken by Intel microcode module, on A110L
>>>> Submitter       : Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org>
>>>> Date            : 2008-11-25 08:48 (19 days old)
>>>> Handled-By      : Dmitry Adamushko <dmitry.adamushko-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
>>> applied to tip/x86/microcode, thanks Dmitry!
>>>
>>> The fix looks right but somewhat intrusive in scope, so i'm a bit
>>> reluctant to push it towards .28 straight away - without having feedback
>>> in the bugzilla. If feedback is positive (the bug reported there goes
>>> away completely) we can cherry-pick it over into x86/urgent, ok? And in
>>> any case i've marked it as a -stable backport for .28.1.
>> hm, -tip testing just found this microcode locking lockdep splat:
>>
>> [   48.004158] SMP alternatives: switching to UP code
>> [   48.342853] CPU0 attaching NULL sched-domain.
>> [   48.344288] CPU1 attaching NULL sched-domain.
>> [   48.354696] CPU0 attaching NULL sched-domain.
>> [   48.361215] device: 'cpu1': device_unregister
>> [   48.364231] device: 'cpu1': device_create_release
>> [   48.368138]
>> [   48.368139] =======================================================
>> [   48.372039] [ INFO: possible circular locking dependency detected ]
>> [   48.372039] 2.6.29-rc1-tip-00901-g9699183-dirty #15577
>> [   48.372039] -------------------------------------------------------
>> [   48.372039] S99local/3496 is trying to acquire lock:
>> [   48.372039]  (microcode_mutex){--..}, at: [<c0118489>] microcode_fini_cpu+0x17/0x2b
>> [   48.372039]
>> [   48.372039] but task is already holding lock:
>> [   48.372039]  (&cpu_hotplug.lock){--..}, at: [<c012f508>] cpu_hotplug_begin+0x1f/0x47
>> [   48.372039]
>> [   48.372039] which lock already depends on the new lock.
>> [   48.372039]
>> [   48.372039]
>> [   48.372039] the existing dependency chain (in reverse order) is:
>> [   48.372039]
>> [   48.372039] -> #1 (&cpu_hotplug.lock){--..}:
>> [   48.372039]        [<c014d3f1>] validate_chain+0x8e9/0xb94
>> [   48.372039]        [<c014dd03>] __lock_acquire+0x667/0x6e1
>> [   48.372039]        [<c014ddda>] lock_acquire+0x5d/0x7a
>> [   48.372039]        [<c0a6fac3>] mutex_lock_nested+0xdc/0x170
>> [   48.372039]        [<c012f552>] get_online_cpus+0x22/0x34
>> [   48.372039]        [<c013ce08>] work_on_cpu+0x50/0x8a
>> [   48.372039]        [<c0118465>] microcode_init_cpu+0x25/0x32
>> [   48.372039]        [<c0118699>] mc_sysdev_add+0x91/0x9b
>> [   48.372039]        [<c04cbd09>] sysdev_driver_register+0x9b/0xea
> 
> I'll check more carefully... At the first glance, the presence of
> work_on_cpu() looks strange.
> 
> My first idea was that it's used somewhere by request_firmware() but
> even assuming some functions might have been inlined (and a call via a
> function pointer is not shown either), I don't immediately see how we
> might end up with microcode_init_cpu() -> ... -> work_on_cpu().

It was in a commit that (it appears) Ingo has reverted:

Subject: x86: cleanup remaining cpumask_t code in microcode_core.c

Impact: Reduce problem with changing current->cpus_allowed mask directly.

Use "work_on_cpu" to replace instances where set_cpus_allowed_ptr was being used.

Signed-off-by: Mike Travis <travis-sJ/iWh9BUns@public.gmane.org>


This work_on_cpu is to replace setting current->cpus_allowed when it's
only for one cpu.  But it has a call to get_online_cpus() that (I believe)
is just to keep from offlining the cpu the work function is running on.
And it's also the cause of the circular lock dependencies.

Thanks,
Mike


> 
> I've locked up all the use cases of work_on_cpu() in the current -tip
> (about 20), and none of them seem to explain its appearance in the
> trace. weird...
> 
> 
>>        Ingo
>>
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
  2008-12-19 23:15 [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L Dmitry Adamushko
  2008-12-19 23:30 ` Ingo Molnar
@ 2008-12-20 13:19 ` Andreas Mohr
       [not found]   ` <20081220131946.GA31366-p/qQFhXj4MHA4IYVXhSI5GHfThorsUsI@public.gmane.org>
  1 sibling, 1 reply; 7+ messages in thread
From: Andreas Mohr @ 2008-12-20 13:19 UTC (permalink / raw)
  To: Dmitry Adamushko
  Cc: Rafael J. Wysocki, Ingo Molnar, Andreas Mohr,
	Linux Kernel Mailing List, Kernel Testers List

Hi,

On Sat, Dec 20, 2008 at 12:15:24AM +0100, Dmitry Adamushko wrote:
> I'd appreciate if someone with an appropriate system might give a try to the
> 2nd patch (titled "fix a comparison && deadlock...").
> 
> 
> In any case, the deadlock situation is a must-have fix.

YES, that actually fixed it!!

Built -rc9, booted 3 times and suspended, always deadlock upon resume.
Patched -rc9, make modules_install, suspended 3 times, always fine.

Weird, I thought I had attempted all (most?) of the parts that your patch
references before...

Thanks a lot for your very responsive handling!

I suppose this can get to -urgent now, JIT for the Christmas
release?

Andreas Mohr

^ permalink raw reply	[flat|nested] 7+ messages in thread

[parent not found: <20081220131946.GA31366-p/qQFhXj4MHA4IYVXhSI5GHfThorsUsI@public.gmane.org>]

* Re: [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L
       [not found]   ` <20081220131946.GA31366-p/qQFhXj4MHA4IYVXhSI5GHfThorsUsI@public.gmane.org>
@ 2008-12-20 13:30     ` Ingo Molnar
  0 siblings, 0 replies; 7+ messages in thread
From: Ingo Molnar @ 2008-12-20 13:30 UTC (permalink / raw)
  To: Andreas Mohr
  Cc: Dmitry Adamushko, Rafael J. Wysocki, Linux Kernel Mailing List,
	Kernel Testers List


* Andreas Mohr <andi-5+Cda9B46AM@public.gmane.org> wrote:

> Hi,
> 
> On Sat, Dec 20, 2008 at 12:15:24AM +0100, Dmitry Adamushko wrote:
> > I'd appreciate if someone with an appropriate system might give a try to the
> > 2nd patch (titled "fix a comparison && deadlock...").
> > 
> > 
> > In any case, the deadlock situation is a must-have fix.
> 
> YES, that actually fixed it!!
> 
> Built -rc9, booted 3 times and suspended, always deadlock upon resume.
> Patched -rc9, make modules_install, suspended 3 times, always fine.

great!

> Weird, I thought I had attempted all (most?) of the parts that your 
> patch references before...
> 
> Thanks a lot for your very responsive handling!
> 
> I suppose this can get to -urgent now, JIT for the Christmas release?

yeah, i've now moved it over to tip/x86/urgent, thanks Andreas!

	Ingo

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-01-11 20:11 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-19 23:15 [patch] Re: [Bug #12100] resume (S2R) broken by Intel microcode module, on A110L Dmitry Adamushko
2008-12-19 23:30 ` Ingo Molnar
     [not found]   ` <20081219233006.GA17984-X9Un+BFzKDI@public.gmane.org>
2009-01-11 14:56     ` Ingo Molnar
     [not found]       ` <20090111145615.GA26173-X9Un+BFzKDI@public.gmane.org>
2009-01-11 19:58         ` Dmitry Adamushko
     [not found]           ` <b647ffbd0901111158r467f3358wa50f6a6e92d8129f-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-01-11 20:11             ` Mike Travis
2008-12-20 13:19 ` Andreas Mohr
     [not found]   ` <20081220131946.GA31366-p/qQFhXj4MHA4IYVXhSI5GHfThorsUsI@public.gmane.org>
2008-12-20 13:30     ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).