* odd GPF bug on resume from hibernate.
From: Dave Jones @ 2013-02-20 19:28 UTC
To: x86; +Cc: Linux Kernel
We had two users report hitting a bug that looks like this..
general protection fault: 8800 [#1] SMP
PM: Restoring platform NVS memory
Modules linked in: fuse ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp kvm_intel snd_seq_device kvm arc4 microcode snd_pcm rtl8192se serio_raw rtlwifi mac80211 i2c_i801 intel_ips jme lpc_ich mfd_core cfg80211 mii jmb38x_ms snd_page_alloc memstick snd_timer rfkill snd mei soundcore uinput hid_logitech_dj i915 crc32c_intel i2c_algo_bit drm_kms_helper sdhci_pci drm sdhci mmc_core i2c_core wmi video
CPU 2
Pid: 0, comm: swapper/2 Not tainted 3.7.5-201.fc18.x86_64 #1 System76, Inc. Lemur UltraThin /Lemur UltraThin
RIP: 0010:[<ffffffff81038335>] [<ffffffff81038335>] read_apic_id+0x5/0x30
RSP: 0018:ffff880131ea3ee8 EFLAGS: 00010046
RAX: 0000000000000020 RBX: ffff880131ea2010 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
RBP: ffff880131ea3ef8 R08: ffff880131ea2000 R09: ffff880131ea3ed0
R10: ffff880131ea3ed4 R11: 0000000000000000 R12: 0000000000000020
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff880137d00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f2f06237750 CR3: 0000000001c0b000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper/2 (pid: 0, threadinfo ffff880131ea2000, task ffff880131e9ae40)
Stack:
ffff880131ea3fd8 ffffffff81cdc5d0 ffff880131ea3f28 ffffffff8101d508
ffff880131ea3f18 b9aafd27b651a356 0000000000000000 0000000000000000
ffff880131ea3f48 ffffffff816248dd 0000000000000000 e0e608b14cc4dccd
Call Trace:
[<ffffffff8101d508>] cpu_idle+0x108/0x120
[<ffffffff816248dd>] start_secondary+0x23e/0x240
Code: 25 40 3b 01 00 3c 03 76 07 0f 09 66 66 90 66 90 f4 eb fd 41 bc ff ff ff ff eb b1 0f 1f 00 0f 1f 84 00 00 00 00 00 48 8b 05 41 31 <ca> 00 55 bf 20 00 00 00 48 89 e5 ff 90 20 01 00 00 89 c7 48 8b
RIP [<ffffffff81038335>] read_apic_id+0x5/0x30
RSP <ffff880131ea3ee8>
That Code: line translates to..
0: ca 00 55 lret $0x5500
At this point I don't know where to begin debugging..
Is that 8800 error code a clue ?
Dave
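Both of the loose ends above can be poked at mechanically. The sketch below is plain Python, not kernel code: it pulls the trapping byte (the one wrapped in <>) out of the Code: line, and splits the 8800 error code into the architectural EXT/IDT/TI/index fields on the assumption that the selector-style error-code layout applies here (which, for a GPF from a plain memory reference, it may well not).

```python
# Sketch: decode the oops "Code:" line and the GPF error code.
# Assumptions: the byte wrapped in <> marks the faulting RIP, and the
# error code follows the x86 selector-error-code layout (EXT/IDT/TI/index).

code_line = ("25 40 3b 01 00 3c 03 76 07 0f 09 66 66 90 66 90 f4 eb fd 41 bc "
             "ff ff ff ff eb b1 0f 1f 00 0f 1f 84 00 00 00 00 00 48 8b 05 41 "
             "31 <ca> 00 55 bf 20 00 00 00 48 89 e5 ff 90 20 01 00 00 89 c7 48 8b")

tokens = code_line.split()
trap_index = next(i for i, t in enumerate(tokens) if t.startswith("<"))
trap_byte = int(tokens[trap_index].strip("<>"), 16)
print(f"faulting byte at offset {trap_index}: 0x{trap_byte:02x}")

# Architectural layout of a selector-related fault error code:
# bit 0 = EXT, bit 1 = IDT, bit 2 = TI (LDT/GDT), bits 3..15 = selector index.
err = 0x8800
ext, idt, ti = err & 1, (err >> 1) & 1, (err >> 2) & 1
index = err >> 3
print(f"error code 0x{err:04x}: EXT={ext} IDT={idt} TI={ti} index=0x{index:x}")
```

If 8800 really were a selector error code, the index field would come out as 0x1100, which does not look like any plausible descriptor slot; that mismatch may itself be the clue, i.e. the error code could be garbage rather than a genuine selector reference. That is speculation, not an answer.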
* Re: odd GPF bug on resume from hibernate.
From: Rafael J. Wysocki @ 2013-02-20 19:42 UTC
To: Dave Jones; +Cc: x86, Linux Kernel
On Wednesday, February 20, 2013 02:28:26 PM Dave Jones wrote:
> We had two users report hitting a bug that looks like this..
>
> general protection fault: 8800 [#1] SMP
> PM: Restoring platform NVS memory
> Modules linked in: fuse ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp kvm_intel snd_seq_device kvm arc4 microcode snd_pcm rtl8192se serio_raw rtlwifi mac80211 i2c_i801 intel_ips jme lpc_ich mfd_core cfg80211 mii jmb38x_ms snd_page_alloc memstick snd_timer rfkill snd mei soundcore uinput hid_logitech_dj i915 crc32c_intel i2c_algo_bit drm_kms_helper sdhci_pci drm sdhci mmc_core i2c_core wmi video
> CPU 2
> Pid: 0, comm: swapper/2 Not tainted 3.7.5-201.fc18.x86_64 #1 System76, Inc. Lemur UltraThin /Lemur UltraThin
> RIP: 0010:[<ffffffff81038335>] [<ffffffff81038335>] read_apic_id+0x5/0x30
> RSP: 0018:ffff880131ea3ee8 EFLAGS: 00010046
> RAX: 0000000000000020 RBX: ffff880131ea2010 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000001
> RBP: ffff880131ea3ef8 R08: ffff880131ea2000 R09: ffff880131ea3ed0
> R10: ffff880131ea3ed4 R11: 0000000000000000 R12: 0000000000000020
> R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> FS: 0000000000000000(0000) GS:ffff880137d00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00007f2f06237750 CR3: 0000000001c0b000 CR4: 00000000000007e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Process swapper/2 (pid: 0, threadinfo ffff880131ea2000, task ffff880131e9ae40)
> Stack:
> ffff880131ea3fd8 ffffffff81cdc5d0 ffff880131ea3f28 ffffffff8101d508
> ffff880131ea3f18 b9aafd27b651a356 0000000000000000 0000000000000000
> ffff880131ea3f48 ffffffff816248dd 0000000000000000 e0e608b14cc4dccd
> Call Trace:
> [<ffffffff8101d508>] cpu_idle+0x108/0x120
> [<ffffffff816248dd>] start_secondary+0x23e/0x240
> Code: 25 40 3b 01 00 3c 03 76 07 0f 09 66 66 90 66 90 f4 eb fd 41 bc ff ff ff ff eb b1 0f 1f 00 0f 1f 84 00 00 00 00 00 48 8b 05 41 31 <ca> 00 55 bf 20 00 00 00 48 89 e5 ff 90 20 01 00 00 89 c7 48 8b
> RIP [<ffffffff81038335>] read_apic_id+0x5/0x30
> RSP <ffff880131ea3ee8>
>
> That Code: line translates to..
>
> 0: ca 00 55 lret $0x5500
>
> At this point I don't know where to begin debugging..
>
> Is that 8800 error code a clue ?
Does CPU offline/online work on this machine?
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
* Re: odd GPF bug on resume from hibernate.
From: Dave Jones @ 2013-02-20 20:13 UTC
To: Rafael J. Wysocki; +Cc: x86, Linux Kernel
On Wed, Feb 20, 2013 at 08:42:46PM +0100, Rafael J. Wysocki wrote:
> On Wednesday, February 20, 2013 02:28:26 PM Dave Jones wrote:
> > We had two users report hitting a bug that looks like this..
> >
> > general protection fault: 8800 [#1] SMP
> >
> > 0: ca 00 55 lret $0x5500
> >
> > At this point I don't know where to begin debugging..
> >
> > Is that 8800 error code a clue ?
>
> Does CPU offline/online work on this machine?
I just asked the user to give that a try at https://bugzilla.redhat.com/show_bug.cgi?id=910162
Incidentally, I found that offlining a cpu in Linus' current tree
causes a mess..
numa_remove_cpu cpu 1 node 0: mask now 0,2-3
smpboot: CPU 1 is now offline
BUG: using smp_processor_id() in preemptible [00000000] code: bash/5976
caller is cmci_rediscover+0x6b/0xe0
Pid: 5976, comm: bash Not tainted 3.8.0-rc7+ #63
Call Trace:
[<ffffffff812fb901>] debug_smp_processor_id+0xe1/0x100
[<ffffffff8101e4bb>] cmci_rediscover+0x6b/0xe0
[<ffffffff8158f55f>] mce_cpu_callback+0x1af/0x1c3
[<ffffffff815a6893>] notifier_call_chain+0x53/0xa0
[<ffffffff8107338e>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810491e0>] __cpu_notify+0x20/0x40
[<ffffffff81049215>] cpu_notify+0x15/0x20
[<ffffffff8104939e>] cpu_notify_nofail+0xe/0x20
[<ffffffff81588512>] _cpu_down+0x242/0x2b0
[<ffffffff815885b6>] cpu_down+0x36/0x50
[<ffffffff8158b4bd>] store_online+0x5d/0xe0
[<ffffffff813c01b8>] dev_attr_store+0x18/0x30
[<ffffffff81200ac0>] sysfs_write_file+0xe0/0x150
[<ffffffff81191dcf>] vfs_write+0xaf/0x190
[<ffffffff81192125>] sys_write+0x55/0xa0
[<ffffffff815aa9d9>] system_call_fastpath+0x16/0x1b
[ identical splat repeated twice more, trimmed ]
BUG: using smp_processor_id() in preemptible [00000000] code: bash/5976
caller is cmci_discover+0x25/0x260
Pid: 5976, comm: bash Not tainted 3.8.0-rc7+ #63
Call Trace:
[<ffffffff812fb901>] debug_smp_processor_id+0xe1/0x100
[<ffffffff8101de25>] cmci_discover+0x25/0x260
[<ffffffff81005575>] ? show_trace+0x15/0x20
[<ffffffff815950c4>] ? dump_stack+0x77/0x80
[<ffffffff8102444b>] ? lapic_get_maxlvt+0x1b/0x30
[<ffffffff8101e112>] cmci_rediscover_work_func+0x22/0x30
[<ffffffff8101e527>] cmci_rediscover+0xd7/0xe0
[<ffffffff8158f55f>] mce_cpu_callback+0x1af/0x1c3
[<ffffffff815a6893>] notifier_call_chain+0x53/0xa0
[<ffffffff8107338e>] __raw_notifier_call_chain+0xe/0x10
[<ffffffff810491e0>] __cpu_notify+0x20/0x40
[<ffffffff81049215>] cpu_notify+0x15/0x20
[<ffffffff8104939e>] cpu_notify_nofail+0xe/0x20
[<ffffffff81588512>] _cpu_down+0x242/0x2b0
[<ffffffff815885b6>] cpu_down+0x36/0x50
[<ffffffff8158b4bd>] store_online+0x5d/0xe0
[<ffffffff813c01b8>] dev_attr_store+0x18/0x30
[<ffffffff81200ac0>] sysfs_write_file+0xe0/0x150
[<ffffffff81191dcf>] vfs_write+0xaf/0x190
[<ffffffff81192125>] sys_write+0x55/0xa0
[<ffffffff815aa9d9>] system_call_fastpath+0x16/0x1b
* Re: odd GPF bug on resume from hibernate.
From: Thomas Gleixner @ 2013-02-20 20:46 UTC
To: Dave Jones; +Cc: Rafael J. Wysocki, x86, Linux Kernel, Tang Chen, Tony Luck
On Wed, 20 Feb 2013, Dave Jones wrote:
> On Wed, Feb 20, 2013 at 08:42:46PM +0100, Rafael J. Wysocki wrote:
> > On Wednesday, February 20, 2013 02:28:26 PM Dave Jones wrote:
> > > We had two users report hitting a bug that looks like this..
> > >
> > > general protection fault: 8800 [#1] SMP
> > >
> > > 0: ca 00 55 lret $0x5500
> > >
> > > At this point I don't know where to begin debugging..
> > >
> > > Is that 8800 error code a clue ?
> >
> > Does CPU offline/online work on this machine?
>
> I just asked the user to give that a try at https://bugzilla.redhat.com/show_bug.cgi?id=910162
>
> Incidentally, I found that offlining a cpu in Linus' current tree
> causes a mess..
>
>
> numa_remove_cpu cpu 1 node 0: mask now 0,2-3
> smpboot: CPU 1 is now offline
> BUG: using smp_processor_id() in preemptible [00000000] code: bash/5976
> caller is cmci_rediscover+0x6b/0xe0
> Pid: 5976, comm: bash Not tainted 3.8.0-rc7+ #63
> Call Trace:
> [<ffffffff812fb901>] debug_smp_processor_id+0xe1/0x100
> [<ffffffff8101e4bb>] cmci_rediscover+0x6b/0xe0
> [<ffffffff8158f55f>] mce_cpu_callback+0x1af/0x1c3
> [<ffffffff815a6893>] notifier_call_chain+0x53/0xa0
> [<ffffffff8107338e>] __raw_notifier_call_chain+0xe/0x10
> [<ffffffff810491e0>] __cpu_notify+0x20/0x40
> [<ffffffff81049215>] cpu_notify+0x15/0x20
> [<ffffffff8104939e>] cpu_notify_nofail+0xe/0x20
> [<ffffffff81588512>] _cpu_down+0x242/0x2b0
> [<ffffffff815885b6>] cpu_down+0x36/0x50
That's caused by: commit 85b97637bb40a9f486459dd254598759af9c3d50
x86/mce: Do not change worker's running cpu in cmci_rediscover().
mce_cpu_callback() does:
	if (action == CPU_POST_DEAD) {
		/* intentionally ignoring frozen here */
		cmci_rediscover(cpu);
	}
This is called from preemptible context.
Now cmci_rediscover() grew the following addon:
+	if (cpu == smp_processor_id()) {
+		cmci_rediscover_work_func(NULL);
+		continue;
Which causes the above splat. It seems testing with full debugging is
overrated.
Find the fix below, though it's debatable whether that "optimization"
of calling the function directly is worth the trouble.
Thanks,
tglx
Index: linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ linux-2.6/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -311,10 +311,12 @@ void cmci_rediscover(int dying)
 		if (cpu == dying)
 			continue;
-		if (cpu == smp_processor_id()) {
+		if (cpu == get_cpu()) {
 			cmci_rediscover_work_func(NULL);
+			put_cpu();
 			continue;
 		}
+		put_cpu();
 		work_on_cpu(cpu, cmci_rediscover_work_func, NULL);
 	}
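The shape of the loop being patched can be modeled in userspace. The sketch below is plain Python, illustrative only (the function, the `log` list, and the CPU numbers are invented for the example): it shows which iteration runs the work function directly instead of punting it to another CPU. That is exactly the spot where, in the kernel, the "which CPU am I on?" answer must be sampled with preemption disabled, which is what the get_cpu()/put_cpu() pairing in the fix provides and what the bare smp_processor_id() call did not.

```python
# Userspace model of the cmci_rediscover() loop shape -- illustrative only.
# In this model current_cpu is a fixed parameter, so it trivially cannot
# change mid-loop; in the kernel that stability has to be bought by
# disabling preemption (get_cpu()) around the comparison and the local call.

def rediscover(dying, current_cpu, online_cpus, log):
    """Run the rediscover work on every online CPU except the dying one."""
    for cpu in online_cpus:
        if cpu == dying:
            continue
        if cpu == current_cpu:          # kernel: if (cpu == get_cpu())
            log.append(("local", cpu))  # run the work func directly here
            continue                    # kernel: put_cpu(), then continue
        log.append(("remote", cpu))     # kernel: work_on_cpu(cpu, ...)

log = []
rediscover(dying=1, current_cpu=2, online_cpus=[0, 1, 2, 3], log=log)
print(log)
```

The model also makes tglx's closing point visible: the "local" fast path saves one work_on_cpu() round trip per invocation, which is a thin gain for the extra preemption bookkeeping it forces on the loop.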