linuxppc-dev.lists.ozlabs.org archive mirror
* 3.13 Oops on ppc64_cpu --smt=off
@ 2013-11-30 17:45 Alexander Graf
  2013-12-02  4:01 ` Preeti U Murthy
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Graf @ 2013-11-30 17:45 UTC (permalink / raw)
  To: Ben Herrenschmidt; +Cc: Paul Mackerras, linuxppc-dev

Hi Ben,

With current linus master (3.13-rc2+) I'm facing an interesting issue
with SMT disabling on p7. When I trigger the cpu offlining it works as
expected, but after a few seconds the machine goes into an oops as you
can see below.
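
As a rough illustration of what the offlining step covers, here is a minimal sketch. It assumes 4 hardware threads per core (inferred from the log below, where only CPUs that are not multiples of 4 are disabled) and assumes the tool drives the standard CPU hotplug sysfs files. The helper names are hypothetical and are not taken from ppc64_cpu itself.

```c
#include <stdio.h>

/* Assumption: 4 hardware threads per core (SMT4), inferred from the log. */
#define THREADS_PER_CORE 4

/* A CPU is a secondary thread when it is not the first thread of its core. */
static int is_secondary_thread(int cpu)
{
	return cpu % THREADS_PER_CORE != 0;
}

/* Build the sysfs hotplug control path for a CPU (standard Linux layout). */
static int cpu_online_path(int cpu, char *buf, size_t len)
{
	return snprintf(buf, len, "/sys/devices/system/cpu/cpu%d/online", cpu);
}

/* Count the CPUs that would be taken offline out of nr_cpus. */
static int count_offlined(int nr_cpus)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < nr_cpus; cpu++)
		if (is_secondary_thread(cpu))
			n++;
	return n;
}
```

With 64 CPUs this counts 48 secondary threads, which matches the 48 "disabling virtualization" lines in the log. CPU 56, where the oops is reported, is a primary thread under this assumption and therefore stays online.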

It looks like a null pointer dereference.


Alex

($ ppc64_cpu --smt=off)
kvm: disabling virtualization on CPU1
kvm: disabling virtualization on CPU2
kvm: disabling virtualization on CPU3
kvm: disabling virtualization on CPU5
kvm: disabling virtualization on CPU6
kvm: disabling virtualization on CPU7
kvm: disabling virtualization on CPU9
kvm: disabling virtualization on CPU10
kvm: disabling virtualization on CPU11
kvm: disabling virtualization on CPU13
kvm: disabling virtualization on CPU14
kvm: disabling virtualization on CPU15
kvm: disabling virtualization on CPU17
kvm: disabling virtualization on CPU18
kvm: disabling virtualization on CPU19
kvm: disabling virtualization on CPU21
kvm: disabling virtualization on CPU22
kvm: disabling virtualization on CPU23
kvm: disabling virtualization on CPU25
kvm: disabling virtualization on CPU26
kvm: disabling virtualization on CPU27
kvm: disabling virtualization on CPU29
kvm: disabling virtualization on CPU30
kvm: disabling virtualization on CPU31
kvm: disabling virtualization on CPU33
kvm: disabling virtualization on CPU34
kvm: disabling virtualization on CPU35
kvm: disabling virtualization on CPU37
kvm: disabling virtualization on CPU38
kvm: disabling virtualization on CPU39
kvm: disabling virtualization on CPU41
kvm: disabling virtualization on CPU42
kvm: disabling virtualization on CPU43
kvm: disabling virtualization on CPU45
kvm: disabling virtualization on CPU46
kvm: disabling virtualization on CPU47
kvm: disabling virtualization on CPU49
kvm: disabling virtualization on CPU50
kvm: disabling virtualization on CPU51
kvm: disabling virtualization on CPU53
kvm: disabling virtualization on CPU54
kvm: disabling virtualization on CPU55
kvm: disabling virtualization on CPU57
kvm: disabling virtualization on CPU58
kvm: disabling virtualization on CPU59
kvm: disabling virtualization on CPU61
kvm: disabling virtualization on CPU62
kvm: disabling virtualization on CPU63
Unable to handle kernel paging request for data at address 0x00000010
Faulting instruction address: 0xc000000000124188
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=1024 NUMA PowerNV
Modules linked in: iptable_filter ip_tables x_tables nfsv3 nfs_acl nfs fscache lockd sunrpc autofs4 binfmt_misc af_packet fuse loop dm_mod ohci_pci ohci_hcd ehci_pci ehci_hcd e1000e usbcore sr_mod cdrom ses enclosure rtc_generic usb_common ptp sg pps_core sd_mod crc_t10dif crct10dif_common scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh_rdac scsi_dh virtio_pci virtio_console virtio_blk virtio virtio_ring ipr libata scsi_mod
CPU: 56 PID: 0 Comm: swapper/56 Not tainted 3.13.0-rc2-0.g01695c8-default+ #1
task: c0000007f28b5180 ti: c0000007f28c8000 task.ti: c0000007f28c8000
NIP: c000000000124188 LR: c000000000124144 CTR: c00000000011e650
REGS: c0000007f28cb1e0 TRAP: 0300   Not tainted  (3.13.0-rc2-0.g01695c8-default+)
MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI>  CR: 24000028  XER: 00000000
CFAR: c00000000000908c DAR: 0000000000000010 DSISR: 40000000 SOFTE: 0
GPR00: 00000000ef4546c9 c0000007f28cb460 c0000000013c7690 0000000000000000
GPR04: 0000000000000038 0000000000000010 c000000003314ea0 c000000000c72878
GPR08: c000000000c83448 c0000007ef454600 0000000002690000 0000000000000000
GPR12: 000000000000c345 c00000000ff0e000 c0000007f28cb8b0 0000000000000001
GPR16: 7fffffffffffffff c0000007f28cb8c0 0000000002690000 000000219729878b
GPR20: 0000000000000000 c000000000c72698 c0000000033027d0 c00000000142ca58
GPR24: c000000000c84e80 c000000003314e80 c00000000142ca58 00000000ffffc32c
GPR28: 0000000000000038 c0000007f28b5180 c0000000012f8cd0 c000000001422180
NIP [c000000000124188] .trigger_load_balance+0xc8/0x2e0
LR [c000000000124144] .trigger_load_balance+0x84/0x2e0
Call Trace:
[c0000007f28cb460] [c000000000124134] .trigger_load_balance+0x74/0x2e0 (unreliable)
[c0000007f28cb510] [c00000000011ca50] .scheduler_tick+0x100/0x160
[c0000007f28cb5d0] [c0000000000e9074] .update_process_times+0x64/0x90
[c0000007f28cb660] [c0000000001628f4] .tick_sched_handle+0x34/0xc0
[c0000007f28cb6f0] [c000000000162c60] .tick_sched_timer+0x70/0xc0
[c0000007f28cb790] [c000000000109000] .__run_hrtimer+0x180/0x280
[c0000007f28cb840] [c000000000109738] .hrtimer_interrupt+0x158/0x340
[c0000007f28cb960] [c00000000001ec74] .timer_interrupt+0x174/0x2d0
[c0000007f28cba10] [c000000000002824] decrementer_common+0x124/0x180
--- Exception: 901 at .arch_local_irq_restore+0x84/0xa0
    LR = .arch_local_irq_restore+0x84/0xa0
[c0000007f28cbd00] [c000000000010c34] .arch_local_irq_restore+0x54/0xa0 (unreliable)
[c0000007f28cbd70] [c0000000000174f8] .arch_cpu_idle+0xc8/0x170
[c0000007f28cbe00] [c00000000014597c] .cpu_idle_loop+0x9c/0x2c0
[c0000007f28cbed0] [c00000000003f800] .start_secondary+0x2a0/0x2d0
[c0000007f28cbf90] [c0000000000097fc] .start_secondary_prolog+0x10/0x14
Instruction dump:
78001f24 e8fe8040 7d7a002a 7ce93b78 7d29582a 2fa90000 419e0030 8009004c
2f800000 419e0024 9069004c e9690010 <e92b0010> 3929001c 7c004828 30000001
---[ end trace 5d5f06c369432fa1 ]---

Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 100 seconds..
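
The "null pointer dereference" reading can be checked against the dump itself. The sketch below decodes the faulting word `<e92b0010>` from the instruction dump as a Power ISA DS-form instruction and recomputes its effective address from the register dump. The struct and function names are hypothetical helpers for illustration. Which structure field lives at offset 16 is not something the dump tells us, so the sketch makes no claim about it.

```c
#include <stdint.h>

/* Fields of a Power ISA DS-form instruction such as ld (primary opcode 58). */
struct ds_form {
	unsigned opcode;	/* primary opcode, top 6 bits */
	unsigned rt;		/* target register */
	unsigned ra;		/* base register */
	int disp;		/* byte displacement (DS field << 2), sign-extended */
	unsigned xo;		/* low two bits; 0 selects ld */
};

/* Split a 32-bit instruction word into its DS-form fields. */
static struct ds_form decode_ds(uint32_t insn)
{
	struct ds_form f;

	f.opcode = insn >> 26;
	f.rt = (insn >> 21) & 0x1f;
	f.ra = (insn >> 16) & 0x1f;
	f.disp = (int16_t)(insn & 0xfffc);
	f.xo = insn & 0x3;
	return f;
}

/* Effective address of the load: (RA) + disp, where RA = 0 means literal 0. */
static uint64_t effective_address(const struct ds_form *f,
				  const uint64_t gpr[32])
{
	uint64_t base = f->ra ? gpr[f->ra] : 0;

	return base + (uint64_t)(int64_t)f->disp;
}
```

Decoding 0xe92b0010 this way gives opcode 58 with RT=9, RA=11 and a displacement of 16, i.e. `ld r9,16(r11)`. GPR11 is 0 in the register dump, so the effective address is 0x10, which matches both the DAR value and the reported fault address. The preceding word, 0xe9690010, decodes as `ld r11,16(r9)`, so the zero in r11 was itself just loaded from memory.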


* Re: 3.13 Oops on ppc64_cpu --smt=off
  2013-11-30 17:45 3.13 Oops on ppc64_cpu --smt=off Alexander Graf
@ 2013-12-02  4:01 ` Preeti U Murthy
  2013-12-02  9:57   ` Alexander Graf
  0 siblings, 1 reply; 4+ messages in thread
From: Preeti U Murthy @ 2013-12-02  4:01 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev

Hi,

On 11/30/2013 11:15 PM, Alexander Graf wrote:
> Hi Ben,
> 
> With current linus master (3.13-rc2+) I'm facing an interesting issue with
> SMT disabling on p7. When I trigger the cpu offlining it works as expected,
> but after a few seconds the machine goes into an oops as you can see below.
> 
> It looks like a null pointer dereference.

tip/sched/urgent has the fix below. Can you please apply it and check
whether the issue is resolved? A similar issue was reported earlier, and it
pointed to commit 37dc6b50cee9. I believe the problem you report is a
regression caused by the same commit.

Thanks

Regards
Preeti U Murthy

---
commit 42eb088ed246a5a817bb45a8b32fe234cf1c0f8b
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Tue Nov 19 16:41:49 2013 +0100

    sched: Avoid NULL dereference on sd_busy
    
    Commit 37dc6b50cee9 ("sched: Remove unnecessary iteration over sched
    domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
    conditions leading to a possible NULL deref in set_cpu_sd_state_idle().
    
    Reported-by: Anton Blanchard <anton@samba.org>
    Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
    Signed-off-by: Peter Zijlstra <peterz@infradead.org>
    Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c1808606ee5f..a1591ca7eb5a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
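@@ -4902,6 +4902,7 @@ DEFINE_PER_CPU(struct sched_domain *, sd_asym);
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain *sd;
+	struct sched_domain *busy_sd = NULL;
 	int id = cpu;
 	int size = 1;
 
@@ -4909,8 +4910,9 @@ static void update_top_cache_domain(int cpu)
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
 		size = cpumask_weight(sched_domain_span(sd));
-		rcu_assign_pointer(per_cpu(sd_busy, cpu), sd->parent);
+		busy_sd = sd->parent; /* sd_busy */
 	}
+	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);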

 	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
 	per_cpu(sd_llc_size, cpu) = size;
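
To make the commit message concrete, here is a small userspace model of the two behaviours. It is only a sketch: the structs are hypothetical, heavily simplified stand-ins for the kernel's per-CPU data, and plain assignments replace rcu_assign_pointer(). It shows why a pointer that is written only inside the `if (sd)` branch can go stale, and why writing it unconditionally through a separate local clears it without disturbing sd_llc.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct sched_domain. */
struct sched_domain {
	struct sched_domain *parent;
};

/* Stand-ins for one CPU's per-CPU sd_llc and sd_busy pointers. */
struct cpu_cache_domains {
	struct sched_domain *sd_llc;
	struct sched_domain *sd_busy;
};

/* Pre-fix behaviour: sd_busy is only written when an LLC domain exists. */
static void update_before_fix(struct cpu_cache_domains *c,
			      struct sched_domain *sd)
{
	if (sd)
		c->sd_busy = sd->parent;
	c->sd_llc = sd;
}

/* Post-fix behaviour: sd_busy is always written, NULL without an LLC domain. */
static void update_after_fix(struct cpu_cache_domains *c,
			     struct sched_domain *sd)
{
	struct sched_domain *busy_sd = NULL;

	if (sd)
		busy_sd = sd->parent;
	c->sd_busy = busy_sd;
	c->sd_llc = sd;
}
```

In this model, calling update_before_fix() with a valid domain and then with NULL leaves sd_busy pointing at the old parent, while update_after_fix() with NULL resets it. Keeping a separate local for the parent also means sd_llc still receives the LLC domain itself rather than its parent.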




* Re: 3.13 Oops on ppc64_cpu --smt=off
  2013-12-02  4:01 ` Preeti U Murthy
@ 2013-12-02  9:57   ` Alexander Graf
  2013-12-02 11:20     ` Preeti U Murthy
  0 siblings, 1 reply; 4+ messages in thread
From: Alexander Graf @ 2013-12-02  9:57 UTC (permalink / raw)
  To: Preeti U Murthy; +Cc: Paul Mackerras, linuxppc-dev


On 02.12.2013, at 05:01, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:

> Hi,
> 
> On 11/30/2013 11:15 PM, Alexander Graf wrote:
>> Hi Ben,
>> 
>> With current linus master (3.13-rc2+) I'm facing an interesting issue with
>> SMT disabling on p7. When I trigger the cpu offlining it works as expected,
>> but after a few seconds the machine goes into an oops as you can see below.
>> 
>> It looks like a null pointer dereference.
> 
> tip/sched/urgent has the below fix. Can you please apply the following it and
> check if the issue gets resolved?  A similar issue was reported earlier as

I've disabled NO_HZ now on that machine which also "fixed" it for me.
Unfortunately I can't reboot that box for at least the next week now to
test whether the patch does fix the issue.


Alex


* Re: 3.13 Oops on ppc64_cpu --smt=off
  2013-12-02  9:57   ` Alexander Graf
@ 2013-12-02 11:20     ` Preeti U Murthy
  0 siblings, 0 replies; 4+ messages in thread
From: Preeti U Murthy @ 2013-12-02 11:20 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Paul Mackerras, linuxppc-dev

Hi,

On 12/02/2013 03:27 PM, Alexander Graf wrote:
> 
> On 02.12.2013, at 05:01, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> 
>> Hi,
>>
>> On 11/30/2013 11:15 PM, Alexander Graf wrote:
>>> Hi Ben,
>>>
>>> With current linus master (3.13-rc2+) I'm facing an interesting issue with
>>> SMT disabling on p7. When I trigger the cpu offlining it works as expected,
>>> but after a few seconds the machine goes into an oops as you can see below.
>>>
>>> It looks like a null pointer dereference.
>>
>> tip/sched/urgent has the below fix. Can you please apply the following it and
>> check if the issue gets resolved?  A similar issue was reported earlier as
> 
> I've disabled NO_HZ now on that machine which also "fixed" it for me. Unfortunately I can't reboot that box for at least the next week now to test whether the patch does fix the issue.

Commit 37dc6b50cee9, which caused this regression, touches the NO_HZ code:
it changes how the scheduler decides when to kick nohz idle balancing. That
would explain why disabling NO_HZ makes the oops go away.

Regards
Preeti U Murthy
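
As a rough sketch of the reader side of that decision: the tick path seen in the oops (scheduler_tick -> trigger_load_balance) consults the per-CPU sd_busy pointer and follows it down to a busy-CPU counter. The model below uses hypothetical, simplified structures and names, with no RCU or atomics, and is not the kernel's actual code. It only illustrates that the NULL check protects the reader when sd_busy has been cleared, and cannot help when the pointer is stale.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the scheduler structures. */
struct sgp_model {
	int nr_busy_cpus;
};

struct group_model {
	struct sgp_model *sgp;
};

struct domain_model {
	struct group_model *groups;
};

/*
 * Model of the "is a nohz kick needed?" check: returns 1 when more than
 * one CPU in the busy domain is busy, 0 otherwise. A NULL sd_busy means
 * there is nothing to inspect. A stale, non-NULL sd_busy would still be
 * followed, which is the failure mode the fix is meant to rule out.
 */
static int nohz_kick_needed_model(const struct domain_model *sd_busy)
{
	if (!sd_busy)
		return 0;
	return sd_busy->groups->sgp->nr_busy_cpus > 1;
}
```

In the real kernel this check is only built when NO_HZ support is configured, as far as I know, which fits the observation that turning NO_HZ off hides the problem without fixing the stale pointer.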
> 
> 
> Alex
> 

