* Crash/hang in iommu_flush_dev_iotlb
@ 2016-10-28 15:29 Brian Rak
From: Brian Rak @ 2016-10-28 15:29 UTC
To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
We've been seeing a pretty frequent crash/hang that seems to be pointing
at the Intel IOMMU code.
This manifests in one of two ways:
1) Kernel reports a BUG, then the system hangs
2) Kernel reports a BUG, then the kernel notices something else terrible
has occurred, and triggers a reboot.
Examples of "other" terrible things we've seen:
Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast
exception handler
Kernel panic - not syncing: Hard LOCKUP
Kernel panic - not syncing: Attempted to kill the idle task!
Kernel panic - not syncing: Fatal exception in interrupt
This is an example of a stack trace we're seeing:
2016-10-28 11:04:15 BUG: unable to handle kernel NULL pointer dereference
at 0000000000000304
2016-10-28 11:04:15 IP: [<ffffffff81476d51>]
iommu_flush_dev_iotlb+0x21/0xc0
2016-10-28 11:04:15 PGD 0
2016-10-28 11:04:15 Oops: 0000 [#1] SMP
2016-10-28 11:04:15 Modules linked in: vxlan udp_tunnel ip6_udp_tunnel
ip6t_rpfilter ipt_rpfilter ts_bm xt_string ip6table_mangle ebt_arp
ebtable_nat ebtables netconsole configfs sch_fq_codel vhost_net macvtap
macvlan vhost tun kvm_intel kvm irqbypass 8021q garp dummy xt_CHECKSUM
iptable_mangle ipt_REJECT nf_reject_ipv4 iptable_filter ip_tables
xt_comment ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables joydev
input_leds mlx4_ib ib_core mlx4_en mlx4_core ip_set nfnetlink bcache
iTCO_wdt iTCO_vendor_support pcspkr ixgbe mdio sg i2c_i801 lpc_ich
shpchp xhci_pci xhci_hcd ioatdma igb dca ptp pps_core fjes ipmi_devintf
ipmi_si ipmi_msghandler acpi_power_meter hwmon ext4 mbcache jbd2 raid1
sd_mod ahci libahci wmi ast ttm dm_mirror dm_region_hash dm_log dm_mod
2016-10-28 11:04:15 CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.7.2-1.el6.elrepo.x86_64 #1
2016-10-28 11:04:15 Hardware name: Supermicro
SYS-2U4NODES-03-CL011/X10DRT-P, BIOS 2.0 12/18/2015
2016-10-28 11:04:15 task: ffffffff81c0d540 ti: ffffffff81c00000
task.ti: ffffffff81c00000
2016-10-28 11:04:15 RIP: 0010:[<ffffffff81476d51>] [<ffffffff81476d51>]
iommu_flush_dev_iotlb+0x21/0xc0
2016-10-28 11:04:15 RSP: 0018:ffff881fff803cb8 EFLAGS: 00010086
2016-10-28 11:04:15 RAX: 0000000000000001 RBX: 0000000000000000 RCX:
ffff883ff2a05400
2016-10-28 11:04:15 RDX: 000000000000003f RSI: 0000000000001000 RDI:
0000000000000000
2016-10-28 11:04:15 RBP: ffff881fff803ce8 R08: 0000000000000010 R09:
0000000000000040
2016-10-28 11:04:15 R10: 0000000000000000 R11: 0000000200000025 R12:
ffff881fef301f48
2016-10-28 11:04:15 R13: 00000000000ff83a R14: 0000000000000000 R15:
ffff883feb4db500
2016-10-28 11:04:15 FS: 0000000000000000(0000)
GS:ffff881fff800000(0000) knlGS:0000000000000000
2016-10-28 11:04:15 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2016-10-28 11:04:15 CR2: 0000000000000304 CR3: 0000003efbaff000 CR4:
00000000003426f0
2016-10-28 11:04:15 Stack:
2016-10-28 11:04:15 ffff883feb4db500 0000000000000000 ffff881fef301f48 00000000000ff83a
2016-10-28 11:04:15 0000000000000000 ffff883feb4db500 ffff881fff803d48 ffffffff8147702c
2016-10-28 11:04:15 0000000000000001 ffff881fff813b00 00000001ff803d08 ffff883ff2a05400
2016-10-28 11:04:15 Call Trace:
2016-10-28 11:04:15 <IRQ>
2016-10-28 11:04:15 [<ffffffff8147702c>] flush_unmaps+0xac/0x190
2016-10-28 11:04:15 [<ffffffff81477148>] flush_unmaps_timeout+0x38/0x50
2016-10-28 11:04:15 [<ffffffff81477110>] ? flush_unmaps+0x190/0x190
2016-10-28 11:04:15 [<ffffffff810ee56a>] call_timer_fn+0x4a/0x160
2016-10-28 11:04:15 [<ffffffff8133d839>] ? timerqueue_add+0x59/0xb0
2016-10-28 11:04:15 [<ffffffff810ef67e>] run_timer_softirq+0x26e/0x300
2016-10-28 11:04:15 [<ffffffff81477110>] ? flush_unmaps+0x190/0x190
2016-10-28 11:04:15 [<ffffffff810edfec>] ?
get_next_timer_interrupt+0xcc/0x210
2016-10-28 11:04:15 [<ffffffff810dc69d>] ?
handle_irq_event_percpu+0xbd/0x200
2016-10-28 11:04:15 [<ffffffff810f799c>] ? ktime_get+0x4c/0xc0
2016-10-28 11:04:15 [<ffffffff81778b41>] __do_softirq+0xf1/0x2e4
2016-10-28 11:04:15 [<ffffffff810f10d8>] ? hrtimer_interrupt+0xb8/0x170
2016-10-28 11:04:15 [<ffffffff81085fb6>] irq_exit+0xa6/0xb0
2016-10-28 11:04:15 [<ffffffff81778936>]
smp_apic_timer_interrupt+0x46/0x60
2016-10-28 11:04:15 [<ffffffff81776c32>] apic_timer_interrupt+0x82/0x90
2016-10-28 11:04:15 <EOI>
2016-10-28 11:04:15 [<ffffffff810623b6>] ? native_safe_halt+0x6/0x10
2016-10-28 11:04:15 [<ffffffff8103921a>] default_idle+0x2a/0xf0
2016-10-28 11:04:15 [<ffffffff81037379>] ? sched_clock+0x9/0x10
2016-10-28 11:04:15 [<ffffffff810b1dd5>] ? sched_clock_cpu+0xb5/0xc0
2016-10-28 11:04:15 [<ffffffff81038adf>] arch_cpu_idle+0xf/0x20
2016-10-28 11:04:15 [<ffffffff810c4cde>] default_idle_call+0x2e/0x40
2016-10-28 11:04:15 [<ffffffff810c4e35>] cpuidle_idle_call+0xa5/0x120
2016-10-28 11:04:15 [<ffffffff810c5008>] cpu_idle_loop+0x158/0x240
2016-10-28 11:04:15 [<ffffffff81d7c117>] ?
early_idt_handler_array+0x117/0x120
2016-10-28 11:04:15 [<ffffffff810c510e>] ? cpu_startup_entry+0x1e/0x70
2016-10-28 11:04:15 [<ffffffff8145388b>] ? get_random_bytes+0x4b/0xb0
2016-10-28 11:04:15 [<ffffffff81d7c117>] ?
early_idt_handler_array+0x117/0x120
2016-10-28 11:04:15 [<ffffffff810c5157>] cpu_startup_entry+0x67/0x70
2016-10-28 11:04:15 [<ffffffff81769ad7>] rest_init+0x77/0x80
2016-10-28 11:04:15 [<ffffffff81d7d440>] start_kernel+0x3f3/0x3f5
2016-10-28 11:04:15 [<ffffffff81d7ce6f>] ? set_init_arg+0x5e/0x5e
2016-10-28 11:04:15 [<ffffffff81d7c398>]
x86_64_start_reservations+0x2f/0x31
2016-10-28 11:04:15 [<ffffffff81d7c6ef>]
x86_64_start_kernel+0x14d/0x15c
2016-10-28 11:04:15 Code: 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83
ec 30 48 89 5d d8 4c 89 65 e0 4c 89 6d
I should note that the '0000000000000304' fault address is fairly
consistent: so far it has always been either 304 or 8f8, across ~100 crashes.
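For anyone wanting to check that kind of consistency across a pile of captured logs, here is a rough sketch of how the fault addresses could be tallied. It assumes each oops contains a `CR2:` line in the same format as the trace above and that pages are 4 KiB; the function names are made up for illustration and are not part of any kernel tooling. Addresses below one page are what the kernel reports as a NULL pointer dereference, which usually points at a field being read through a NULL struct pointer.

```python
import re
from collections import Counter

# Matches the CR2 register line of an x86-64 oops, e.g.
# "CR2: 0000000000000304 CR3: ..."
CR2_RE = re.compile(r"\bCR2: ([0-9a-f]{16})\b")

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86-64


def tally_fault_addresses(log_text):
    """Count how often each CR2 (faulting address) value appears."""
    return Counter(int(m.group(1), 16) for m in CR2_RE.finditer(log_text))


def looks_like_null_offset(address):
    """True if the address lies in the first page, i.e. NULL plus a small offset."""
    return address < PAGE_SIZE


# Sample input copied from the trace in this message.
sample = """\
2016-10-28 11:04:15 CR2: 0000000000000304 CR3: 0000003efbaff000 CR4:
00000000003426f0
"""

counts = tally_fault_addresses(sample)
for address, n in counts.most_common():
    print(f"{address:#x}: {n} crash(es), "
          f"NULL+offset: {looks_like_null_offset(address)}")
```

Feeding it the concatenated logs from all affected machines would show whether any address other than 304 and 8f8 ever turns up.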
Unfortunately, we haven't been able to come up with good reproduction
steps. So far, we're mainly seeing the issue on machines with Intel
E5-2640 v4 CPUs (though it has occasionally happened on E5-2630 v3 CPUs).
We're seeing this on around 50 different machines. We've tried swapping
out the memory on a few of them, and the issue has persisted.
Any suggestions here?