* Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
@ 2014-07-14 2:25 Armin Zentai
2014-07-14 9:52 ` Wei Liu
2014-07-14 11:54 ` Zoltan Kiss
0 siblings, 2 replies; 14+ messages in thread
From: Armin Zentai @ 2014-07-14 2:25 UTC (permalink / raw)
To: xen-devel
Dear Xen Developers!
We're running Xen on multiple machines, most of them Dell R410 or SM
X8DTL, with one E5645 CPU and 48 GB of RAM. We updated the kernel to
3.15.4 after some of our hypervisors started rebooting at random times.
The logs were empty and we had no information about the crashes. We
tried some tricks, and in the end the netconsole kernel module helped,
giving us a very thin layer of remote kernel logging. We found the
following in the remote logs:
Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1] SMP
Jul 13 00:46:58 node11 [157060.106765] Modules linked in: netconsole configfs nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_physdev sch_tbf dm_snapshot dm_bufio arptable_filter arp_tables ip6t_REJECT ip6table_mangle ipt_REJECT iptable_filter ip_tables bridge xen_pciback xen_gntalloc autofs4 dm_round_robin scsi_dh_alua 8021q mrp garp stp llc bonding xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh xen_netback xen_blkback xen_gntdev xen_evtchn xenfs xen_privcmd ufs gpio_ich iTCO_wdt iTCO_vendor_support joydev psmouse serio_raw pcspkr tpm_infineon i2c_i801 lpc_ich hid_generic e1000e ptp pps_core snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore ioatdma dca mac_hid i7core_edac edac_core be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net vxlan ahci(E) libahci(E) usbhid(E) hid(E) [last unloaded: evbug]
Jul 13 00:46:58 node11 [157060.112705] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 3.15.4 #1
Jul 13 00:46:58 node11 [157060.112776] Hardware name: Supermicro X8DTL/X8DTL, BIOS 1.1b 03/19/2010
Jul 13 00:46:58 node11 [157060.112848] task: ffffffff81c1b480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
Jul 13 00:46:58 node11 [157060.112936] RIP: e030:[<ffffffffa025f61d>] [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
Jul 13 00:46:58 node11 [157060.113078] RSP: e02b:ffff88008ea03d48 EFLAGS: 00010292
Jul 13 00:46:58 node11 [157060.113147] RAX: 000000000000004a RBX: 000000000000000c RCX: 0000000000000000
Jul 13 00:46:58 node11 [157060.113234] RDX: ffff88008a40b600 RSI: ffff88008ea03a18 RDI: 000000000000021b
Jul 13 00:46:58 node11 [157060.113321] RBP: ffff88008ea03d88 R08: 0000000000000000 R09: ffff88008a40b600
Jul 13 00:46:58 node11 [157060.113408] R10: ffff88008a0004e8 R11: 00000000000006d8 R12: ffff8800569708c0
Jul 13 00:46:58 node11 [157060.113495] R13: ffff88006558fec0 R14: ffff8800569708c0 R15: 0000000000000001
Jul 13 00:46:58 node11 [157060.113589] FS: 00007f351684b700(0000) GS:ffff88008ea00000(0000) knlGS:0000000000000000
Jul 13 00:46:58 node11 [157060.113679] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jul 13 00:46:58 node11 [157060.113747] CR2: 00007fc2a4372000 CR3: 00000000049f3000 CR4: 0000000000002660
Jul 13 00:46:58 node11 [157060.113835] Stack:
Jul 13 00:46:58 node11 [157060.113896]  ffff880056979f90 ff00000000000001 ffff880b0605e000 0000000000000000
Jul 13 00:46:58 node11 [157060.114143]  ffff0000ffffffff 00000000fffffff6 0000000000000001 ffff8800569769d0
Jul 13 00:46:58 node11 [157060.114390]  ffff88008ea03e58 ffffffffa02622fc ffff88008ea03dd8 ffffffff810b5223
Jul 13 00:46:58 node11 [157060.114637] Call Trace:
Jul 13 00:46:58 node11 [157060.114700]  <IRQ>
Jul 13 00:46:58 node11 [157060.114750]  [<ffffffffa02622fc>] xenvif_tx_action+0x27c/0x7f0 [xen_netback]
Jul 13 00:46:58 node11 [157060.114927]  [<ffffffff810b5223>] ? __wake_up+0x53/0x70
Jul 13 00:46:58 node11 [157060.114998]  [<ffffffff810ca077>] ? handle_irq_event_percpu+0xa7/0x1b0
Jul 13 00:46:58 node11 [157060.115073]  [<ffffffffa02647d1>] xenvif_poll+0x31/0x64 [xen_netback]
Jul 13 00:46:58 node11 [157060.115147]  [<ffffffff81653d4b>] net_rx_action+0x10b/0x290
Jul 13 00:46:58 node11 [157060.115221]  [<ffffffff81071c73>] __do_softirq+0x103/0x320
Jul 13 00:46:58 node11 [157060.115292]  [<ffffffff81072015>] irq_exit+0x135/0x140
Jul 13 00:46:58 node11 [157060.115363]  [<ffffffff8144759c>] xen_evtchn_do_upcall+0x3c/0x50
Jul 13 00:46:58 node11 [157060.115436]  [<ffffffff8175c07e>] xen_do_hypervisor_callback+0x1e/0x30
Jul 13 00:46:58 node11 [157060.115506]  <EOI>
Jul 13 00:46:58 node11 [157060.115551]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jul 13 00:46:58 node11 [157060.115722]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
Jul 13 00:46:58 node11 [157060.115794]  [<ffffffff8100a200>] ? xen_safe_halt+0x10/0x20
Jul 13 00:46:58 node11 [157060.115869]  [<ffffffff8101dbbf>] ? default_idle+0x1f/0xc0
Jul 13 00:46:58 node11 [157060.115939]  [<ffffffff8101d38f>] ? arch_cpu_idle+0xf/0x20
Jul 13 00:46:58 node11 [157060.116009]  [<ffffffff810b5aa1>] ? cpu_startup_entry+0x201/0x360
Jul 13 00:46:58 node11 [157060.116084]  [<ffffffff817420a7>] ? rest_init+0x77/0x80
Jul 13 00:46:58 node11 [157060.116156]  [<ffffffff81d3a156>] ? start_kernel+0x406/0x413
Jul 13 00:46:58 node11 [157060.116227]  [<ffffffff81d39b6e>] ? repair_env_string+0x5b/0x5b
Jul 13 00:46:58 node11 [157060.116298]  [<ffffffff81d39603>] ? x86_64_start_reservations+0x2a/0x2c
Jul 13 00:46:58 node11 [157060.116373]  [<ffffffff81d3d5dc>] ? xen_start_kernel+0x584/0x586
Jul 13 00:46:58 node11 [157060.116446] Code: 41 5c c9 c3 48 be 00 00 00 80 ff 77 00 00 e9 62 ff ff ff 49 8b bc 24 78 ba 00 00 89 da 48 c7 c6 30 57 26 a0 31 c0 e8 73 df 3e e1 <0f> 0b eb fe 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48
Jul 13 00:46:58 node11 [157060.119179] RIP  [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
Jul 13 00:46:58 node11 [157060.119312] RSP <ffff88008ea03d48>
Jul 13 00:46:58 node11 [157060.119395] ---[ end trace 7e021c96c8cfea53 ]---
Jul 13 00:46:58 node11 [157060.119465] Kernel panic - not syncing: Fatal exception in interrupt
h14z4mzbvfrrhb is the name of a VIF. This VIF belongs to a Windows
Server 2008 R2 x64 virtual machine. We have had 6 random reboots so
far; all of the VIFs belonged to the same operating system, but to
different virtual machines. So only Windows Server 2008 R2 x64 systems'
virtual interfaces caused the crashes, and these systems were
provisioned from different installs or templates. The GPLPV driver
versions are also different.
xm info output:
[root@c2-node11 ~]# xm info
host : c2-node11
release : 3.15.4
version : #1 SMP Tue Jul 8 17:58:26 CEST 2014
machine : x86_64
nr_cpus : 12
nr_nodes : 1
cores_per_socket : 6
threads_per_core : 2
cpu_mhz : 2400
hw_caps :
bfebfbff:2c100800:00000000:00003f40:029ee3ff:00000000:00000001:00000000
virt_caps : hvm hvm_directio
total_memory : 49143
free_memory : 41830
free_cpus : 0
xen_major : 4
xen_minor : 2
xen_extra : .4-33.el6
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : unavailable
xen_commandline : dom0_mem=3145728 pcie_aspm=off noreboot=true
cc_compiler : gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-4)
cc_compile_by : mockbuild
cc_compile_domain : centos.org
cc_compile_date : Mon Jun 16 17:22:14 UTC 2014
xend_config_format : 4
[root@c2-node11 ~]# uname -a
Linux c2-node11 3.15.4 #1 SMP Tue Jul 8 17:58:26 CEST 2014 x86_64 x86_64
x86_64 GNU/Linux
The xm create config file of the specified VM (the other VMs' config
files are the same):
kernel = "/usr/lib/xen/boot/hvmloader"
device_model = "/usr/lib64/xen/bin/qemu-dm"
builder = "hvm"
memory = "2000"
name = "vna3mhwnv9pn4m"
vcpus = "1"
timer_mode = "2"
viridian = "1"
vif = [ "type=ioemu, mac=00:16:3e:64:c8:ba, bridge=x0evss6g1ztoa4,
ip=..., vifname=h14z4mzbvfrrhb, rate=100Mb/s" ]
disk = [ "phy:/dev/q7jiqc2gh02b2b/xz7wget4ycmp77,ioemu:hda,w" ]
vnc = 1
vncpasswd="aaaaa1"
usbdevice="tablet"
The HV's networking looks like the following:
We are using dual Emulex 10Gbit network adapters with bonding (LACP),
and on top of the bond we use VLANs for the VM, management and iSCSI
traffic.
We've tried to reproduce the error, but we couldn't; the crashes/reboots
happened at random times every time.
Thanks for your help,
- Armin Zentai
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 2:25 Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot Armin Zentai
@ 2014-07-14 9:52 ` Wei Liu
2014-07-14 10:53 ` Armin Zentai
2014-07-14 11:54 ` Zoltan Kiss
1 sibling, 1 reply; 14+ messages in thread
From: Wei Liu @ 2014-07-14 9:52 UTC (permalink / raw)
To: Armin Zentai; +Cc: wei.liu2, zoltan.kiss, xen-devel
Hello
On Mon, Jul 14, 2014 at 04:25:54AM +0200, Armin Zentai wrote:
> Dear Xen Developers!
>
>
> We're running Xen on multiple machines, most of them Dell R410 or SM
> X8DTL, with one E5645 CPU and 48 GB of RAM. We updated the kernel to
> 3.15.4 after some of our hypervisors started rebooting at random
> times.
>
> The logs were empty and we had no information about the crashes. We
> tried some tricks, and in the end the netconsole kernel module helped,
> giving us a very thin layer of remote kernel logging. We found the
> following in the remote logs:
It's good you've got netconsole working. I would still like to point out
that we have a wiki page on setting up serial console on Xen, which
might be helpful.
http://wiki.xen.org/wiki/Xen_Serial_Console
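For a minimal setup that boils down to something like this (illustrative
values; the exact parameters depend on your hardware, see the wiki page
for the details):
Xen command line:  com1=115200,8n1 console=com1,vga
dom0 command line: console=hvc0
and then capturing the output via a null-modem cable or your BMC's
serial-over-LAN.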
>
> Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
> Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
> Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
> Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1] SMP
[...]
> Jul 13 00:46:58 node11 [157060.112705] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 3.15.4 #1
> Jul 13 00:46:58 node11 [157060.112776] Hardware name: Supermicro X8DTL/X8DTL, BIOS 1.1b 03/19/2010
> Jul 13 00:46:58 node11 [157060.112848] task: ffffffff81c1b480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
> Jul 13 00:46:58 node11 [157060.112936] RIP: e030:[<ffffffffa025f61d>] [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
> Jul 13 00:46:58 node11 [157060.113078] RSP: e02b:ffff88008ea03d48 EFLAGS: 00010292
> Jul 13 00:46:58 node11 [157060.113147] RAX: 000000000000004a RBX: 000000000000000c RCX: 0000000000000000
> Jul 13 00:46:58 node11 [157060.113234] RDX: ffff88008a40b600 RSI: ffff88008ea03a18 RDI: 000000000000021b
> Jul 13 00:46:58 node11 [157060.113321] RBP: ffff88008ea03d88 R08: 0000000000000000 R09: ffff88008a40b600
> Jul 13 00:46:58 node11 [157060.113408] R10: ffff88008a0004e8 R11: 00000000000006d8 R12: ffff8800569708c0
> Jul 13 00:46:58 node11 [157060.113495] R13: ffff88006558fec0 R14: ffff8800569708c0 R15: 0000000000000001
> Jul 13 00:46:58 node11 [157060.113589] FS: 00007f351684b700(0000) GS:ffff88008ea00000(0000) knlGS:0000000000000000
> Jul 13 00:46:58 node11 [157060.113679] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jul 13 00:46:58 node11 [157060.113747] CR2: 00007fc2a4372000 CR3: 00000000049f3000 CR4: 0000000000002660
> Jul 13 00:46:58 node11 [157060.113835] Stack:
> Jul 13 00:46:58 node11 [157060.113896]  ffff880056979f90 ff00000000000001 ffff880b0605e000 0000000000000000
> Jul 13 00:46:58 node11 [157060.114143]  ffff0000ffffffff 00000000fffffff6 0000000000000001 ffff8800569769d0
> Jul 13 00:46:58 node11 [157060.114390]  ffff88008ea03e58 ffffffffa02622fc ffff88008ea03dd8 ffffffff810b5223
> Jul 13 00:46:58 node11 [157060.114637] Call Trace:
> Jul 13 00:46:58 node11 [157060.114700]  <IRQ>
> Jul 13 00:46:58 node11 [157060.114750]  [<ffffffffa02622fc>] xenvif_tx_action+0x27c/0x7f0 [xen_netback]
> Jul 13 00:46:58 node11 [157060.114927]  [<ffffffff810b5223>] ? __wake_up+0x53/0x70
> Jul 13 00:46:58 node11 [157060.114998]  [<ffffffff810ca077>] ? handle_irq_event_percpu+0xa7/0x1b0
> Jul 13 00:46:58 node11 [157060.115073]  [<ffffffffa02647d1>] xenvif_poll+0x31/0x64 [xen_netback]
> Jul 13 00:46:58 node11 [157060.115147]  [<ffffffff81653d4b>] net_rx_action+0x10b/0x290
> Jul 13 00:46:58 node11 [157060.115221]  [<ffffffff81071c73>] __do_softirq+0x103/0x320
> Jul 13 00:46:58 node11 [157060.115292]  [<ffffffff81072015>] irq_exit+0x135/0x140
> Jul 13 00:46:58 node11 [157060.115363]  [<ffffffff8144759c>] xen_evtchn_do_upcall+0x3c/0x50
> Jul 13 00:46:58 node11 [157060.115436]  [<ffffffff8175c07e>] xen_do_hypervisor_callback+0x1e/0x30
> Jul 13 00:46:58 node11 [157060.115506]  <EOI>
> Jul 13 00:46:58 node11 [157060.115551]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Jul 13 00:46:58 node11 [157060.115722]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Jul 13 00:46:58 node11 [157060.115794]  [<ffffffff8100a200>] ? xen_safe_halt+0x10/0x20
> Jul 13 00:46:58 node11 [157060.115869]  [<ffffffff8101dbbf>] ? default_idle+0x1f/0xc0
> Jul 13 00:46:58 node11 [157060.115939]  [<ffffffff8101d38f>] ? arch_cpu_idle+0xf/0x20
> Jul 13 00:46:58 node11 [157060.116009]  [<ffffffff810b5aa1>] ? cpu_startup_entry+0x201/0x360
> Jul 13 00:46:58 node11 [157060.116084]  [<ffffffff817420a7>] ? rest_init+0x77/0x80
> Jul 13 00:46:58 node11 [157060.116156]  [<ffffffff81d3a156>] ? start_kernel+0x406/0x413
> Jul 13 00:46:58 node11 [157060.116227]  [<ffffffff81d39b6e>] ? repair_env_string+0x5b/0x5b
> Jul 13 00:46:58 node11 [157060.116298]  [<ffffffff81d39603>] ? x86_64_start_reservations+0x2a/0x2c
> Jul 13 00:46:58 node11 [157060.116373]  [<ffffffff81d3d5dc>] ? xen_start_kernel+0x584/0x586
[...]
> Jul 13 00:46:58 node11 [157060.119179] RIP  [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
> Jul 13 00:46:58 node11 [157060.119312] RSP <ffff88008ea03d48>
> Jul 13 00:46:58 node11 [157060.119395] ---[ end trace 7e021c96c8cfea53 ]---
> Jul 13 00:46:58 node11 [157060.119465] Kernel panic - not syncing: Fatal exception in interrupt
>
>
> h14z4mzbvfrrhb is the name of a VIF. This VIF belongs to a Windows Server
> 2008 R2 x64 virtual machine. We have had 6 random reboots so far; all of
> the VIFs belonged to the same operating system, but to different virtual
> machines. So only Windows Server 2008 R2 x64 systems' virtual interfaces
> caused the crashes, and these systems were provisioned from different
> installs or templates. The GPLPV driver versions are also different.
>
Unfortunately I don't have Windows server 2008 R2. :-(
This bug is in the guest TX path. What's the workload of your guest? Is
there any pattern to its traffic?
I've checked the changesets between 3.15.4 and 3.16-rc5 and there's no
fix for this, so this is the first report of this issue. If there's a
reliable reproducer, that would be great.
Zoltan, have you seen this before? Can your work on pktgen help?
> [root@c2-node11 ~]# uname -a
> Linux c2-node11 3.15.4 #1 SMP Tue Jul 8 17:58:26 CEST 2014 x86_64 x86_64
> x86_64 GNU/Linux
>
>
> The xm create config file of the specified VM (the other VM's config files
> are the same):
>
> kernel = "/usr/lib/xen/boot/hvmloader"
> device_model = "/usr/lib64/xen/bin/qemu-dm"
> builder = "hvm"
> memory = "2000"
> name = "vna3mhwnv9pn4m"
> vcpus = "1"
>
> timer_mode = "2"
> viridian = "1"
>
> vif = [ "type=ioemu, mac=00:16:3e:64:c8:ba, bridge=x0evss6g1ztoa4, ip=...,
> vifname=h14z4mzbvfrrhb, rate=100Mb/s" ]
>
> disk = [ "phy:/dev/q7jiqc2gh02b2b/xz7wget4ycmp77,ioemu:hda,w" ]
> vnc = 1
> vncpasswd="aaaaa1"
> usbdevice="tablet"
>
>
> The HV's networking looks like the following:
> We are using dual Emulex 10Gbit network adapters with bonding (LACP), and
> on top of the bond we use VLANs for the VM, management and iSCSI
> traffic.
> We've tried to reproduce the error, but we couldn't; the crashes/reboots
> happened at random times every time.
>
In that case you will need to instrument netback to spit out more
information. Zoltan, is there any other information that you would like
to know?
Wei.
> Thanks, for your help,
>
> - Armin Zentai
>
>
>
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 9:52 ` Wei Liu
@ 2014-07-14 10:53 ` Armin Zentai
2014-07-14 11:15 ` Wei Liu
0 siblings, 1 reply; 14+ messages in thread
From: Armin Zentai @ 2014-07-14 10:53 UTC (permalink / raw)
To: Wei Liu; +Cc: zoltan.kiss, xen-devel
Hello!
On 14/07/14 11:52, Wei Liu wrote:
> Hello
>
> On Mon, Jul 14, 2014 at 04:25:54AM +0200, Armin Zentai wrote:
>> Dear Xen Developers!
>>
>>
>> We're running Xen on multiple machines, most of them Dell R410 or SM
>> X8DTL, with one E5645 CPU and 48 GB of RAM. We updated the kernel to
>> 3.15.4 after some of our hypervisors started rebooting at random
>> times.
>>
>> The logs were empty and we had no information about the crashes. We
>> tried some tricks, and in the end the netconsole kernel module helped,
>> giving us a very thin layer of remote kernel logging. We found the
>> following in the remote logs:
>
> It's good you've got netconsole working. I would still like to point out
> that we have a wiki page on setting up serial console on Xen, which
> might be helpful.
>
> http://wiki.xen.org/wiki/Xen_Serial_Console
>
We've set up the Xen serial console, but we wanted to avoid rebooting
the hypervisors unless necessary, so it will be activated on the next
reboot.
(We have set up a system that logs into every Dell iDRAC via telnet [we
have 18 nodes, so we cannot attach every machine via a physical serial
link], sets up SOL, and logs every output, but netconsole was a much
more painless solution for gathering the logs.)
>>
>> Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
>> Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
>> Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
>> Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1] SMP
>> [...]
>> Jul 13 00:46:58 node11 [157060.119395] ---[ end trace 7e021c96c8cfea53 ]---
>> Jul 13 00:46:58 node11 [157060.119465] Kernel panic - not syncing: Fatal exception in interrupt
>>
>>
>> h14z4mzbvfrrhb is the name of a VIF. This VIF belongs to a Windows Server
>> 2008 R2 x64 virtual machine. We have had 6 random reboots so far; all of
>> the VIFs belonged to the same operating system, but to different virtual
>> machines. So only Windows Server 2008 R2 x64 systems' virtual interfaces
>> caused the crashes, and these systems were provisioned from different
>> installs or templates. The GPLPV driver versions are also different.
>>
>
> Unfortunately I don't have Windows server 2008 R2. :-(
>
> This bug is in the guest TX path. What's the workload of your guest? Is
> there any pattern to its traffic?
There's no relevant pattern: some of them use one core at nearly 100%,
some of them had 1-2% CPU and 5-10 Mbps of networking and/or IO. I've
tried to test the CPU with cpuburn and prime95, and tried to stress the
network with SYN floods, IIS stress testing with Apache ab, and by
stressing the throughput and bandwidth, but these attempts did not
cause a reboot.
>
> I've checked the changesets between 3.15.4 and 3.16-rc5 and there's no
> fix for this, so this is the first report of this issue. If there's a
> reliable reproducer, that would be great.
>
> Zoltan, have you seen this before? Can your work on pktgen help?
>
>> [root@c2-node11 ~]# uname -a
>> Linux c2-node11 3.15.4 #1 SMP Tue Jul 8 17:58:26 CEST 2014 x86_64 x86_64
>> x86_64 GNU/Linux
>>
>>
>> The xm create config file of the specified VM (the other VM's config files
>> are the same):
>>
>> kernel = "/usr/lib/xen/boot/hvmloader"
>> device_model = "/usr/lib64/xen/bin/qemu-dm"
>> builder = "hvm"
>> memory = "2000"
>> name = "vna3mhwnv9pn4m"
>> vcpus = "1"
>>
>> timer_mode = "2"
>> viridian = "1"
>>
>> vif = [ "type=ioemu, mac=00:16:3e:64:c8:ba, bridge=x0evss6g1ztoa4, ip=...,
>> vifname=h14z4mzbvfrrhb, rate=100Mb/s" ]
>>
>> disk = [ "phy:/dev/q7jiqc2gh02b2b/xz7wget4ycmp77,ioemu:hda,w" ]
>> vnc = 1
>> vncpasswd="aaaaa1"
>> usbdevice="tablet"
>>
>>
>> The HV's networking looks like the following:
>> We are using dual Emulex 10Gbit network adapters with bonding (LACP), and
>> on top of the bond we use VLANs for the VM, management and iSCSI
>> traffic.
>> We've tried to reproduce the error, but we couldn't; the crashes/reboots
>> happened at random times every time.
>>
>
> In that case you will need to instrument netback to spit out more
> information. Zoltan, is there any other information that you would like
> to know?
>
> Wei.
>
>> Thanks, for your help,
>>
>> - Armin Zentai
>>
>>
>>
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 10:53 ` Armin Zentai
@ 2014-07-14 11:15 ` Wei Liu
0 siblings, 0 replies; 14+ messages in thread
From: Wei Liu @ 2014-07-14 11:15 UTC (permalink / raw)
To: Armin Zentai; +Cc: Wei Liu, zoltan.kiss, xen-devel
On Mon, Jul 14, 2014 at 12:53:49PM +0200, Armin Zentai wrote:
[...]
> >>h14z4mzbvfrrhb is the name of a VIF. This VIF belongs to a Windows Server
> >>2008 R2 x64 virtual machine. We have had 6 random reboots so far; all of
> >>the VIFs belonged to the same operating system, but to different virtual
> >>machines. So only Windows Server 2008 R2 x64 systems' virtual interfaces
> >>caused the crashes, and these systems were provisioned from different
> >>installs or templates. The GPLPV driver versions are also different.
> >>
> >
> >Unfortunately I don't have Windows server 2008 R2. :-(
> >
> >This bug is in the guest TX path. What's the workload of your guest? Is
> >there any pattern to its traffic?
>
> There's no relevant pattern: some of them use one core at nearly 100%,
> some of them had 1-2% CPU and 5-10 Mbps of networking and/or IO. I've
> tried to test the CPU with cpuburn and prime95, and tried to stress the
> network with SYN floods, IIS stress testing with Apache ab, and by
> stressing the throughput and bandwidth, but these attempts did not
> cause a reboot.
>
FYI I think the only thing that matters is the packets sent from the
guest; stressing the CPU won't help. I could be wrong, but the tests you
mentioned seem to be sending packets to the guest. That's the guest RX
path, which is not very relevant to this bug. I think the stress test
should run in the guest, i.e. sending packets from the guest to the
outside world. That might help reproduce this bug.
Wei.
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 2:25 Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot Armin Zentai
2014-07-14 9:52 ` Wei Liu
@ 2014-07-14 11:54 ` Zoltan Kiss
2014-07-14 12:07 ` Zoltan Kiss
2014-07-14 12:14 ` Armin Zentai
1 sibling, 2 replies; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-14 11:54 UTC (permalink / raw)
To: Armin Zentai, xen-devel
Hi,
Based on the stack trace I think the guest sends packets with more than
MAX_SKB_SLOTS slots, and one slot's grant mapping operation goes wrong.
Then the error handling screws up the internal bookkeeping of pending
slots and tries to unmap something that is not mapped. Unfortunately it
is quite complicated to emulate grant mapping failures in a predictable way.
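For reference, the check that fires is the sanity check in
xenvif_grant_handle_reset() (paraphrased from the 3.15 source from
memory, so treat it as a sketch; netback.c:998 is the BUG() below):

static inline void xenvif_grant_handle_reset(struct xenvif *vif,
					     u16 pending_idx)
{
	/* The slot we were asked to unmap has no valid grant handle,
	 * i.e. the pending-slot bookkeeping went wrong somewhere. */
	if (unlikely(vif->grant_tx_handle[pending_idx] ==
		     NETBACK_INVALID_HANDLE)) {
		netdev_err(vif->dev,
			   "Trying to unmap invalid handle! pending_idx: %x\n",
			   pending_idx);
		BUG();
	}
	vif->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
}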
Armin, if we provide some patches (hopefully with a solution, but at
least with some additional logging), would you be able to compile and
verify them?
Regards,
Zoltan Kiss
On 14/07/14 03:25, Armin Zentai wrote:
> Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
> Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
> Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
> Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1]
> Jul 13 00:46:58 node11 [157060.112705] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 3.15.4 #1
> Jul 13 00:46:58 node11 [157060.112776] Hardware name: Supermicro X8DTL/X8DTL, BIOS 1.1b 03/19/2010
> Jul 13 00:46:58 node11 [157060.112848] task: ffffffff81c1b480 ti: ffffffff81c00000 task.ti: ffffffff81c00000
> Jul 13 00:46:58 node11 [157060.112936] RIP: e030:[<ffffffffa025f61d>] [<ffffffffa025f61d>] xenvif_idx_unmap+0x11d/0x130 [xen_netback]
> Jul 13 00:46:58 node11 [157060.113078] RSP: e02b:ffff88008ea03d48 EFLAGS: 00010292
> Jul 13 00:46:58 node11 [157060.113147] RAX: 000000000000004a RBX: 000000000000000c RCX: 0000000000000000
> Jul 13 00:46:58 node11 [157060.113234] RDX: ffff88008a40b600 RSI: ffff88008ea03a18 RDI: 000000000000021b
> Jul 13 00:46:58 node11 [157060.113321] RBP: ffff88008ea03d88 R08: 0000000000000000 R09: ffff88008a40b600
> Jul 13 00:46:58 node11 [157060.113408] R10: ffff88008a0004e8 R11: 00000000000006d8 R12: ffff8800569708c0
> Jul 13 00:46:58 node11 [157060.113495] R13: ffff88006558fec0 R14: ffff8800569708c0 R15: 0000000000000001
> Jul 13 00:46:58 node11 [157060.113589] FS: 00007f351684b700(0000) GS:ffff88008ea00000(0000) knlGS:0000000000000000
> Jul 13 00:46:58 node11 [157060.113679] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jul 13 00:46:58 node11 [157060.113747] CR2: 00007fc2a4372000 CR3: 00000000049f3000 CR4: 0000000000002660
> Jul 13 00:46:58 node11 [157060.113835] Stack:
> Jul 13 00:46:58 node11 [157060.113896]  ffff880056979f90 ff00000000000001 ffff880b0605e000 0000000000000000
> Jul 13 00:46:58 node11 [157060.114143]  ffff0000ffffffff 00000000fffffff6 0000000000000001 ffff8800569769d0
> Jul 13 00:46:58 node11 [157060.114390]  ffff88008ea03e58 ffffffffa02622fc ffff88008ea03dd8 ffffffff810b5223
> Jul 13 00:46:58 node11 [157060.114637] Call Trace:
> Jul 13 00:46:58 node11 [157060.114700]  <IRQ>
> Jul 13 00:46:58 node11 [157060.114750]  [<ffffffffa02622fc>] xenvif_tx_action+0x27c/0x7f0 [xen_netback]
> Jul 13 00:46:58 node11 [157060.114927]  [<ffffffff810b5223>] ? __wake_up+0x53/0x70
> Jul 13 00:46:58 node11 [157060.114998]  [<ffffffff810ca077>] ? handle_irq_event_percpu+0xa7/0x1b0
> Jul 13 00:46:58 node11 [157060.115073]  [<ffffffffa02647d1>] xenvif_poll+0x31/0x64 [xen_netback]
> Jul 13 00:46:58 node11 [157060.115147]  [<ffffffff81653d4b>] net_rx_action+0x10b/0x290
> Jul 13 00:46:58 node11 [157060.115221]  [<ffffffff81071c73>] __do_softirq+0x103/0x320
> Jul 13 00:46:58 node11 [157060.115292]  [<ffffffff81072015>] irq_exit+0x135/0x140
> Jul 13 00:46:58 node11 [157060.115363]  [<ffffffff8144759c>] xen_evtchn_do_upcall+0x3c/0x50
> Jul 13 00:46:58 node11 [157060.115436]  [<ffffffff8175c07e>] xen_do_hypervisor_callback+0x1e/0x30
> Jul 13 00:46:58 node11 [157060.115506]  <EOI>
> Jul 13 00:46:58 node11 [157060.115551]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Jul 13 00:46:58 node11 [157060.115722]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
> Jul 13 00:46:58 node11 [157060.115794]  [<ffffffff8100a200>] ? xen_safe_halt+0x10/0x20
> Jul 13 00:46:58 node11 [157060.115869]  [<ffffffff8101dbbf>] ? default_idle+0x1f/0xc0
> Jul 13 00:46:58 node11 [157060.115939]  [<ffffffff8101d38f>] ? arch_cpu_idle+0xf/0x20
> Jul 13 00:46:58 node11 [157060.116009]  [<ffffffff810b5aa1>] ? cpu_startup_entry+0x201/0x360
> Jul 13 00:46:58 node11 [157060.116084]  [<ffffffff817420a7>] ? rest_init+0x77/0x80
> Jul 13 00:46:58 node11 [157060.116156]  [<ffffffff81d3a156>] ? start_kernel+0x406/0x413
> Jul 13 00:46:58 node11 [157060.116227]  [<ffffffff81d39b6e>] ? repair_env_string+0x5b/0x5b
> Jul 13 00:46:58 node11 [157060.116298]  [<ffffffff81d39603>] ? x86_64_start_reservations+0x2a/0x2c
> Jul 13 00:46:58 node11 [157060.116373]  [<ffffffff81d3d5dc>] ? xen_start_kernel+0x584/0x586
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 11:54 ` Zoltan Kiss
@ 2014-07-14 12:07 ` Zoltan Kiss
2014-07-14 12:27 ` Zoltan Kiss
2014-07-14 12:14 ` Armin Zentai
1 sibling, 1 reply; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-14 12:07 UTC (permalink / raw)
To: Armin Zentai, xen-devel
I forgot to mention this: Armin, can you set the loglevel to DBG, so we
can see the grant op failure messages from xenvif_tx_check_gop next time?
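If dynamic debug is available in your kernel, something like this should
turn those messages on (a sketch; the debugfs path may differ on your
setup):
echo 8 > /proc/sys/kernel/printk
echo 'module xen_netback +p' > /sys/kernel/debug/dynamic_debug/control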
Regards,
Zoltan
On 14/07/14 12:54, Zoltan Kiss wrote:
> Hi,
>
> Based on the stack trace I think the guest sends packets with more than
> MAX_SKB_SLOTS slots, and one slot's grant mapping operation goes wrong.
> Then the error handling screws up the internal bookkeeping of pending
> slots and tries to unmap something that is not mapped. Unfortunately it
> is quite complicated to emulate grant mapping failures in a predictable
> way.
> Armin, if we provide some patches (hopefully with a solution, but at
> least with some additional logging), would you be able to compile and
> verify them?
>
> Regards,
>
> Zoltan Kiss
>
> On 14/07/14 03:25, Armin Zentai wrote:
>> Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
>> Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
>> Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
>> Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1]
>> [...]
>
>
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 11:54 ` Zoltan Kiss
2014-07-14 12:07 ` Zoltan Kiss
@ 2014-07-14 12:14 ` Armin Zentai
2014-07-14 15:30 ` Zoltan Kiss
2014-07-17 19:17 ` Zoltan Kiss
1 sibling, 2 replies; 14+ messages in thread
From: Armin Zentai @ 2014-07-14 12:14 UTC (permalink / raw)
To: Zoltan Kiss, xen-devel
Hi!
On 14/07/14 13:54, Zoltan Kiss wrote:
> Hi,
>
> Based on the stack trace I think the guest sends packets with more than
> MAX_SKB_SLOTS slots, and one slot's grant mapping operation goes wrong.
> Then the error handling screws up the internal bookkeeping of pending
> slots and tries to unmap something that is not mapped. Unfortunately it
> is quite complicated to emulate grant mapping failures in a predictable
> way.
> Armin, if we provide some patches (hopefully with a solution, but at
> least with some additional logging), would you be able to compile and
> verify them?
I can compile and install it; the verification may take a few days.
We're building up a test system and will try to test with outgoing
traffic. If you have any advice about how to test the new kernel,
please share it with me (I mean whether large pps or large bandwidth is
better, or whether to update/downgrade the GPLPV drivers to a specific
version).
I'm setting the loglevel to DBG.
>
> Regards,
>
> Zoltan Kiss
>
> On 14/07/14 03:25, Armin Zentai wrote:
>> Jul 13 00:46:58 node11 [157060.106323] vif vif-2-0 h14z4mzbvfrrhb: Trying to unmap invalid handle! pending_idx: c
>> Jul 13 00:46:58 node11 [157060.106476] ------------[ cut here ]------------
>> Jul 13 00:46:58 node11 [157060.106546] kernel BUG at drivers/net/xen-netback/netback.c:998!
>> Jul 13 00:46:58 node11 [157060.106616] invalid opcode: 0000 [#1]
>> [...]
>
Thanks,
- Armin
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 12:07 ` Zoltan Kiss
@ 2014-07-14 12:27 ` Zoltan Kiss
0 siblings, 0 replies; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-14 12:27 UTC (permalink / raw)
To: Armin Zentai, xen-devel
Hi,
I've already identified 2 issues in xenvif_tx_check_gop:
- (*gopp_copy)++ happens before we print the error message, though its
only effect is that the error message becomes incorrect
- a more severe issue is that if there is a mapping problem on the frags
(or on the frag_list skb), AND the first TX request is entirely grant
copied ( !(data_len < txreq.size) in build_gops), that pending_idx is
never released. The frags are released, but then the function returns,
the skb is freed and the while loop goes to the next iteration. In the
normal path it would be released a bit later in that loop, at the
"Schedule a response immediately" comment (see the sketch below).
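In simplified form the shape of the second issue is something like this
(a sketch with hypothetical helper names, not the actual code):

	/* inside xenvif_tx_check_gop, after checking the frag map ops */
	if (frag_map_failed) {
		xenvif_release_frag_slots(vif, skb); /* hypothetical helper */
		return err;	/* early return: when the first TX request
				 * was entirely grant-copied, its pending_idx
				 * is released neither here nor at the
				 * "Schedule a response immediately" point */
	}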
However, I don't yet see how this second issue could cause this problem.
Zoltan
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 12:14 ` Armin Zentai
@ 2014-07-14 15:30 ` Zoltan Kiss
2014-07-14 21:15 ` Armin Zentai
2014-07-17 19:17 ` Zoltan Kiss
1 sibling, 1 reply; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-14 15:30 UTC (permalink / raw)
To: Armin Zentai, xen-devel
On 14/07/14 13:14, Armin Zentai wrote:
> Hi!
>
> On 14/07/14 13:54, Zoltan Kiss wrote:
>> Hi,
>>
>> Based on the stack trace I think the guest sends packets with more than
>> MAX_SKB_SLOTS slots, and one slot's grant mapping operation goes wrong.
>> Then the error handling screws up the internal bookkeeping of pending
>> slots and tries to unmap something that is not mapped. Unfortunately it
>> is quite complicated to emulate grant mapping failures in a predictable
>> way.
>> Armin, if we provide some patches (hopefully with a solution, but at
>> least with some additional logging), would you be able to compile and
>> verify them?
>
> I can compile and install it; the verification may take a few days.
> We're building up a test system and will try to test with outgoing
> traffic. If you have any advice about how to test the new kernel,
> please share it with me (I mean whether large pps or large bandwidth is
> better, or whether to update/downgrade the GPLPV drivers to a specific
> version).
>
> I'm setting the loglevel to DBG.
Also, if you have your vmlinux file, can you please run the following:
addr2line -e <your vmlinux file> ffffffffa02622fc
That would hopefully point out where xenvif_idx_unmap was called in
check_gop, as it has 3 call sites.
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 15:30 ` Zoltan Kiss
@ 2014-07-14 21:15 ` Armin Zentai
2014-07-15 9:32 ` Wei Liu
0 siblings, 1 reply; 14+ messages in thread
From: Armin Zentai @ 2014-07-14 21:15 UTC (permalink / raw)
To: Zoltan Kiss, xen-devel
Hi!
On 14/07/14 17:30, Zoltan Kiss wrote:
> On 14/07/14 13:14, Armin Zentai wrote:
>> Hi!
>>
>> On 14/07/14 13:54, Zoltan Kiss wrote:
>>> Hi,
>>>
>>> Based on the stack trace I think the guest sends packets with more than
>>> MAX_SKB_SLOTS slots, and one slot's grant mapping operation goes wrong.
>>> Then the error handling screws up the internal bookkeeping of pending
>>> slots and tries to unmap something that is not mapped. Unfortunately it
>>> is quite complicated to emulate grant mapping failures in a predictable
>>> way.
>>> Armin, if we provide some patches (hopefully with a solution, but at
>>> least with some additional logging), would you be able to compile and
>>> verify them?
>>
>> I can compile and install it; the verification may take a few days.
>> We're building up a test system and will try to test with outgoing
>> traffic. If you have any advice about how to test the new kernel,
>> please share it with me (I mean whether large pps or large bandwidth is
>> better, or whether to update/downgrade the GPLPV drivers to a specific
>> version).
>>
>> I'm setting the loglevel to DBG.
>
> Also, if you have your vmlinux file, can you please run the following:
>
> addr2line -e <your vmlinux file> ffffffffa02622fc
>
> That would hopefully point out where xenvif_idx_unmap was called in
> check_gop, as it has 3 call sites.
>
I've run it on the vmlinux file that I made with make vmlinux.
[root@c2-node16 linux-3.15.4]# addr2line -e vmlinux ffffffffa02622fc
??:0
I think it did not find anything.
In the meantime we've found one VM that crashes its hypervisor every
evening at the same time. This virtual machine belongs to one of our
clients, but I'll try to get some info about a scheduled job or
something that starts at that time.
> Zoli
>
Thanks,
- Armin
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 21:15 ` Armin Zentai
@ 2014-07-15 9:32 ` Wei Liu
0 siblings, 0 replies; 14+ messages in thread
From: Wei Liu @ 2014-07-15 9:32 UTC (permalink / raw)
To: Armin Zentai; +Cc: wei.liu2, Zoltan Kiss, xen-devel
On Mon, Jul 14, 2014 at 11:15:00PM +0200, Armin Zentai wrote:
[...]
> I've run it on the vmlinux file that I made with make vmlinux.
>
> [root@c2-node16 linux-3.15.4]# addr2line -e vmlinux ffffffffa02622fc
> ??:0
>
> I think it did not find anything.
>
AFAICT your netback is a loadable kernel module. Addr2line won't work...
Use gdb instead
# gdb PATH_TO/xen-netback.ko
(gdb) list *(xenvif_tx_action+0x27c)
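Note that this needs the module built with debug info
(CONFIG_DEBUG_INFO=y). If it wasn't, a plain disassembly can still
locate the offset, e.g.:
# objdump -d PATH_TO/xen-netback.ko | less
and then search for <xenvif_tx_action> and count 0x27c bytes into it.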
> In the meantime we've found one VM that crashes its hypervisor every
> evening at the same time. This virtual machine belongs to one of our
> clients, but I'll try to get some info about a scheduled job or
> something that starts at that time.
>
"Every evening at the same time", how exciting...
Wei.
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-14 12:14 ` Armin Zentai
2014-07-14 15:30 ` Zoltan Kiss
@ 2014-07-17 19:17 ` Zoltan Kiss
2014-07-20 17:20 ` Armin Zentai
1 sibling, 1 reply; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-17 19:17 UTC (permalink / raw)
To: Armin Zentai, xen-devel
Hi,
I've just submitted 2 patch series to xen-devel, one for the net tree
(with multiqueue) and one for 3.15. I believe the first two patches in
the series should address this problem; the other two are for related
problems. Armin, can you test all four of them together?
Thanks,
Zoltan
p.s.: if you have the test setup running with increased loglevel, it
would be good to see the result of that, as it can confirm that it is
the problem I've solved in those patches
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-17 19:17 ` Zoltan Kiss
@ 2014-07-20 17:20 ` Armin Zentai
2014-07-21 10:24 ` Zoltan Kiss
0 siblings, 1 reply; 14+ messages in thread
From: Armin Zentai @ 2014-07-20 17:20 UTC (permalink / raw)
To: Zoltan Kiss, xen-devel
Hi!
On 17/07/14 21:17, Zoltan Kiss wrote:
> Hi,
>
> I've just submitted 2 patch series to xen-devel, one for the net tree
> (with multiqueue) and one for 3.15. I believe the first two patches in
> the series should address this problem; the other two are for related
> problems. Armin, can you test all four of them together?
>
> Thanks,
>
> Zoltan
>
> p.s.: if you have the test setup running with increased loglevel, it
> would be good to see the result of that, as it can confirm that it is
> the problem I've solved in those patches
I'll compile the patches next week, and I'll post the test results.
It's a live service, not a developers' playground, so that makes the
testing harder. We've moved back to 3.10.43-11 from the Xen4CentOS
repository instead of using a self-compiled kernel. The problem seems
to be solved (I think): no crash in over 5 days.
Before moving the problematic VMs back to the old kernel, we cloned 2
of them and ran the clones on a HV with the 3.15 kernel and debug
logging enabled. These VMs produced only one HV crash after the
cloning, but we had no usable logs. It happened at night; we were
continuously getting very high load alerts from the HV, and after some
minutes we lost the connection to it and to every VM on it. We checked
it via KVM and found it continuously printing the "Draining TX queue"
message for all vifs. We could only hard-reset it; the logs remained
intact, but no related log entries were found in them. The netconsole
module was useless too: no messages were forwarded after the _crash_.
So I'll apply your patches and wait a week or two.
* Re: Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot
2014-07-20 17:20 ` Armin Zentai
@ 2014-07-21 10:24 ` Zoltan Kiss
0 siblings, 0 replies; 14+ messages in thread
From: Zoltan Kiss @ 2014-07-21 10:24 UTC (permalink / raw)
To: Armin Zentai, xen-devel
On 20/07/14 18:20, Armin Zentai wrote:
> Hi!
>
> On 17/07/14 21:17, Zoltan Kiss wrote:
>> Hi,
>>
>> I've just submitted 2 patch series to xen-devel, one for the net tree
>> (with multiqueue) and one for 3.15. I believe the first two patches in
>> the series should address this problem; the other two are for related
>> problems. Armin, can you test all four of them together?
>>
>> Thanks,
>>
>> Zoltan
>>
>> p.s.: if you have the test setup running with increased loglevel, it
>> would be good to see the result of that, as it can confirm that it is
>> the problem I've solved in those patches
>
> I'll compile the patches next week, and I'll post the test results.
>
> It's a live service, not a developers' playground, so that makes the
> testing harder. We've moved back to 3.10.43-11 from the Xen4CentOS
> repository instead of using a self-compiled kernel. The problem seems
> to be solved (I think): no crash in over 5 days.
>
> Before moving the problematic VMs back to the old kernel, we cloned 2
> of them and ran the clones on a HV with the 3.15 kernel and debug
> logging enabled. These VMs produced only one HV crash after the
> cloning, but we had no usable logs. It happened at night; we were
> continuously getting very high load alerts from the HV, and after some
> minutes we lost the connection to it and to every VM on it. We checked
> it via KVM and found it continuously printing the "Draining TX queue"
> message for all vifs. We could only hard-reset it; the logs remained
> intact, but no related log entries were found in them. The netconsole
> module was useless too: no messages were forwarded after the _crash_.
That seems to be a different issue, maybe not even netback related.
The "Draining ..." message means netback is still alive and purging the
VM's queue, as the guest doesn't empty its ring at all. This usually
means the guest OS has crashed in some way but hasn't rebooted, so the
domain is still there. Can you check the logs from the VM around the
crash time?
If something similar happens next time, I recommend checking top/xentop
and the console of the guest.
>
> So I'll apply your patches and wait a week or two.
>
end of thread, newest: 2014-07-21 10:24 UTC
Thread overview: 14+ messages
2014-07-14 2:25 Trying to unmap invalid handle! pending_idx: @ drivers/net/xen-netback/netback.c:998 causes kernel panic/reboot Armin Zentai
2014-07-14 9:52 ` Wei Liu
2014-07-14 10:53 ` Armin Zentai
2014-07-14 11:15 ` Wei Liu
2014-07-14 11:54 ` Zoltan Kiss
2014-07-14 12:07 ` Zoltan Kiss
2014-07-14 12:27 ` Zoltan Kiss
2014-07-14 12:14 ` Armin Zentai
2014-07-14 15:30 ` Zoltan Kiss
2014-07-14 21:15 ` Armin Zentai
2014-07-15 9:32 ` Wei Liu
2014-07-17 19:17 ` Zoltan Kiss
2014-07-20 17:20 ` Armin Zentai
2014-07-21 10:24 ` Zoltan Kiss