* Intermittent xenvif_disconnect() hang on domU destroy
@ 2016-06-02 0:43 Ed Swierk
From: Ed Swierk @ 2016-06-02 0:43 UTC
To: xen-devel; +Cc: eswierk
I'm seeing the xenwatch kernel thread hang intermittently when
destroying a domU on recent stable xen 4.5, with Linux 4.4.11 + grsec
dom0.
The domU is created with a virtual network interface connected to a
physical interface (ixgbevf) via an openvswitch virtual switch.
Everything works fine until the domain is destroyed. Once in a while,
a few seconds after the domain goes away, xenwatch hangs in
xenvif_disconnect(), calling kthread_stop() on a dealloc task.
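
For context, kthread_stop() blocks because the dealloc kthread refuses
to exit while zerocopy completions are outstanding. From my reading of
the 4.4 source (drivers/net/xen-netback/netback.c, abridged):

  static inline bool xenvif_dealloc_kthread_should_stop(struct xenvif_queue *queue)
  {
          /* Dealloc thread must remain running until all inflight
           * packets complete.
           */
          return kthread_should_stop() &&
                  !atomic_read(&queue->inflight_packets);
  }

  int xenvif_dealloc_kthread(void *data)
  {
          struct xenvif_queue *queue = data;

          for (;;) {
                  wait_event_interruptible(queue->dealloc_wq,
                                           tx_dealloc_work_todo(queue) ||
                                           xenvif_dealloc_kthread_should_stop(queue));
                  if (xenvif_dealloc_kthread_should_stop(queue))
                          break;

                  xenvif_tx_dealloc_action(queue);
                  cond_resched();
          }

          /* Unmap anything remaining */
          if (tx_dealloc_work_todo(queue))
                  xenvif_tx_dealloc_action(queue);

          return 0;
  }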
I added a warning to xenvif_dealloc_kthread_should_stop() when
kthread_should_stop() is true and queue->inflight_packets > 0,
printing inflight_packets as well as stats.tx_zerocopy_*. Each time
the hang occurs, inflight_packets == 1 and tx_zerocopy_sent ==
tx_zerocopy_success + tx_zerocopy_fail + 1.
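
Paraphrasing my local patch (the exact message format is approximate),
the check sits just before the return in the function quoted above:

  if (kthread_should_stop() &&
      atomic_read(&queue->inflight_packets) > 0)
          pr_warn("%s: %s task (pid %d) should_stop=1 inflight_packets=%d tx_zerocopy_sent=%lu tx_zerocopy_success=%lu tx_zerocopy_fail=%lu\n",
                  __func__, queue->name, task_pid_nr(current),
                  atomic_read(&queue->inflight_packets),
                  queue->stats.tx_zerocopy_sent,
                  queue->stats.tx_zerocopy_success,
                  queue->stats.tx_zerocopy_fail);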
I also added a warning to xenvif_skb_zerocopy_complete() when
queue->task is null. If I manually bring down the physical interface
to which the vif was connected (ifconfig down), this somehow causes
the last in-flight packet to be transmitted, and everything is
unblocked.
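
The second warning, in xenvif_skb_zerocopy_complete()
(drivers/net/xen-netback/interface.c), is roughly this, again
paraphrasing my patch:

  void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue)
  {
          atomic_dec(&queue->inflight_packets);

          /* Diagnostic: xenvif_disconnect() clears queue->task before
           * it blocks in kthread_stop() on the dealloc task, so a
           * zerocopy completion arriving here is servicing a queue
           * that is already being torn down.
           */
          WARN(!queue->task, "%s: dead queue %s decremented inflight_packets to %d\n",
               queue->vif->dev->name, queue->name,
               atomic_read(&queue->inflight_packets));

          /* Wake the dealloc thread _after_ decrementing
           * inflight_packets so it can observe the new count and exit.
           */
          wake_up(&queue->dealloc_wq);
  }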
The following shows xenwatch hung trying to stop vif44.0-q0-dealloc,
waking up again after I bring down the physical interface net0_52.
[xl destroy]
...
[ 2914.510070] net vif44.0: stopping vif44.0-q0-dealloc task (pid 28045)
[ 2914.510224] xen_netback:xenvif_dealloc_kthread_should_stop: vif44.0-q0-dealloc task (pid 28045) should_stop=1 inflight_packets=1 tx_zerocopy_sent=209494 tx_zerocopy_success=209492 tx_zerocopy_fail=1
...
[ 2933.561404] device net0_52 left promiscuous mode
[ 2933.564813] device vif44.0 left promiscuous mode
[ 3136.324009] INFO: task xenwatch:29 blocked for more than 120 seconds.
[ 3136.324119] Not tainted 4.4.11-grsec-skyport #66
[ 3136.324181] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3136.324284] xenwatch D ffffc90005d0baf8 12640 29 2 0x00000000
[ 3136.324411] ffffc90005d0baf8 0000000000000000 ffffffff81b70d60 ffff8804ec08c500
[ 3136.324536] ffff8804ec08dbb8 ffffc900115abec0 ffffc900115abeb8 ffff8804ec08c500
[ 3136.324662] 0000000000000005 ffffc90005d0bb10 ffffffff816d6ff7 7fffffffffffffff
[ 3136.324806] Call Trace:
[ 3136.324866] [<ffffffff816d6ff7>] schedule+0x37/0x80
[ 3136.324932] [<ffffffff816dbdec>] schedule_timeout+0x1bc/0x240
[ 3136.325004] [<ffffffff8100a73c>] ? xen_clocksource_read+0x1c/0x30
[ 3136.325078] [<ffffffff810194e3>] ? sched_clock+0x13/0x20
[ 3136.325153] [<ffffffff8109f72c>] ? local_clock+0x1c/0x20
[ 3136.325228] [<ffffffff810bdd69>] ? mark_held_locks+0x79/0xa0
[ 3136.325298] [<ffffffff816dd077>] ? _raw_spin_unlock_irq+0x27/0x50
[ 3136.325367] [<ffffffff810bdecd>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[ 3136.325441] [<ffffffff816d8426>] wait_for_completion+0xd6/0x110
[ 3136.325514] [<ffffffff81099570>] ? wake_up_q+0x70/0x70
[ 3136.325585] [<ffffffff8108f087>] kthread_stop+0x47/0x80
[ 3136.325660] [<ffffffff814f1661>] xenvif_disconnect+0xb1/0x130
[ 3136.325729] [<ffffffff814ef3c6>] set_backend_state+0x116/0xde0
[ 3136.325805] [<ffffffff8143717e>] ? xenbus_gather+0x10e/0x140
[ 3136.325881] [<ffffffff811a5e42>] ? kfree+0x1c2/0x1e0
[ 3136.325960] [<ffffffff8109f72c>] ? local_clock+0x1c/0x20
[ 3136.326026] [<ffffffff814f0577>] frontend_changed+0xb7/0xc0
[ 3136.326095] [<ffffffff81437fb0>] xenbus_otherend_changed+0x80/0x90
[ 3136.330341] [<ffffffff81437410>] ? unregister_xenbus_watch+0x260/0x260
[ 3136.330414] [<ffffffff81438d2b>] frontend_changed+0xb/0x10
[ 3136.330483] [<ffffffff8143744a>] xenwatch_thread+0x3a/0x130
[ 3136.330553] [<ffffffff810b2ba0>] ? wake_up_atomic_t+0x30/0x30
[ 3136.330621] [<ffffffff8108eccc>] kthread+0xfc/0x120
[ 3136.330686] [<ffffffff8108ebd0>] ? kthread_create_on_node+0x240/0x240
[ 3136.330775] [<ffffffff816ddbee>] ret_from_fork+0x3e/0x70
[ 3136.330840] [<ffffffff8108ebd0>] ? kthread_create_on_node+0x240/0x240
[ 3136.330911] 1 lock held by xenwatch/29:
[ 3136.330972] #0: (xenwatch_mutex){+.+.+.}, at: [<ffffffff814374a7>] ffffffff814374a7
...
[ifconfig net0_52 down]
...
[ 3162.217907] ixgbe 0000:81:00.0 net0: VF Reset msg received from vf 52
[ 3162.228840] ------------[ cut here ]------------
[ 3162.228945] WARNING: CPU: 3 PID: 31978 at drivers/net/xen-netback/interface.c:71 xenvif_skb_zerocopy_complete+0x79/0x90()
[ 3162.229104] vif44.0: dead queue vif44.0-q0 decremented inflight_packets to 0
[ 3162.229184] Modules linked in: xt_physdev br_netfilter bridge stp llc tun xen_blkback ip_gre ip_tunnel gre ixgbevf drbg ansi_cprng dm_crypt algif_skcipher af_alg xen_evtchn xenfs xen_privcmd xen_pciback openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul raid1 aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler squashfs lz4_decompress ixgbe mdio vxlan xhci_pci xhci_hcd ip6_udp_tunnel udp_tunnel ptp pps_core dca ahci libahci ehci_pci ehci_hcd usbcore usb_common tpm_tis
[ 3162.230580] CPU: 3 PID: 31978 Comm: ifconfig Not tainted 4.4.11-grsec-skyport #66
[ 3162.230681] Hardware name: ABCD, BIOS SE5C610.86B.01.01.0009.C1.060120151350 06/01/2015
[ 3162.230814] 0000000000000000 ffffc9000f7c3a68 ffffffff8137d3f8 0000000000000000
[ 3162.230938] ffffc9000f7c3ab0 ffffffff81aa9060 ffffc9000f7c3aa0 ffffffff81068de8
[ 3162.231062] ffffc90011565000 ffffc9001156f788 0000000000000001 ffffc90011565000
[ 3162.231187] Call Trace:
[ 3162.231251] [<ffffffff8137d3f8>] dump_stack+0x9a/0xe2
[ 3162.231324] [<ffffffff81068de8>] warn_slowpath_common+0x78/0xb0
[ 3162.231392] [<ffffffff81068e67>] warn_slowpath_fmt+0x47/0x50
[ 3162.231469] [<ffffffff810bdecd>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[ 3162.231540] [<ffffffff814f0dc9>] xenvif_skb_zerocopy_complete+0x79/0x90
[ 3162.231612] [<ffffffff814ede0f>] xenvif_zerocopy_callback+0x9f/0xc0
[ 3162.231694] [<ffffffff8156cdb4>] skb_release_data+0xc4/0xe0
[ 3162.231761] [<ffffffff8156cdef>] skb_release_all+0x1f/0x30
[ 3162.231828] [<ffffffff8156cecd>] consume_skb+0x1d/0x40
[ 3162.231900] [<ffffffff81586a74>] __dev_kfree_skb_any+0x34/0x40
[ 3162.231974] [<ffffffffa0bd00d0>] ixgbevf_unmap_and_free_tx_resource.isra.46+0x20/0x80 [ixgbevf]
[ 3162.232083] [<ffffffffa0bd016c>] ixgbevf_clean_tx_ring+0x3c/0x80 [ixgbevf]
[ 3162.232156] [<ffffffffa0bd418e>] ixgbevf_down+0x2be/0x330 [ixgbevf]
[ 3162.232226] [<ffffffffa0bd5332>] ixgbevf_close+0x22/0xa0 [ixgbevf]
[ 3162.232299] [<ffffffff81581240>] __dev_close_many+0x90/0xe0
[ 3162.232378] [<ffffffff815813be>] __dev_close+0x2e/0x50
[ 3162.232448] [<ffffffff8158cee8>] __dev_change_flags+0x98/0x160
[ 3162.232517] [<ffffffff8158cfd4>] dev_change_flags+0x24/0x60
[ 3162.232590] [<ffffffff8161f534>] devinet_ioctl+0x834/0x8f0
[ 3162.232661] [<ffffffff8162077b>] inet_ioctl+0x4b/0x70
[ 3162.232739] [<ffffffff8155e6c0>] sock_do_ioctl+0x20/0x50
[ 3162.232804] [<ffffffff8155e8d0>] sock_ioctl+0x1e0/0x290
[ 3162.232876] [<ffffffff811da630>] do_vfs_ioctl+0x430/0x7f0
[ 3162.232939] [<ffffffff811daa64>] SyS_ioctl+0x74/0x80
[ 3162.233009] [<ffffffff816dd83a>] entry_SYSCALL_64_fastpath+0x16/0x7e
[ 3162.233129] ---[ end trace 2e4a237dee6f3318 ]---
[ 3162.357009] xen:events: domain 45 does not have 103 anymore
[ 3162.357099] xen:events: domain 45 does not have 102 anymore
...
[xenwatch unblocked]
...
[ 3162.394227] net vif50.0: stopping vif50.0-q0-dealloc task (pid 29413)
[ 3162.667880] net vif42.0: stopping vif42.0-q0-dealloc task (pid 27272)
[ 3162.705464] net vif48.0: stopping vif48.0-q0-dealloc task (pid 28779)
It's not clear to me whether the problem lies in netback, ixgbevf, or
somewhere in between. Is the root cause ixgbevf hanging onto a skb for
so long, and doing nothing with it until I bring the interface down,
or is that a symptom of some other problem? Or is netback supposed to
somehow flush in-flight transmit packets before it gets as far as
xenvif_disconnect()? Or should it forget about the in-flight packets
since the interface is disappearing anyway?
Any clues would be appreciated.
--Ed
* Re: Intermittent xenvif_disconnect() hang on domU destroy
@ 2016-06-02 9:51 David Vrabel
From: David Vrabel @ 2016-06-02 9:51 UTC
To: Ed Swierk, xen-devel
On 02/06/16 01:43, Ed Swierk wrote:
> I'm seeing the xenwatch kernel thread hang intermittently when
> destroying a domU on recent stable xen 4.5, with Linux 4.4.11 + grsec
> dom0.
>
> The domU is created with a virtual network interface connected to a
> physical interface (ixgbevf) via an openvswitch virtual switch.
>
> Everything works fine until the domain is destroyed. Once in a while,
> a few seconds after the domain goes away, xenwatch hangs in
> xenvif_disconnect(), calling kthread_stop() on a dealloc task.
>
> I added a warning to xenvif_dealloc_kthread_should_stop() when
> kthread_should_stop() is true and queue->inflight_packets > 0,
> printing inflight_packets as well as stats.tx_zerocopy_*. Each time
> the hang occurs, inflight_packets == 1 and tx_zerocopy_sent ==
> tx_zerocopy_success + tx_zerocopy_fail + 1.
>
> I also added a warning to xenvif_skb_zerocopy_complete() when
> queue->task is null. If I manually bring down the physical interface
> to which the vif was connected (ifconfig down), this somehow causes
> the last in-flight packet to be transmitted, and everything is
> unblocked.
>
[...]
>
> It's not clear to me whether the problem lies in netback, ixgbevf, or
> somewhere in between. Is the root cause ixgbevf hanging onto a skb for
> so long, and doing nothing with it until I bring the interface down,
> or is that a symptom of some other problem? Or is netback supposed to
> somehow flush in-flight transmit packets before it gets as far as
> xenvif_disconnect()? Or should it forget about the in-flight packets
> since the interface is disappearing anyway?
>
> Any clues would be appreciated.
netback can't flush in-flight packets because the network stack doesn't
provide a mechanism for this. Nor can it forget about them: they hold
foreign (grant-mapped) pages, which must be unmapped, and the
corresponding ring requests completed, before the device can be CLOSED.
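
Concretely (4.4 names; this is the chain visible in your second trace):

  consume_skb()
    -> skb_release_data()
      -> uarg->callback == xenvif_zerocopy_callback()
           (queues the pending slots for dealloc)
        -> xenvif_skb_zerocopy_complete()
             (drops inflight_packets and wakes the dealloc thread,
              which unmaps the grants and completes the ring requests
              in xenvif_tx_dealloc_action(), and only then lets
              kthread_stop() return)

Nothing in that chain can run until the last reference to the skb is
dropped; in your trace that reference sits in ixgbevf's TX ring until
ixgbevf_clean_tx_ring() runs on ifconfig down.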
You need to investigate what is holding on to this skb.
David