From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Hahn
Subject: Re: RFH: Kernel OOPS in xen_netbk_rx_action / xenvif_gop_skb
Date: Fri, 27 Jun 2014 10:42:24 +0200
Message-ID: <53AD2E70.5060002@univention.de>
References: <5391976F.8020800@univention.de>
 <20140606105804.GD11959@zion.uk.xensource.com>
 <53923CD0.7010001@univention.de>
 <53A1C2DF.10407@univention.de>
 <20140619141252.GO20819@zion.uk.xensource.com>
 <53A84002.5050402@univention.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Received: from mail6.bemta5.messagelabs.com ([195.245.231.135])
 by lists.xen.org with esmtp (Exim 4.72) (envelope-from )
 id 1X0Rjn-0004u1-T2 for xen-devel@lists.xenproject.org;
 Fri, 27 Jun 2014 08:42:28 +0000
In-Reply-To: <53A84002.5050402@univention.de>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu
Cc: xen-devel, Erik Damrose, Ian Campbell, Zoltan Kiss
List-Id: xen-devel@lists.xenproject.org

Hello Wei Liu,

On 23.06.2014 16:56, Philipp Hahn wrote:
> On 19.06.2014 16:12, Wei Liu wrote:
>> On Wed, Jun 18, 2014 at 06:48:31PM +0200, Philipp Hahn wrote:
> ...
>>> 5. then xen-netback continues processing the pending requests and tries
> ...
>> I think your analysis makes sense. Netback does have its internal queue
>> and kthread can certainly be scheduled away. There doesn't seem to be a
>> synchronisation point between a vif getting disconnected and internal queue
>> gets processed. I attach a quick hack. If it does work to a degree then
>> we can try to work out a proper fix.
>
> Your quick hack seems to have solved the problem: The network survived
> the week-end, but we had to change the VMs as one of them was required
> last weekend.
> We're currently re-checking that the bug still occurs with
> the old kernel but the ne

We added some debug output (UniDEBUG) because we observed another OOPS in
one test run. I think that one came from a mis-compiled kernel: the size
of the function was the same as before (0x6f0 in the trace below), but it
should be 0x712 according to "objdump -S":

> [ 6196.712232] BUG: unable to handle kernel paging request at ffffc90010d94678
> [ 6196.712322] IP: [] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
> [ 6196.712410] PGD 95822067 PUD 95823067 PMD 94721067 PTE 0
> [ 6196.712473] Oops: 0000 [#1] SMP
...
> [ 6196.713434] CPU: 0 PID: 11743 Comm: netback/0 Not tainted 3.10.0-ucs58-amd64 #1 Univention 3.10.11-1.58.201405060908a~xenXXX
...
> [ 6196.713618] task: ffff8800917f7840 ti: ffff880004bde000 task.ti: ffff880004bde000
> [ 6196.713701] RIP: e030:[] [] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]

With the modified patch we now get the following hang:

> [ 84.833333] device eth2 entered promiscuous mode
> [ 248.191165] UniDEBUG vif->mapped is set to false (xenvif_alloc)
> [ 248.442727] device vif1.0 entered promiscuous mode
> [ 250.721054] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
> [ 250.721099] XXXlan0: port 2(vif1.0) entered forwarding state
> [ 250.721103] XXXlan0: port 2(vif1.0) entered forwarding state
> [ 253.473859] UniDEBUG vif->mapped is set to false (xenvif_alloc)
> [ 253.737812] device vif2.0 entered promiscuous mode
> [ 255.639021] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
> [ 255.639067] XXXlan0: port 3(vif2.0) entered forwarding state
> [ 255.639072] XXXlan0: port 3(vif2.0) entered forwarding state
> [ 592.867375] UniDEBUG vif->mapped is set to false(xen_netbk_unmap_frontend_rings)
> [ 592.868147] XXXlan0: port 3(vif2.0) entered disabled state
> [ 593.499258] XXXlan0: port 3(vif2.0) entered disabled state
> [ 593.499293] device vif2.0 left promiscuous mode
> [ 593.499295] XXXlan0: port 3(vif2.0) entered disabled state
> [ 595.386548] UniDEBUG vif->mapped is set to false (xenvif_alloc)
> [ 595.633665] device vif3.0 entered promiscuous mode
> [ 597.390410] UniDEBUG vif->mapped is true (xen_netbk_map_frontend_rings)
> [ 597.390458] XXXlan0: port 3(vif3.0) entered forwarding state
> [ 597.390462] XXXlan0: port 3(vif3.0) entered forwarding state
> [ 936.549840] UniDEBUG vif->mapped is set to false(xen_netbk_unmap_frontend_rings)
> [ 936.549869] XXXlan0: port 3(vif3.0) entered disabled state
> [ 936.553024] UniDEBUG vif->mapped is false here it would oops previously.
> [ 937.459565] device vif3.0 left promiscuous mode
> [ 937.459570] XXXlan0: port 3(vif3.0) entered disabled state
> [ 1115.250900] INFO: task xenwatch:14 blocked for more than 120 seconds.
> [ 1115.250902] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1115.250904] xenwatch D ffff8800952d9080 0 14 2 0x00000000
> [ 1115.250907] ffff8800952d9080 0000000000000246 ffff880094510880 0000000000013ec0
> [ 1115.250909] ffff88009530ffd8 0000000000013ec0 ffff88009530ffd8 0000000000013ec0
> [ 1115.250911] ffff8800952d9080 0000000000013ec0 0000000000013ec0 ffff88009530e010
> [ 1115.250913] Call Trace:
> [ 1115.250921] [] ? _raw_spin_lock_irqsave+0x11/0x2f
> [ 1115.250925] [] ? xenvif_free+0x7a/0xb6 [xen_netback]
> [ 1115.250930] [] ? wake_up_bit+0x20/0x20
> [ 1115.250934] [] ? xenbus_rm+0x44/0x4f
> [ 1115.250937] [] ? netback_remove+0x5d/0x7e [xen_netback]
> [ 1115.250940] [] ? xenbus_dev_remove+0x29/0x4e
> [ 1115.250943] [] ? __device_release_driver+0x7f/0xd5
> [ 1115.250946] [] ? device_release_driver+0x1d/0x29
> [ 1115.250948] [] ? bus_remove_device+0xee/0x103
> [ 1115.250950] [] ? device_del+0x112/0x182
> [ 1115.250952] [] ? device_unregister+0x9/0x12
> [ 1115.250955] [] ? xenwatch_thread+0x122/0x15f
> [ 1115.250957] [] ? wake_up_bit+0x20/0x20
> [ 1115.250959] [] ? xs_watch+0x57/0x57
> [ 1115.250962] [] ? kthread_freezable_should_stop+0x56/0x56
> [ 1115.250964] [] ? xs_watch+0x57/0x57
> [ 1115.250966] [] ? kthread+0xab/0xb3
> [ 1115.250969] [] ? xen_end_context_switch+0xe/0x1c
> [ 1115.250972] [] ? kthread_freezable_should_stop+0x56/0x56
> [ 1115.250975] [] ? ret_from_fork+0x7c/0xb0
> [ 1115.250977] [] ? kthread_freezable_should_stop+0x56/0x56

Any idea?

Sincerely
Philipp