From mboxrd@z Thu Jan 1 00:00:00 1970
From: Philipp Hahn
Subject: Re: RFH: Kernel OOPS in xen_netbk_rx_action / xenvif_gop_skb
Date: Sat, 07 Jun 2014 00:12:32 +0200
Message-ID: <53923CD0.7010001@univention.de>
References: <5391976F.8020800@univention.de> <20140606105804.GD11959@zion.uk.xensource.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Received: from mail6.bemta4.messagelabs.com ([85.158.143.247]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1Wt2NK-0002fA-21 for xen-devel@lists.xenproject.org; Fri, 06 Jun 2014 22:12:38 +0000
In-Reply-To: <20140606105804.GD11959@zion.uk.xensource.com>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Wei Liu
Cc: xen-devel, Erik Damrose, Ian Campbell, Zoltan Kiss
List-Id: xen-devel@lists.xenproject.org

Hello,

On 06.06.2014 12:58, Wei Liu wrote:
> On Fri, Jun 06, 2014 at 12:26:55PM +0200, Philipp Hahn wrote:
>> on one of our hosts (Xen-4.1.3 with Linux-3.10.26 + Debian patches)
>> running 16 Linux VMs (linux-3.2.39 and others) netback crashes during
>> the night when one of the VMs is rebooted by a cron-job:
>>> [38551.549615] Oops: 0000 [#1] SMP
>
> Is there any more output above this line? Is it a NULL pointer
> dereference or something else?

Sorry, those lines got lost somehow during copy&paste:

[38551.547728] XXXlan0: port 9(vif26.0) entered disabled state
[38551.549365] BUG: unable to handle kernel paging request at ffffc900108641d8
[38551.549461] IP: [] xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
[38551.549551] PGD 57e20067 PUD 57e21067 PMD 571a7067 PTE 0
[38551.549615] Oops: 0000 [#1] SMP

>>> [38551.550865] RIP: e030:[] []
>>> xen_netbk_rx_action+0x18b/0x6f0 [xen_netback]
>
> Try addr2line?

Good to know, but since that host has already been rebooted, I no longer
know the module load address, which seems to render addr2line useless.

>> The host itself is still alive and reachable by network, but all VMs are
>> no longer reachable.
>> The crash does not happen on every reboot: The VM was running fine for
>> 1½ week after a dom0 kernel update, but now crashed the following past
>> two nights.
>>
>
> What's the Dom0 kernel version before upgrading? That would help us
> narrow down the range of changesets.

The previous kernel was 3.10.15. The update was performed to get another
bug fixed, which went into the Debian update between .11 and .26:

commit 0ff773f59ff375c42af2238457bda98ed4ddcd25
Author: David Vrabel
Date:   Wed Sep 11 14:52:48 2013 +0100

    xen-netback: count number required slots for an skb more carefully

    [ Upstream commit 6e43fc04a6bc357d260583b8440882f28069207f ]

> The oops happens in guest receive path. Unfortunately that's a very
> complex function, it's hard to identify the problem by looking at the
> code.
>
> And as you seem to be using a distro kernel, have your reported to
> Debian yet? I don't quite understand which Debian release has 3.10
> kernel though.

Actually this is "Univention Corporate Server" (UCS), which is
Debian-Squeeze based but with a newer Xen-4.1.3 and a newer Linux-3.10
kernel.

> 3.7.0 is too old. There has been lots of changes since then.

Probably so, but thanks for the confirmation.

>> Running "objdump -Sl xen-netback.ko" shows the OOPs to happen here:
>>> /root/linux-3.10.11/drivers/net/xen-netback/netback.c:606
...
> You mentioned 3.10.26 at the beginning but now it's 3.10.11? I'm
> confused.

This has something to do with the Debian patch policy: The first 3.10
kernel was 3.10.11, so that number stays, even when it is patched up to
3.10.26.

> If it's dereferencing NULL pointer, skb_shinfo(skb) == NULL?

If I got the math right, it looks like it's crashing here:

>>> /root/linux-3.10.11/drivers/net/xen-netback/netback.c:611
>>> meta->id = req->id;
>>> 7d8: 48 83 c2 08 add $0x8,%rdx
>>> 7dc: 0f b7 34 d1 movzwl (%rcx,%rdx,8),%esi
>> 0x651 + 0x18B = 0x7DC

0x651 is the start of xen_netbk_rx_action() from objdump.

...
> There's one more patch that you can pick up from 3.10.y tree. I doubt it
> will make much difference though.
>
> I think the first thing to do is to identify which line of code is
> causing the problem. If it is actually the line you're referring to in
> your analyse then we need to figure out why skb_shinfo(skb) is NULL...

I'll try to add some debug output to yell if skb_shinfo() is NULL, but
it might take some time until the bug manifests again.

> Wei.

Thank you for your feedback.

Philipp
--
Philipp Hahn
Open Source Software Engineer
Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
Tel.: +49 421 22232-0
Fax : +49 421 22232-99
hahn@univention.de
http://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876
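
P.S. For anyone who wants to redo the offset arithmetic from this mail
(0x651 + 0x18B = 0x7DC), here is a minimal sketch in Python. It is only an
illustration, not part of the original analysis: the helper name
oops_to_objdump_addr is hypothetical, and it assumes the function start
(0x651) has been read by hand from the "objdump -Sl xen-netback.ko" output.
It parses the "symbol+0xOFFSET/0xLENGTH" notation from the oops and adds the
offset to that start address.

```python
import re

# Matches the "symbol+0xOFFSET/0xLENGTH" notation printed in a kernel oops.
_OOPS_REF = re.compile(r"(\w+)\+0x([0-9a-fA-F]+)/0x([0-9a-fA-F]+)")


def oops_to_objdump_addr(oops_ref: str, func_start: int) -> int:
    """Return the address inside the objdump listing for an oops reference.

    oops_ref   -- e.g. "xen_netbk_rx_action+0x18b/0x6f0"
    func_start -- start of that function in the objdump output of the
                  module (assumed to be looked up manually)
    """
    match = _OOPS_REF.fullmatch(oops_ref.strip())
    if match is None:
        raise ValueError(f"not a symbol+offset/length reference: {oops_ref!r}")
    offset = int(match.group(2), 16)
    length = int(match.group(3), 16)
    if offset >= length:
        # The faulting offset has to lie inside the function body.
        raise ValueError("offset lies outside the function")
    return func_start + offset


if __name__ == "__main__":
    addr = oops_to_objdump_addr("xen_netbk_rx_action+0x18b/0x6f0", 0x651)
    print(hex(addr))  # 0x7dc, the movzwl instruction in the listing above
```

The range check against the function length is only a sanity test: it
catches a start address taken from the wrong function, but it cannot tell
whether the objdump listing really matches the kernel that crashed.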