From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:44871) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1SF3zr-0000rz-Tp for qemu-devel@nongnu.org; Tue, 03 Apr 2012 09:42:13 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1SF3zl-0004Dx-47 for qemu-devel@nongnu.org; Tue, 03 Apr 2012 09:42:07 -0400
Received: from mail-ee0-f45.google.com ([74.125.83.45]:50139) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1SF3zk-0004DX-Jd for qemu-devel@nongnu.org; Tue, 03 Apr 2012 09:42:01 -0400
Received: by eeit10 with SMTP id t10so1353694eei.4 for ; Tue, 03 Apr 2012 06:41:58 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20120403124217.GN1283@arachsys.com>
References: <20120402153722.GA30499@arachsys.com> <20120403071328.GB27304@stefanha-thinkpad.localdomain> <20120403081313.GD1283@arachsys.com> <20120403124217.GN1283@arachsys.com>
Date: Tue, 3 Apr 2012 14:34:32 +0100
From: Stefan Hajnoczi
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Subject: Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
To: Chris Webb
Cc: qemu-devel@nongnu.org

On Tue, Apr 3, 2012 at 1:42 PM, Chris Webb wrote:
> Stefan Hajnoczi writes:
>
>> In a case like this it might be most effective to catch a VM in the
>> bad state and then go in with gdb to see what is broken.  The basic
>> approach would be putting breakpoints on the e1000 device model's
>> transmit/receive paths to see if the guest is giving us packets and
>> whether the tap device is transmitting/receiving.  If guest and host
>> appear to be working then QEMU's e1000 model must be in a bad state
>> and it's a question of looking at the tx/rx rings and other hardware
>> emulation state to figure out what went wrong.
>
> Hi Stefan. I tried setting a breakpoint on start_xmit, but the qemu blew up
> when I hit it:
>
> (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:start_xmit
> Function "start_xmit" not defined.
> Make breakpoint pending on future shared library load? (y or [n]) n
> (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:528
> Breakpoint 1 at 0x46dcd6: file /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c, line 528.
> (gdb) cont
> Continuing.
>
> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> The program no longer exists.
>
> I assume this is some subtlety with breakpointing threaded code?

No, that's weird.  I would have simply tried "b start_xmit" and as long
as the binary has symbols gdb would know what to do.

> However, along these lines, I note that the guest appears to have received
> packets, though this count is stuck at 1993 bytes. The TX count marches upwards
> as I ping outbound from the guest.
>
> If I attach a tcpdump to tap1 on the host, I see the ARP requests going out and
> apparently no reply:
>
> 0024# tcpdump -i tap1
> tcpdump: WARNING: tap1: no IPv4 address assigned
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on tap1, link-type EN10MB (Ethernet), capture size 65535 bytes
> 12:08:35.654992 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:36.654976 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:37.654975 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:38.670933 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:39.670922 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:08:40.670908 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
>
> Looking on br0, I do seem to see the replies:
>
> 12:12:53.509471 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:53.509914 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:54.509455 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:54.509875 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:55.509447 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:55.509878 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:56.525424 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:56.525854 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
> 12:12:57.525408 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
> 12:12:57.525837 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
>
> but they never get to tap1 despite STP being disabled and no bridge port
> filtering:
>
>  # ebtables -L
>  Bridge table: filter
>
>  Bridge chain: INPUT, entries: 0, policy: ACCEPT
>
>  Bridge chain: FORWARD, entries: 0, policy: ACCEPT
>
>  Bridge chain: OUTPUT, entries: 0, policy: ACCEPT
>
>  # brctl show br0
>  bridge name     bridge id               STP enabled     interfaces
>  br0             8000.002590224ffa       no              eth0
>
>
> This looks uncannily like a kernel problem doesn't it? However, remove the
> -usbdevice tablet, and it goes away, which is truly weird! I've just done a
> hundred successful reboots without it once again to confirm to myself that I'm
> definitely not imagining that behaviour.

Are you sure no other guest has the same MAC address or IP address?
This weird behavior sounds similar to what happens when you have
multiple devices on a network using the same address - the results are
very confusing :).

Stefan
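
For the duplicate-address question above, a minimal sketch of one way to check, assuming you can collect each guest's configured MAC address into a dict first. The function name, the guest names and the MAC values are all made up for illustration; none of them come from the report in this thread.

```python
from collections import defaultdict


def find_duplicate_macs(guest_macs):
    """Return {mac: [guest, ...]} for every MAC used by more than one guest.

    guest_macs is a hypothetical mapping of guest name -> MAC address string;
    how it is gathered (config files, a management database) is left open.
    """
    by_mac = defaultdict(list)
    for guest, mac in guest_macs.items():
        # Normalise case and separators so that 52-54-00-AB-CD-EF and
        # 52:54:00:ab:cd:ef are treated as the same address.
        key = mac.strip().lower().replace("-", ":")
        by_mac[key].append(guest)
    return {mac: sorted(guests) for mac, guests in by_mac.items() if len(guests) > 1}


# Made-up example data, not taken from the thread.
example = {
    "guest-a": "52:54:00:12:34:56",
    "guest-b": "52:54:00:12:34:57",
    "guest-c": "52:54:00:12:34:56",
}
duplicates = find_duplicate_macs(example)
print(duplicates)
```

An empty result only rules out clashes among the guests you listed; it says nothing about other machines on the same segment, or about duplicate IP addresses.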