Date: Tue, 3 Apr 2012 13:42:18 +0100
From: Chris Webb
Message-ID: <20120403124217.GN1283@arachsys.com>
References: <20120402153722.GA30499@arachsys.com> <20120403071328.GB27304@stefanha-thinkpad.localdomain> <20120403081313.GD1283@arachsys.com>
Subject: Re: [Qemu-devel] Intermittent e1000 failure on qemu-kvm 1.0
To: Stefan Hajnoczi
Cc: qemu-devel@nongnu.org

Stefan Hajnoczi writes:

> In a case like this it might be most effective to catch a VM in the
> bad state and then go in with gdb to see what is broken. The basic
> approach would be putting breakpoints on the e1000 device model's
> transmit/receive paths to see if the guest is giving us packets and
> whether the tap device is transmitting/receiving. If guest and host
> appear to be working then QEMU's e1000 model must be in a bad state
> and it's a question of looking at the tx/rx rings and other hardware
> emulation state to figure out what went wrong.

Hi Stefan. I tried setting a breakpoint on start_xmit, but the qemu blew up
when I hit it:

  (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:start_xmit
  Function "start_xmit" not defined.
  Make breakpoint pending on future shared library load? (y or [n]) n
  (gdb) break /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c:528
  Breakpoint 1 at 0x46dcd6: file /home/root/packages/qemu-kvm-1.0/src-hrw66F/hw/e1000.c, line 528.
  (gdb) cont
  Continuing.

  Program terminated with signal SIGTRAP, Trace/breakpoint trap.
  The program no longer exists.

I assume this is some subtlety with breakpointing threaded code?

However, along these lines, I note that the guest appears to have received
packets, though this count is stuck at 1993 bytes. The TX count marches
upwards as I ping outbound from the guest. If I attach a tcpdump to tap1 on
the host, I see the ARP requests going out and apparently no reply:

  0024# tcpdump -i tap1
  tcpdump: WARNING: tap1: no IPv4 address assigned
  tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  listening on tap1, link-type EN10MB (Ethernet), capture size 65535 bytes
  12:08:35.654992 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:36.654976 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:37.654975 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:38.670933 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:39.670922 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:08:40.670908 ARP, Request who-has 84.45.8.129 tell 84.45.8.242, length 28

Looking on br0, I do seem to see the replies:

  12:12:53.509471 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:53.509914 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:54.509455 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:54.509875 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:55.509447 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:55.509878 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:56.525424 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:56.525854 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46
  12:12:57.525408 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 84.45.8.129 tell 84.45.8.242, length 28
  12:12:57.525837 ARP, Ethernet (len 6), IPv4 (len 4), Reply 84.45.8.129 is-at 00:13:c3:35:a6:42 (oui Unknown), length 46

but they never get to tap1 despite STP being disabled and no bridge port
filtering:

  # ebtables -L
  Bridge table: filter

  Bridge chain: INPUT, entries: 0, policy: ACCEPT

  Bridge chain: FORWARD, entries: 0, policy: ACCEPT

  Bridge chain: OUTPUT, entries: 0, policy: ACCEPT

  # brctl show br0
  bridge name     bridge id               STP enabled     interfaces
  br0             8000.002590224ffa       no              eth0

This looks uncannily like a kernel problem doesn't it? However, remove the
-usbdevice tablet, and it goes away, which is truly weird! I've just done a
hundred successful reboots without it once again to confirm to myself that
I'm definitely not imagining that behaviour.

> Have you tried unloading the e1000 kernel module inside the guest and
> then modprobing it again? Does this "fix" the issue?

Hadn't thought of that, but no, it apparently has no effect. It's still
broken after I rmmod it, modprobe it again, and reconfigure the networking.

Cheers,

Chris.