From: Nikola Ciprich
Subject: kvm-72: problems with 8139 under heavy load, lost interrupts?
Date: Fri, 22 Aug 2008 09:54:35 +0200
Message-ID: <20080822075434.GA3229@develbox.linuxbox.cz>
To: KVM list
Cc: nikola.ciprich@linuxbox.cz

Hello everybody,

we're running a cluster of two hosts with tens of KVM guests (~45 running), and I have now noticed that some nodes are losing link under heavy load. The following appears in dmesg:

[  422.077128] NETDEV WATCHDOG: eth0: transmit timed out
[  422.077215] eth0: Transmit timeout, status d 2b 5 80ff

[root@sql1 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        144          0          0          0   IO-APIC-edge      timer
  1:        539          2          1          2   IO-APIC-edge      i8042
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 10:     756783     362345     372753     751385   IO-APIC-fasteoi   eth0
 11:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 12:        150          4          3          4   IO-APIC-edge      i8042
 14:     518448     528815     172232     348704   IO-APIC-edge      ide0
 15:          0          0          0          0   IO-APIC-edge      ide1
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     829179     775992     505151     458761   Local timer interrupts
RES:     115772      98143      88928      82099   Rescheduling interrupts
CAL:         73        166        138        160   function call interrupts
TLB:     214586     255980      66806     278284   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:       1261

I guess the MIS value might be related to this. So far I have observed this problem only on 32-bit guests, but that might be a coincidence (the affected guests are heavily used). It also seems that it *might* be related to SMP guests.

The hosts are running 2.6.26.2-x86_64 + kvm-72; the guests run 2.6.24 and use the 8139 virtual adapter.
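
One low-risk way to watch this from inside a guest would be to sample /proc/interrupts periodically and log which counters moved. Below is a minimal Python sketch under that assumption. The function names (parse_interrupts, changed, watch) are made up for illustration, and the script only reads the file. Whether growth in MIS really lines up with the transmit timeouts is still just a guess.

```python
import time


def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (total_count, description)}."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        counts = []
        # Rows such as ERR and MIS carry a single counter, not one per CPU.
        for field in fields[:ncpus]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (sum(counts), description)
    return table


def changed(before, after):
    """Return {label: delta} for every counter that moved between samples."""
    result = {}
    for label, (total, _desc) in after.items():
        if label in before and total != before[label][0]:
            result[label] = total - before[label][0]
    return result


def watch(path="/proc/interrupts", interval=10.0, samples=6):
    """Print timestamped counter deltas for a fixed number of samples."""
    with open(path) as f:
        previous = parse_interrupts(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            current = parse_interrupts(f.read())
        print(time.strftime("%H:%M:%S"), changed(previous, current))
        previous = current
```

Calling watch(interval=10, samples=360) would log one line every ten seconds for an hour. A jump in "MIS" or a stalled eth0 IRQ counter (label "10" in the output above) around the time of a transmit timeout would be the thing to look for.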
I'm not sure whether we had this problem with older KVM versions (and thus whether this is a regression): the load on the machines is growing constantly, so we may simply not have noticed the problem before.

I CAN try other virtual adapters as well, but both machines are in production, so I have to be a bit cautious when it comes to experimenting. I'll try to prepare a testing environment where I can reproduce the problem. In the meantime, is there some way I could debug the problem further in a safe manner? I don't see anything related in either host's dmesg or log files.

Thanks a lot in advance
BR
nik

-- 
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------