* kvm-72: problems with 8139 under heavy load, lost interrupts?
From: Nikola Ciprich @ 2008-08-22 7:54 UTC
To: KVM list; +Cc: nikola.ciprich
Hello everybody,
we're running a cluster of two hosts with tens of KVM guests (~45 running),
and I have now noticed that some nodes are losing link under heavy load.
The following appears in dmesg:
[ 422.077128] NETDEV WATCHDOG: eth0: transmit timed out
[ 422.077215] eth0: Transmit timeout, status d 2b 5 80ff
[root@sql1 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        144          0          0          0   IO-APIC-edge      timer
  1:        539          2          1          2   IO-APIC-edge      i8042
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 10:     756783     362345     372753     751385   IO-APIC-fasteoi   eth0
 11:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1
 12:        150          4          3          4   IO-APIC-edge      i8042
 14:     518448     528815     172232     348704   IO-APIC-edge      ide0
 15:          0          0          0          0   IO-APIC-edge      ide1
NMI:          0          0          0          0   Non-maskable interrupts
LOC:     829179     775992     505151     458761   Local timer interrupts
RES:     115772      98143      88928      82099   Rescheduling interrupts
CAL:         73        166        138        160   function call interrupts
TLB:     214586     255980      66806     278284   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
SPU:          0          0          0          0   Spurious interrupts
ERR:          0
MIS:       1261
I guess the MIS value might be related to this. So far I have observed this problem
on 32-bit guests only, but that might be a coincidence (the affected ones are heavily used).
It also seems that it *might* be related to SMP guests.
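To watch whether the MIS counter actually moves together with the timeouts, the output above can be parsed and sampled over time. The following is only a minimal Python sketch: the function name `parse_interrupts` and the trimmed `SAMPLE` text are illustrative assumptions, not part of any existing tool, and on a real guest the text would be read from /proc/interrupts instead.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: [per-CPU counts]}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counters = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for field in rest.split():
            # Stop at the first non-numeric field (controller / device name)
            # or once every CPU column has been read.
            if not field.isdigit() or len(counts) == ncpu:
                break
            counts.append(int(field))
        counters[label.strip()] = counts
    return counters


# Trimmed copy of the output above, used here as sample input (assumption:
# the shell prompt line is not part of the text being parsed).
SAMPLE = """\
 CPU0 CPU1 CPU2 CPU3
 10: 756783 362345 372753 751385 IO-APIC-fasteoi eth0
 14: 518448 528815 172232 348704 IO-APIC-edge ide0
ERR: 0
MIS: 1261
"""

counters = parse_interrupts(SAMPLE)
print("eth0 (IRQ 10) total:", sum(counters["10"]))
print("MIS:", sum(counters["MIS"]))
```

Taking two such snapshots a few minutes apart and comparing the MIS values around a transmit timeout would show whether the two are correlated; it only reads a file, so it should be safe on a production guest.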
Hosts are running 2.6.26.2-x86_64 + kvm-72; guests run 2.6.24 and use the 8139 virtual adapter.
I'm not sure whether we had this problem with older KVM versions (and thus whether this is a regression),
as the usage of the machines is growing constantly, so maybe we just didn't notice the problem before.
I CAN try other virtual adapters as well, but both machines are in production, so I have to be
a bit cautious when it comes to experimenting. I'll try to prepare a testing environment where
I can reproduce the problem.
But in the meantime, is there some way I could debug the problem further, but in a safe manner?
I don't see anything related in either the hosts' dmesg or the log files.
Thanks a lot in advance
BR
nik
--
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava
tel.: +420 596 603 142
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------
* Re: kvm-72: problems with 8139 under heavy load, lost interrupts?
From: Avi Kivity @ 2008-08-22 18:16 UTC
To: Nikola Ciprich; +Cc: KVM list, nikola.ciprich
What would be most useful is to verify that this reproduces reliably,
and a recipe for us to try out.
Also, how heavy is the load? Maybe it's so heavy that guests don't get
scheduled and really time out. Does the network recover if you ifdown/ifup?
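One hedged way to check how reliably this reproduces is to count the watchdog timeouts in a guest's dmesg output over a test run. A minimal Python sketch, assuming the message format shown in the original report; the names `WATCHDOG_RE` and `find_tx_timeouts` are illustrative only:

```python
import re

# Matches lines such as:
# [ 422.077128] NETDEV WATCHDOG: eth0: transmit timed out
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<dev>\S+): transmit timed out"
)


def find_tx_timeouts(dmesg_text):
    """Return a (timestamp_seconds, device) pair for each watchdog timeout line."""
    return [
        (float(m.group("ts")), m.group("dev"))
        for m in WATCHDOG_RE.finditer(dmesg_text)
    ]


# Sample input copied from the report; a real run would feed in saved dmesg output.
SAMPLE_DMESG = """\
[ 422.077128] NETDEV WATCHDOG: eth0: transmit timed out
[ 422.077215] eth0: Transmit timeout, status d 2b 5 80ff
"""

timeouts = find_tx_timeouts(SAMPLE_DMESG)
print(len(timeouts), "timeout(s):", timeouts)
```

Comparing the timestamps against the times the load was applied would give a rough idea of whether the timeouts follow the load or appear at random.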
--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
* Re: kvm-72: problems with 8139 under heavy load, lost interrupts?
From: Farkas Levente @ 2008-08-22 20:48 UTC
To: Avi Kivity; +Cc: Nikola Ciprich, KVM list, nikola.ciprich
Avi Kivity wrote:
> What would be most useful is to verify that this reproduces reliably,
> and a recipe for us to try out.
>
> Also, how heavy is the load? Maybe it's so heavy that guests don't get
> scheduled and really time out. Does the network recover if you ifdown/ifup?
The same happened to us. An easy way to reproduce it was to create a new
ISO image with revisor when it uses kickstart files via the given KVM
guest's NFS server.
--
Levente "Si vis pacem para bellum!"