[Qemu-devel] 8139cp problems - steps to reproduce

qemu-devel.nongnu.org archive mirror

* [Qemu-devel] 8139cp problems - steps to reproduce
@ 2008-09-08  7:57 Nikola Ciprich
  2008-09-08  9:59 ` [Qemu-devel] " Nikola Ciprich
  2008-09-08 19:54 ` [Qemu-devel] " Igor Kovalenko
  0 siblings, 2 replies; 5+ messages in thread
From: Nikola Ciprich @ 2008-09-08  7:57 UTC (permalink / raw)
  To: KVM list, qemu-devel; +Cc: nikola.ciprich, lfarkas

Hello Avi and everybody,

(and in advance, sorry for cross-posting).

As already reported, some people (including me :)) have problems
with the network getting stuck from time to time in KVM guests.

According to http://qemu-forum.ipi.fi/viewtopic.php?f=4&t=4563&start=0&st=0&sk=t&sd=a&sid=fcf252234991e017919ca7d0eb3799a3
the problem may not be KVM-specific.

I can confirm that the problem seems to occur after transferring a few gigabytes of data,
so it can be reproduced simply by starting a KVM guest, mounting an NFS share in it, and then
running in a shell loop: dd if=/mnt/nfs/bigimage.iso of=/dev/zero
After some runs (in my case usually after tens of GB), the problem occurs:
[ 2159.614496] NETDEV WATCHDOG: eth0: transmit timed out
[ 2159.614537] eth0: Transmit timeout, status  d   2b   15 80ff
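
For anyone who wants to script the reproduction, the read loop could be sketched roughly as below. This is only an illustration under assumptions: it reads the file from Python instead of calling dd, and the names read_and_discard and stress are made up here, not taken from any tool mentioned in this thread.

```python
import sys


def read_and_discard(path, chunk_size=1024 * 1024):
    """Read the whole file in chunks, throw the data away, return bytes read."""
    total = 0
    with open(path, "rb") as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def stress(path, runs):
    """Read the file `runs` times in a row; return the total bytes read."""
    total = 0
    for run in range(1, runs + 1):
        total += read_and_discard(path)
        # Progress goes to stderr so it is easy to see how much data
        # had been transferred when the network stalled.
        print(f"run {run}: {total} bytes read so far", file=sys.stderr)
    return total
```

Inside the guest, something like stress("/mnt/nfs/bigimage.iso", runs=100) would then correspond to the dd loop; the run count of 100 is an arbitrary choice.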

The status "  d   2b   15 80ff" is always the same on all test machines,
which according to 8139cp.c means:

Command register=d
C+ command register=2b
Interrupt status=15
Interrupt mask=80ff
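
A small sketch of how these four values could be broken down into individual bits. The bit names are an assumption: they are written from memory of the enum definitions in the Linux 8139cp.c driver (bit 0 of the command register follows the 8139too.c naming), so please check them against the actual kernel source before relying on them. The decode helper is made up for this example.

```python
# Bit position -> name. ASSUMPTION: names recalled from 8139cp.c / 8139too.c,
# not copied from the source; verify against your kernel tree.
CMD_BITS = {4: "CmdReset", 3: "RxOn", 2: "TxOn", 0: "RxBufEmpty"}

CPCMD_BITS = {
    6: "RxVlanOn", 5: "RxChkSum", 4: "PCIDAC",
    3: "PCIMulRW", 1: "CpRxOn", 0: "CpTxOn",
}

INTR_BITS = {
    15: "PciErr", 14: "TimerIntr", 13: "LenChg", 8: "SWInt",
    7: "TxEmpty", 6: "RxFIFOOvr", 5: "LinkChg", 4: "RxEmpty",
    3: "TxErr", 2: "TxOK", 1: "RxErr", 0: "RxOK",
}


def decode(value, names):
    """Return the names of all bits set in `value`, lowest bit first."""
    result = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            # Bits without a known name are reported by position.
            result.append(names.get(bit, f"bit{bit}"))
        bit += 1
    return result


if __name__ == "__main__":
    print("Command register: ", decode(0x0D, CMD_BITS))
    print("C+ command:       ", decode(0x2B, CPCMD_BITS))
    print("Interrupt status: ", decode(0x15, INTR_BITS))
    print("Interrupt mask:   ", decode(0x80FF, INTR_BITS))
```

With these tables, the interrupt status 0x15 decodes to RxOK, TxOK and RxEmpty, where RxEmpty is the driver's name for "no Rx descriptors available".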

The individual bits are explained in the 8139cp comments; unfortunately this didn't make me any smarter :(.
The only thing I tried was disabling rx/tx checksumming for the interface (this was needed
for Xen domUs as well), but it didn't help.

It is important to note that the problem is easily reproducible this way for x86_32 guests (I'm
using an x86_64 host). For x86_64 guests, the problem is actually much WORSE, as it usually gets
the host machine into a totally unusable state (it replies to pings, but that's all; no messages in
the logs after reboot, etc.). I'll try to investigate it further.

Another important note: the problem is certainly NOT system-load related; it
occurs even when the machine is idle (apart from the load caused by the network dd).

I'm using kvm-74 now, with a 2.6.26 host and a 2.6.24 guest, and bridged networking.

I'll try using the e1000 driver, but I think that 8139cp is ATM considered the most stable choice, right?

So does somebody have an idea where the problem could be? Of course I'll be glad to (try to) help
with debugging...

Thanks a lot in advance!

nik

-- 
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------


* [Qemu-devel] Re: 8139cp problems - steps to reproduce
  2008-09-08  7:57 [Qemu-devel] 8139cp problems - steps to reproduce Nikola Ciprich
@ 2008-09-08  9:59 ` Nikola Ciprich
  2008-09-08 19:54 ` [Qemu-devel] " Igor Kovalenko
  1 sibling, 0 replies; 5+ messages in thread
From: Nikola Ciprich @ 2008-09-08  9:59 UTC (permalink / raw)
  To: KVM list, qemu-devel; +Cc: nikola.ciprich, lfarkas

Well, with e1000 the problem shows up much faster, probably already after a few hundred MB.
n.


* Re: [Qemu-devel] 8139cp problems - steps to reproduce
  2008-09-08  7:57 [Qemu-devel] 8139cp problems - steps to reproduce Nikola Ciprich
  2008-09-08  9:59 ` [Qemu-devel] " Nikola Ciprich
@ 2008-09-08 19:54 ` Igor Kovalenko
  2008-09-10  6:55   ` Nikola Ciprich
  1 sibling, 1 reply; 5+ messages in thread
From: Igor Kovalenko @ 2008-09-08 19:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: nikola.ciprich, KVM list, lfarkas

On Mon, Sep 8, 2008 at 11:57 AM, Nikola Ciprich <extmaillist@linuxbox.cz> wrote:
> The status "  d   2b   15 80ff" is always the same, on all testing machines
> which according to 8139cp.c means
>
> Command register=d
Receiver enabled, transmitter enabled, receive buffer empty.
> C+ command register=2b
C+ mode receiver enabled, transmitter enabled, receive checksum
offloading enabled, unsupported PCI multiple read/write enabled.
> Interrupt status=15
Receiver overflow! Tx OK, Rx OK.
> Interrupt mask=80ff

> So does somebody have an idea where the problem could be? Of course I'll be glad to (try to) help
> with debugging...

If it is not the guest networking... please check whether the rx missed
counter is non-zero (I'm not sure how; probably with ethtool or netstat -i).
You can also try enabling the overflow debugging statements near the lines containing

s->IntrStatus |= RxOverflow

... there are 3 of these. There is a small chance that rtl8139
should not set RxOK when it has an overflow, or that the guest driver does
not expect that combination (which seems rather valid to me: the card may
have filled the whole descriptor space with received data and missed the
last packet before the driver was able to read and vacate some buffers).

-- 
Kind regards,
Igor V. Kovalenko


* Re: [Qemu-devel] 8139cp problems - steps to reproduce
  2008-09-08 19:54 ` [Qemu-devel] " Igor Kovalenko
@ 2008-09-10  6:55   ` Nikola Ciprich
  2008-09-10  9:35     ` Nikola Ciprich
  0 siblings, 1 reply; 5+ messages in thread
From: Nikola Ciprich @ 2008-09-10  6:55 UTC (permalink / raw)
  To: Igor Kovalenko; +Cc: nikola.ciprich, qemu-devel, KVM list, lfarkas

On Mon, Sep 08, 2008 at 11:54:01PM +0400, Igor Kovalenko wrote:
Hi Igor,

I'm using a bridged TAP device. Running ethtool -S eth0 (while it's still working)
shows an increasing rx_err count (rx_fifo is also increasing; I don't know what that means).
tx_ok and rx_ok are also increasing, of course. Then, when the network hangs, tx_ok and rx_ok stop
increasing, and only rx_err and rx_fifo keep growing...
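
To make this comparison repeatable, two samples of the ethtool -S output can be diffed. The following is a rough sketch under assumptions: it expects the usual "name: value" lines that ethtool -S prints, and the helper names parse_ethtool_stats, growing_counters and sample are made up for this example.

```python
import subprocess


def parse_ethtool_stats(text):
    """Parse `ethtool -S` output ("name: value" lines) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # The "NIC statistics:" header has no number after the colon
        # and is skipped by this check.
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def growing_counters(before, after):
    """Return {name: delta} for counters that increased between two samples."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


def sample(interface):
    """Run `ethtool -S <interface>` and return the parsed counters."""
    out = subprocess.run(
        ["ethtool", "-S", interface],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_ethtool_stats(out)
```

Taking sample("eth0") twice, a few seconds apart, and passing both results to growing_counters would show at a glance whether tx_ok and rx_ok have stopped while rx_err and rx_fifo still grow.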

rtl8139-diag says, among other things:

Transmitter enabled with NONSTANDARD! settings, maximum burst 1024 bytes.
I don't know if this is related...

When trying a 2.6.26 x86-32 guest, I also got this:
[ 3719.662726] eth0: Transmit timeout, status  d   2b   15 80ff
[ 3719.662968] ------------[ cut here ]------------
[ 3719.662970] WARNING: at net/sched/sch_generic.c:223 dev_watchdog+0xde/0xf0()
[ 3719.662972] Modules linked in: nfs lockd sunrpc ipv6 dm_mod sbshc container button battery ac i2c_piix4 i2c_core 8139cp mii bitrev crc32 piix pata_acpi ide_pci_generic ata_piix ata_generic libata sd_mod scsi_mod dock ide_disk ide_core ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: x_tables]
[ 3719.663087] Pid: 0, comm: swapper Not tainted 2.6.26lb.01 #1
[ 3719.663091]  [<c012477f>] warn_on_slowpath+0x5f/0x90
[ 3719.663112]  [<e08fab2c>] cp_interrupt+0x25c/0x440 [8139cp]
[ 3719.663123]  [<c013ca3d>] getnstimeofday+0x3d/0xe0
[ 3719.663136]  [<c0113769>] ack_ioapic_quirk_irq+0x9/0x80
[ 3719.663143]  [<c015604c>] handle_fasteoi_irq+0x8c/0xe0
[ 3719.663151]  [<c0105e30>] do_IRQ+0x40/0x70
[ 3719.663155]  [<c0108aaa>] nommu_map_single+0x2a/0x60
[ 3719.663158]  [<c02b2840>] dev_watchdog+0x0/0xf0
[ 3719.663161]  [<c0103a1f>] common_interrupt+0x23/0x28
[ 3719.663164]  [<c02b2840>] dev_watchdog+0x0/0xf0
[ 3719.663168]  [<c02b291e>] dev_watchdog+0xde/0xf0
[ 3719.663171]  [<c012d2a6>] run_timer_softirq+0x116/0x180
[ 3719.663179]  [<c01296b2>] __do_softirq+0x72/0xf0
[ 3719.663183]  [<c0129767>] do_softirq+0x37/0x40
[ 3719.663186]  [<c0129ab7>] irq_exit+0x57/0x70
[ 3719.663189]  [<c0111e38>] smp_apic_timer_interrupt+0x58/0x90
[ 3719.663193]  [<c0156bc7>] rcu_pending+0x37/0x50
[ 3719.663195]  [<c0101bd0>] default_idle+0x0/0x40
[ 3719.663210]  [<c0103adc>] apic_timer_interrupt+0x28/0x30
[ 3719.663212]  [<c0101bd0>] default_idle+0x0/0x40
[ 3719.663216]  [<c0101bfe>] default_idle+0x2e/0x40
[ 3719.663219]  [<c0101af3>] cpu_idle+0x53/0xc0
[ 3719.663224]  =======================
[ 3719.663225] ---[ end trace a8aacd1e409fdad8 ]---

Maybe this could help?


* Re: [Qemu-devel] 8139cp problems - steps to reproduce
  2008-09-10  6:55   ` Nikola Ciprich
@ 2008-09-10  9:35     ` Nikola Ciprich
  0 siblings, 0 replies; 5+ messages in thread
From: Nikola Ciprich @ 2008-09-10  9:35 UTC (permalink / raw)
  To: Igor Kovalenko; +Cc: nikola.ciprich, qemu-devel, KVM list, lfarkas

Hello one more time,

I've just found an older thread on this topic, which concluded that
disabling the APIC fixes the problem. I tried it, and ... it works!
So as a workaround it's OK for me for now, but I guess this should still be
considered a bug...
If I can help in any way with fixing this, please let me know...

regards

nik

On Wed, Sep 10, 2008 at 08:55:26AM +0200, Nikola Ciprich wrote:
> On Mon, Sep 08, 2008 at 11:54:01PM +0400, Igor Kovalenko wrote:
> Hi Igor,
> 
> I'm using bridged TAP device. Running ethtool -S eth0 (when it's still working)
> shows increasing number of rx_err (rx_fifo is also increasing, dunno what that means).
> tx_ok, rx_ok are also increasing of course. Then when the network hangs, tx_ok, rx_ok stop
> increasing, and only rx_err and rx_fifo keep growing...
> 
> rtl8139-diag says among others:
> 
> Transmitter enabled with NONSTANDARD! settings, maximum burst 1024 bytes.
> dunno if this is related...
> 
> when trying 2.6.26 x86-32 guest, i also got this:
> [ 3719.662726] eth0: Transmit timeout, status  d   2b   15 80ff
> [ 3719.662968] ------------[ cut here ]------------
> [ 3719.662970] WARNING: at net/sched/sch_generic.c:223 dev_watchdog+0xde/0xf0()
> [ 3719.662972] Modules linked in: nfs lockd sunrpc ipv6 dm_mod sbshc container button battery ac i2c_piix4 i2c_core 8139cp mii bitrev crc32 piix pata_acpi ide_pci_generic ata_piix ata_generic libata sd_mod scsi_mod dock ide_disk ide_core ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: x_tables]
> [ 3719.663087] Pid: 0, comm: swapper Not tainted 2.6.26lb.01 #1
> [ 3719.663091]  [<c012477f>] warn_on_slowpath+0x5f/0x90
> [ 3719.663112]  [<e08fab2c>] cp_interrupt+0x25c/0x440 [8139cp]
> [ 3719.663123]  [<c013ca3d>] getnstimeofday+0x3d/0xe0
> [ 3719.663136]  [<c0113769>] ack_ioapic_quirk_irq+0x9/0x80
> [ 3719.663143]  [<c015604c>] handle_fasteoi_irq+0x8c/0xe0
> [ 3719.663151]  [<c0105e30>] do_IRQ+0x40/0x70
> [ 3719.663155]  [<c0108aaa>] nommu_map_single+0x2a/0x60
> [ 3719.663158]  [<c02b2840>] dev_watchdog+0x0/0xf0
> [ 3719.663161]  [<c0103a1f>] common_interrupt+0x23/0x28
> [ 3719.663164]  [<c02b2840>] dev_watchdog+0x0/0xf0
> [ 3719.663168]  [<c02b291e>] dev_watchdog+0xde/0xf0
> [ 3719.663171]  [<c012d2a6>] run_timer_softirq+0x116/0x180
> [ 3719.663179]  [<c01296b2>] __do_softirq+0x72/0xf0
> [ 3719.663183]  [<c0129767>] do_softirq+0x37/0x40
> [ 3719.663186]  [<c0129ab7>] irq_exit+0x57/0x70
> [ 3719.663189]  [<c0111e38>] smp_apic_timer_interrupt+0x58/0x90
> [ 3719.663193]  [<c0156bc7>] rcu_pending+0x37/0x50
> [ 3719.663195]  [<c0101bd0>] default_idle+0x0/0x40
> [ 3719.663210]  [<c0103adc>] apic_timer_interrupt+0x28/0x30
> [ 3719.663212]  [<c0101bd0>] default_idle+0x0/0x40
> [ 3719.663216]  [<c0101bfe>] default_idle+0x2e/0x40
> [ 3719.663219]  [<c0101af3>] cpu_idle+0x53/0xc0
> [ 3719.663224]  =======================
> [ 3719.663225] ---[ end trace a8aacd1e409fdad8 ]---
> 
> maybe it could help?
> 
> > On Mon, Sep 8, 2008 at 11:57 AM, Nikola Ciprich <extmaillist@linuxbox.cz> wrote:
> > 
> > If it is not guest networking... please check if rx missed is not zero
> > (no idea how, probably with ethtool or with netstat -i)
> > You can also try enabling overflow debugging statements near lines where
> > 
> > s->IntrStatus |= RxOverflow
> > 
> > ... there are 3 of these. There is quite low probability that rtl8139
> > should not set RxOK if it has overflow, or that guest driver does not
> > expect it in that combination (which seems to be rather valid in case
> > card received full set of descriptors space of data and missed last
> > packet before driver was able to read and vacate some buffers.)
> > 
> > -- 
> > Kind regards,
> > Igor V. Kovalenko
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> -- 
> -------------------------------------
> Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 01 Ostrava
> 
> tel.:   +420 596 603 142
> fax:    +420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
> 
> mobil servis: +420 737 238 656
> email servis: servis@linuxbox.cz
> -------------------------------------
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
