kvm.vger.kernel.org archive mirror
* [regression] virtio net locks up
@ 2012-01-11 15:24 Bernd Schubert
  2012-01-11 15:39 ` Bernd Schubert
  2012-01-11 16:04 ` Stefan Hajnoczi
  0 siblings, 2 replies; 13+ messages in thread
From: Bernd Schubert @ 2012-01-11 15:24 UTC (permalink / raw)
  To: kvm

I have no idea what is going on, but recent guest kernels lock up here
after transferring some amount of data. So far, 2.6.32 is the last
working kernel I have tested and 3.0 is the first non-working one I
tested.

How to reproduce:

vm1: iperf  -c vm2
vm2: iperf -s vm1

After some time, one of the two VMs (it can be either) can no longer be
pinged, neither from the host nor from the other (still working) VM.
Access to the affected VM via its console still works fine.


It also doesn't matter whether I run with vhost on or off; it fails in both modes.

qemu-kvm version is 1.0.

Here's my qemu-kvm start-up script:

> #! /bin/bash
>
> source  ~/bin/kvm-config.sh
>
> iface=`sudo tunctl -b -u $USER`
> FILE=${IMAGE_DIR}/squeeze1.img
> #NICMODEL=e1000
> NICMODEL=virtio
>
>
> DISKIF=virtio
> #DISKIF=ide
> #DISKIF=scsi
>
> ${kvm}                                                                  \
>         -m 4096                                                         \
>         -net nic,macaddr=52:54:00:12:34:11,model=${NICMODEL}            \
>         -net tap,id=foo,script=${HOME}/bin/kvm-ifup,downscript=${HOME}/bin/kvm-ifdown,ifname=$iface,vhost=on                    \
>         -boot c                                                         \
>         -drive file=${FILE},if=${DISKIF},boot=on,cache=writeback        \
>         ${common_opts}                                                  \
>         "$@"
>
> sudo /usr/sbin/tunctl -d $iface



Any idea what is going on or how to debug it?


Thanks,
Bernd



* Re: [regression] virtio net locks up
  2012-01-11 15:24 [regression] virtio net locks up Bernd Schubert
@ 2012-01-11 15:39 ` Bernd Schubert
  2012-01-11 16:04 ` Stefan Hajnoczi
  1 sibling, 0 replies; 13+ messages in thread
From: Bernd Schubert @ 2012-01-11 15:39 UTC (permalink / raw)
  To: kvm

On 01/11/2012 04:24 PM, Bernd Schubert wrote:
> No idea what is going on, but recent kernels lock up here after
> transferring some amount of data. So far I only know that 2.6.32 is the
> last working kernel I have tested and 3.0 is the first non-working
> version I tested.

Sorry, I forgot to mention the host kernel version:
- it has not been updated and is always 2.6.32-131.6.1.el6.x86_64 (i.e. RHEL6)


Cheers,
Bernd



* Re: [regression] virtio net locks up
  2012-01-11 15:24 [regression] virtio net locks up Bernd Schubert
  2012-01-11 15:39 ` Bernd Schubert
@ 2012-01-11 16:04 ` Stefan Hajnoczi
  2012-01-11 16:18   ` Bernd Schubert
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2012-01-11 16:04 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm

On Wed, Jan 11, 2012 at 3:24 PM, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> Any idea what is going on or how to debug it?

Here are a couple of ideas that would yield more information:

Since the console still works I suggest checking dmesg output inside
the guest.  Are there any error messages at the bottom?

Try pinging the host's IP address from inside the guest.  Run tcpdump
on the guest's tap interface from the host and observe whether or not
you see any packets being sent from the guest.

rmmod virtio_net inside the guest and then modprobe virtio_net again.
See if network connectivity is restored (remember to rerun DHCP or
whatever, if necessary).
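The checks above could be bundled into a small shell sketch like the
following. It is illustrative only: the function names, the DRY_RUN
switch, the tcpdump filter and the guest interface name "eth0" are
assumptions, not something from this thread. The tap device passed to
host_watch_tap would be the one tunctl created in the start-up script.

```shell
#!/bin/bash
# Sketch of the debugging checks suggested above. With DRY_RUN=1 the
# commands are only printed, which is handy for reviewing them first.

run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Guest: look for error messages at the bottom of the kernel log.
guest_check_dmesg() {
  run dmesg
}

# Guest: ping the host; $1 is the host's IP address.
guest_ping_host() {
  run ping -c 3 "$1"
}

# Host: watch ARP and ICMP traffic on the guest's tap device ($1).
host_watch_tap() {
  run sudo tcpdump -n -i "$1" arp or icmp
}

# Guest: reload the driver and bring the interface back up (ifup is
# assumed to rerun DHCP if the interface is configured that way).
# $1 is the guest interface name; "eth0" is only an assumed default.
guest_reload_virtio_net() {
  local ifname=${1:-eth0}
  run ifdown "$ifname"
  run rmmod virtio_net
  run modprobe virtio_net
  run ifup "$ifname"
}
```

For example, run `host_watch_tap "$iface"` on the host while calling
`guest_ping_host` from the guest console.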

Stefan


* Re: [regression] virtio net locks up
  2012-01-11 16:04 ` Stefan Hajnoczi
@ 2012-01-11 16:18   ` Bernd Schubert
  2012-01-11 17:09     ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2012-01-11 16:18 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kvm

Hello Stefan,

thanks for your help!

On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
> On Wed, Jan 11, 2012 at 3:24 PM, Bernd Schubert
> <bernd.schubert@itwm.fraunhofer.de>  wrote:
>> Any idea what is going on or how to debug it?
>
> Here are a couple of ideas that would yield more information:
>
> Since the console still works I suggest checking dmesg output inside
> the guest.  Are there any error messages at the bottom?

No, absolutely nothing in dmesg.

>
> Try pinging the host's IP address from inside the guest.  Run tcpdump
> on the guest's tap interface from the host and observe whether or not
> you see any packets being sent from the guest.

It seems ARP requests are still going out, but the replies don't get into the guest:

17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui Unknown), length 28
17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown), length 28
17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28
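To spot this pattern in a longer capture, one option is to flag ARP
requests for an address whose reply has already crossed the tap: a
guest that keeps asking suggests the replies are not being delivered
to it. This is only a heuristic sketch, assuming tcpdump's one-line ARP
output format shown above (unwrapped); flag_repeated_arp is a
hypothetical helper.

```shell
#!/bin/bash
# Reads tcpdump ARP lines on stdin and prints "<timestamp> <address>" for
# every request whose answer already appeared earlier in the capture.
flag_repeated_arp() {
  awk '
    / ARP, Reply / {
      # "<addr> is-at <mac>": remember that <addr> has been answered
      for (i = 2; i <= NF; i++)
        if ($i == "is-at") answered[$(i - 1)] = 1
    }
    / ARP, Request who-has / {
      # "who-has <addr> tell <sender>": flag it if <addr> was answered before
      for (i = 1; i < NF; i++)
        if ($i == "who-has" && ($(i + 1) in answered)) print $1, $(i + 1)
    }
  '
}
```

For example: `sudo tcpdump -l -n -i "$iface" arp | flag_repeated_arp`.
On the four lines above it reports only the 17:16:22 request for
192.168.123.1.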

>
> rmmod virtio_net inside the guest and then modprobe virtio_net again.
> See if network connectivity is restored (remember to rerun DHCP or
> whatever, if necessary).

Yep, that makes it work again. But it is probably not the real solution ;)


Thanks,
Bernd



* Re: [regression] virtio net locks up
  2012-01-11 16:18   ` Bernd Schubert
@ 2012-01-11 17:09     ` Stefan Hajnoczi
  2012-07-30 17:33       ` Bernd Schubert
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2012-01-11 17:09 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm

On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
>> Try pinging the host's IP address from inside the guest.  Run tcpdump
>> on the guest's tap interface from the host and observe whether or not
>> you see any packets being sent from the guest.
>
>
> Seems arp requests are still going out, but then don't go in:
>
> 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui
> Unknown), length 28
> 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
> 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown),
> length 28
> 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28

Okay, so it seems networking from the tap device and beyond is fine.

>> rmmod virtio_net inside the guest and then modprobe virtio_net again.
>> See if network connectivity is restored (remember to rerun DHCP or
>> whatever, if necessary).
>
>
> Yep, that makes it work again. But probably is not the real solution ;)

It's just another piece of information which helps debug this :).  At
least nothing has wedged itself into an unrecoverable state.

When you said the problem happens without vhost, did you explicitly
pass vhost=off?  Or did you just omit "vhost=on"?

This sounds like a guest kernel/driver issue.  I recommend testing
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
the guest to see if this has already been fixed.

If you have the -dbg RPMs installed it may be possible to insert a
probe into the virtio_net kernel module and observe receive
interrupts.  This does require the right kernel CONFIG_ options, but
you might already have them enabled:

$ sudo perf probe --add skb_recv_done
$ sudo perf record -e probe:skb_recv_done -a
...send some packets to the guest...
^C
$ sudo perf script

If you see no skb_recv_done events then the guest driver is not
receiving a notification when packets are received.
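Rather than eyeballing the perf script output, the event count can be
checked mechanically. A sketch, assuming the default perf script line
format, in which the event name appears followed by a colon;
count_probe_hits is a hypothetical helper whose exit status is non-zero
when the probe never fired.

```shell
#!/bin/bash
# Reads `perf script` output on stdin and reports how often the probe
# event named in $1 (e.g. probe:skb_recv_done) fired.
count_probe_hits() {
  local event=$1 n
  # grep -c prints 0 and exits 1 when nothing matches; keep the 0.
  n=$(grep -c -F -- "$event:" || true)
  printf '%s: %s hits\n' "$event" "$n"
  [ "$n" -gt 0 ]   # exit status 1 when the probe never fired
}
```

For example: `sudo perf script | count_probe_hits probe:skb_recv_done`.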

You can find more about how to use perf-probe(1) at
http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.

Stefan


* Re: [regression] virtio net locks up
  2012-01-11 17:09     ` Stefan Hajnoczi
@ 2012-07-30 17:33       ` Bernd Schubert
  2012-07-30 18:08         ` Bernd Schubert
  0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2012-07-30 17:33 UTC (permalink / raw)
  To: kvm

Hello Stefan,

Stefan Hajnoczi <stefanha <at> gmail.com> writes:
> 
> On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
> <bernd.schubert <at> itwm.fraunhofer.de> wrote:
> > On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
> >> Try pinging the host's IP address from inside the guest.  Run tcpdump
> >> on the guest's tap interface from the host and observe whether or not
> >> you see any packets being sent from the guest.
> >


Sorry for my terribly late reply. As usual, I got distracted by too many other
things, and then I returned the hardware I was running the VMs on. My new desktop
system is better suited to running KVM, and I can now easily reproduce the problem
with 3.5 on both the host and guest side. So it is not fixed in recent versions yet.


> >
> > Seems arp requests are still going out, but then don't go in:
> >
> > 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui
> > Unknown), length 28
> > 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
> > 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown),
> > length 28
> > 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28
> 
> Okay, so it seems networking from the tap device and beyond is fine.
> 
> >> rmmod virtio_net inside the guest and then modprobe virtio_net again.
> >> See if network connectivity is restored (remember to rerun DHCP or
> >> whatever, if necessary).
> >
> >
> > Yep, that makes it work again. But probably is not the real solution ;)
> 
> It's just another piece of information which helps debug this :).  At
> least nothing has wedged itself into an unrecoverable state.
> 
> When you said the problem happens without vhost, did you explicitly
> run vhost=off?  Or did you just omit "vhost=on"?

It was definitely off, and I can confirm that with 3.5 it locks up with both
vhost=on and vhost=off.

> 
> This sounds like a guest kernel/driver issue.  I recommend testing
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
> the guest to see if this has already been fixed.
> 
> If you have the -dbg RPMs installed it may be possible to insert a
> probe into the virtio_net kernel module and observe receive
> interrupts.  This does require the right kernel CONFIG_ but you might
> already have it enabled:
> 
> $ sudo perf probe --add skb_recv_done
> $ sudo perf record -e probe:skb_recv_done -a
> ...send some packets to the guest...
> ^C
> $ sudo perf script
> 
> If you see no skb_recv_done events then the guest driver is not
> receiving a notification when packets are received.
> 
> You can find more about how to use perf-probe(1) at
> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.

Ah nice, I would have used SystemTap, but I always wanted to find out how to do
it with perf :)

So once the virtio NIC has locked up, I don't get any events from it anymore,
until I remove and re-insert the virtio module (including ifdown/ifup). I will
try to find some time later this week to look into it again.
Any further ideas on how to proceed? (I haven't even checked yet how virtio
works at all...)

Thanks,
Bernd



* Re: [regression] virtio net locks up
  2012-07-30 17:33       ` Bernd Schubert
@ 2012-07-30 18:08         ` Bernd Schubert
  2012-07-31 10:23           ` Stefan Hajnoczi
  2012-08-12 11:45           ` Michael S. Tsirkin
  0 siblings, 2 replies; 13+ messages in thread
From: Bernd Schubert @ 2012-07-30 18:08 UTC (permalink / raw)
  To: kvm; +Cc: stefanha

On 07/30/2012 07:33 PM, Bernd Schubert wrote:
> So once the virtio NIC has locked up, I don't get any events from it anymore -
> until I remove/re-insert the virtio module (including ifup/ifdown). I will try
> to find some time later on this week to look into it again.
> Any further ideas how to proceed (I haven't even checked yet how virtio works at
> all...).


I took a quick look at where skb_recv_done is registered and traced it
back to vp_find_vqs(). Looking into that function, I noticed the MSI
setup, so I tried booting with pci=nomsi. And indeed I guessed right:
with pci=nomsi I don't get any lockups anymore.
Am I the only one who usually boots qemu-kvm guests with MSI enabled?
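To confirm which interrupt mode a guest actually ended up with, one
could check the kernel command line and the virtio entries in
/proc/interrupts. A sketch, assuming that on these kernels MSI/MSI-X
vectors show up with an "MSI" chip name (e.g. PCI-MSI-edge) in
/proc/interrupts and legacy interrupts do not; both helper names are
hypothetical.

```shell
#!/bin/bash
# Succeeds if the kernel was booted with pci=nomsi.
# $1 defaults to /proc/cmdline (a file path can be passed for testing).
nomsi_booted() {
  grep -qw -- 'pci=nomsi' "${1:-/proc/cmdline}"
}

# Prints "msi" if any virtio line in /proc/interrupts mentions MSI,
# "legacy" if virtio lines exist but none do, and "none" otherwise.
virtio_irq_mode() {
  awk '
    /virtio/ { found = 1; if ($0 ~ /MSI/) msi = 1 }
    END {
      if (!found) print "none"
      else if (msi) print "msi"
      else print "legacy"
    }
  ' "${1:-/proc/interrupts}"
}
```

With pci=nomsi in effect, virtio_irq_mode would be expected to print
"legacy".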

Cheers,
Bernd


* Re: [regression] virtio net locks up
  2012-07-30 18:08         ` Bernd Schubert
@ 2012-07-31 10:23           ` Stefan Hajnoczi
  2012-08-01 17:05             ` Bernd Schubert
  2012-08-12 11:45           ` Michael S. Tsirkin
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Hajnoczi @ 2012-07-31 10:23 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm, Michael S. Tsirkin

On Mon, Jul 30, 2012 at 7:08 PM, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> I took a quick glance where skb_recv_done is registered at all and traced it
> back to vp_find_vqs(). Looking into that function I noticed MSI and so tried
> to boot with pci=nomsi. And indeed I guessed it right, with pci=nomsi I
> don't get any lockups anymore.
> Am I the only one booting kvm-qemu usually with enabled MSI?

MSI enabled is good and is the default with modern qemu-kvm + guest OSes.

Michael: Any ideas?

Stefan


* Re: [regression] virtio net locks up
  2012-07-31 10:23           ` Stefan Hajnoczi
@ 2012-08-01 17:05             ` Bernd Schubert
  2012-08-02 10:59               ` Stefan Hajnoczi
  0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2012-08-01 17:05 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kvm, Michael S. Tsirkin

On 07/31/2012 12:23 PM, Stefan Hajnoczi wrote:
> On Mon, Jul 30, 2012 at 7:08 PM, Bernd Schubert

>>
>> I took a quick glance where skb_recv_done is registered at all and traced it
>> back to vp_find_vqs(). Looking into that function I noticed MSI and so tried
>> to boot with pci=nomsi. And indeed I guessed it right, with pci=nomsi I
>> don't get any lockups anymore.
>> Am I the only one booting kvm-qemu usually with enabled MSI?
>
> MSI enabled is good and is the default with modern qemu-kvm + guest OSes.
>
> Michael: Any ideas?

I just tried booting with -net...,vectors=0 and -net...,vectors=32; it
locks up both times. So MSI-X vs. MSI does not make a difference.
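For reference, the vectors= suboption of -net nic sets the number of
MSI-X vectors for virtio NICs, and vectors=0 disables MSI-X. A minimal
sketch of how the start-up script above could make this switchable;
nic_opt is a hypothetical helper that reuses the script's MAC address.

```shell
#!/bin/bash
# Build the -net nic argument: $1 = NIC model, $2 = optional MSI-X vector count.
nic_opt() {
  local opt="nic,macaddr=52:54:00:12:34:11,model=$1"
  if [ -n "${2:-}" ]; then
    opt="$opt,vectors=$2"   # vectors=0 disables MSI-X for this NIC
  fi
  printf '%s\n' "$opt"
}
```

In the script, `-net "$(nic_opt virtio 0)"` would take the place of the
existing `-net nic,macaddr=...,model=${NICMODEL}` line.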

I don't have time to track this down further right now. Shall I open
a Bugzilla ticket so that it won't be forgotten?
I am also inclined to make a simple patch that lets virtio-net always
use normal interrupts, as users usually don't like lockups... But then
I still don't understand why I seem to be the only one running into it
(or at least the only one complaining loudly).

Cheers,
Bernd



* Re: [regression] virtio net locks up
  2012-08-01 17:05             ` Bernd Schubert
@ 2012-08-02 10:59               ` Stefan Hajnoczi
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Hajnoczi @ 2012-08-02 10:59 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm, Michael S. Tsirkin

On Wed, Aug 1, 2012 at 6:05 PM, Bernd Schubert
<bernd.schubert@itwm.fraunhofer.de> wrote:
> On 07/31/2012 12:23 PM, Stefan Hajnoczi wrote:
>>
>> On Mon, Jul 30, 2012 at 7:08 PM, Bernd Schubert
>
>
>>>
>>> I took a quick glance where skb_recv_done is registered at all and traced
>>> it
>>> back to vp_find_vqs(). Looking into that function I noticed MSI and so
>>> tried
>>> to boot with pci=nomsi. And indeed I guessed it right, with pci=nomsi I
>>> don't get any lockups anymore.
>>> Am I the only one booting kvm-qemu usually with enabled MSI?
>>
>>
>> MSI enabled is good and is the default with modern qemu-kvm + guest OSes.
>>
>> Michael: Any ideas?
>
>
> I just tried to boot with -net...,vectors=0 and -net...,vectors=32, both
> time it locks up. So MSI-X vs. MSI does not make a difference.
>
> I don't have any time to further track that down right now. Shall I open a
> bugzilla ticket so that it won't be forgotten?
> And I also tend to make a simple patch to let virtio-net always use normal
> interrupts as users usually don't like lock ups... But then I still don't
> understand why I seem to be the only one running into it (or at least I'm
> the only one complaining loudly).

Filing a bug is a good idea.  I haven't seen this MSI issue reported elsewhere.

Stefan


* Re: [regression] virtio net locks up
  2012-07-30 18:08         ` Bernd Schubert
  2012-07-31 10:23           ` Stefan Hajnoczi
@ 2012-08-12 11:45           ` Michael S. Tsirkin
  2012-08-17 23:25             ` Bernd Schubert
  1 sibling, 1 reply; 13+ messages in thread
From: Michael S. Tsirkin @ 2012-08-12 11:45 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm, stefanha

On Mon, Jul 30, 2012 at 08:08:31PM +0200, Bernd Schubert wrote:
> I took a quick glance where skb_recv_done is registered at all and
> traced it back to vp_find_vqs(). Looking into that function I
> noticed MSI and so tried to boot with pci=nomsi. And indeed I
> guessed it right, with pci=nomsi I don't get any lockups anymore.
> Am I the only one booting kvm-qemu usually with enabled MSI?
> 
> Cheers,
> Bernd

No :)

I am guessing it has to do with OOM handling in the guest:
it is tested very little, but maybe your guest is such that the atomic
pool gets exhausted for some reason.
Could you please check whether refill_work runs, by tracing it?
This is our OOM handler.
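Tracing refill_work can reuse the perf-probe recipe given earlier in
the thread for skb_recv_done. A sketch, assuming the same
prerequisites (virtio_net debug info and kprobe event support in the
guest kernel); the helper names, the DRY_RUN switch and the use of
`-- sleep` for a fixed recording window instead of ^C are illustrative
choices.

```shell
#!/bin/bash
# With DRY_RUN=1 the commands are only printed.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Probe function $1, record system-wide for $2 seconds (default 30)
# while traffic is running, print the samples, then remove the probe.
probe_function() {
  local fn=$1 secs=${2:-30}
  run sudo perf probe --add "$fn"
  run sudo perf record -e "probe:$fn" -a -- sleep "$secs"
  run sudo perf script
  run sudo perf probe --del "probe:$fn"
}
```

For example, run `probe_function refill_work 60` in the guest while the
iperf test from the first mail is running; no probe:refill_work lines in
the perf script output would mean refill_work did not run in that window.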


-- 
MST


* Re: [regression] virtio net locks up
  2012-08-12 11:45           ` Michael S. Tsirkin
@ 2012-08-17 23:25             ` Bernd Schubert
  2012-08-19  9:01               ` Michael S. Tsirkin
  0 siblings, 1 reply; 13+ messages in thread
From: Bernd Schubert @ 2012-08-17 23:25 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: kvm, stefanha

On 08/12/2012 01:45 PM, Michael S. Tsirkin wrote:
> On Mon, Jul 30, 2012 at 08:08:31PM +0200, Bernd Schubert wrote:
>> I took a quick glance where skb_recv_done is registered at all and
>> traced it back to vp_find_vqs(). Looking into that function I
>> noticed MSI and so tried to boot with pci=nomsi. And indeed I
>> guessed it right, with pci=nomsi I don't get any lockups anymore.
>> Am I the only one booting kvm-qemu usually with enabled MSI?
>>
>> Cheers,
>> Bernd
> 
> No :)
> 
> I am guessing it has to do with OOM handling in the guest -
> it is tested very little but maybe your guest is such that atomic
> pool gets exhausted for some reason.
> Could you pls check whether refill_work runs by tracing it?
> This is our OOM handler.
> 
> 

Just checked: it does not show up in the perf script output.


Cheers,
Bernd


* Re: [regression] virtio net locks up
  2012-08-17 23:25             ` Bernd Schubert
@ 2012-08-19  9:01               ` Michael S. Tsirkin
  0 siblings, 0 replies; 13+ messages in thread
From: Michael S. Tsirkin @ 2012-08-19  9:01 UTC (permalink / raw)
  To: Bernd Schubert; +Cc: kvm, stefanha

On Sat, Aug 18, 2012 at 01:25:05AM +0200, Bernd Schubert wrote:
> On 08/12/2012 01:45 PM, Michael S. Tsirkin wrote:
> > On Mon, Jul 30, 2012 at 08:08:31PM +0200, Bernd Schubert wrote:
> >> On 07/30/2012 07:33 PM, Bernd Schubert wrote:
> >>> Hello Stefan,
> >>>
> >>> Stefan Hajnoczi <stefanha <at> gmail.com> writes:
> >>>>
> >>>> On Wed, Jan 11, 2012 at 4:18 PM, Bernd Schubert
> >>>> <bernd.schubert <at> itwm.fraunhofer.de> wrote:
> >>>>> On 01/11/2012 05:04 PM, Stefan Hajnoczi wrote:
> >>>>>> Try pinging the host's IP address from inside the guest.  Run tcpdump
> >>>>>> on the guest's tap interface from the host and observe whether or not
> >>>>>> you see any packets being sent from the guest.
> >>>>>
> >>>
> >>>
> >>> sorry for my terribly late reply. As usual I got distracted by too many other
> >>> things and then returned the hardware I was running the VMs on. My new desktop
> >>> system is better suited to running kvm, and I can now easily reproduce it with 3.5
> >>> on both the host and guest side. So it's not fixed in recent versions yet.
> >>>
> >>>
> >>>>>
> >>>>> Seems arp requests are still going out, but then don't go in:
> >>>>>
> >>>>> 17:16:21.202547 ARP, Reply 192.168.123.1 is-at 00:25:90:38:09:cd (oui Unknown), length 28
> >>>>> 17:16:21.538724 ARP, Request who-has squeeze1 tell squeeze3, length 28
> >>>>> 17:16:21.539026 ARP, Reply squeeze1 is-at 52:54:00:12:34:11 (oui Unknown), length 28
> >>>>> 17:16:22.200912 ARP, Request who-has 192.168.123.1 tell squeeze3, length 28
> >>>>
> >>>> Okay, so it seems networking from the tap device and beyond is fine.
> >>>>
> >>>>>> rmmod virtio_net inside the guest and then modprobe virtio_net again.
> >>>>>> See if network connectivity is restored (remember to rerun DHCP or
> >>>>>> whatever, if necessary).
> >>>>>
> >>>>>
> >>>>> Yep, that makes it work again. But probably is not the real solution ;)
> >>>>
> >>>> It's just another piece of information which helps debug this :).  At
> >>>> least nothing has wedged itself into an unrecoverable state.
> >>>>
> >>>> When you said the problem happens without vhost, did you explicitly
> >>>> run vhost=off?  Or did you just omit "vhost=on"?
> >>>
> >>> It was definitely off and I can confirm that it also locks up with vhost=on and
> >>> vhost=off with 3.5.
> >>>
> >>>>
> >>>> This sounds like a guest kernel/driver issue.  I recommend testing
> >>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git in
> >>>> the guest to see if this has already been fixed.
> >>>>
> >>>> If you have the -dbg RPMs installed it may be possible to insert a
> >>>> probe into the virtio_net kernel module and observe receive
> >>>> interrupts.  This does require the right kernel CONFIG_ but you might
> >>>> already have it enabled:
> >>>>
> >>>> $ sudo perf probe --add skb_recv_done
> >>>> $ sudo perf record -e probe:skb_recv_done -a
> >>>> ...send some packets to the guest...
> >>>> ^C
> >>>> $ sudo perf script
> >>>>
> >>>> If you see no skb_recv_done events then the guest driver is not
> >>>> receiving a notification when packets are received.
> >>>>
> >>>> You can find more about how to use perf-probe(1) at
> >>>> http://blog.vmsplice.net/2011/03/how-to-use-perf-probe.html.
> >>>
> >>> Ah nice, I would have used systemtap, but always wanted to check how to do it
> >>> with perf :)
> >>>
> >>> So once the virtio NIC has locked up, I don't get any events from it anymore -
> >>> until I remove/re-insert the virtio module (including ifup/ifdown). I will try
> >>> to find some time later on this week to look into it again.
> >>> Any further ideas how to proceed (I haven't even checked yet how virtio works at
> >>> all...).
> >>
> >>
> >> I took a quick glance where skb_recv_done is registered at all and
> >> traced it back to vp_find_vqs(). Looking into that function I
> >> noticed MSI and so tried to boot with pci=nomsi. And indeed I
> >> guessed it right, with pci=nomsi I don't get any lockups anymore.
> >> Am I the only one usually booting qemu-kvm with MSI enabled?
> >>
> >> Cheers,
> >> Bernd
> > 
> > No :)
> > 
> > I am guessing it has to do with OOM handling in the guest -
> > it is tested very little but maybe your guest is such that atomic
> > pool gets exhausted for some reason.
> > Could you pls check whether refill_work runs by tracing it?
> > This is our OOM handler.
> > 
> > 
> 
> Just checked it, it does not show up in perf script output.
> 
> 
> Cheers,
> Bernd


When running with vhost-net on, if you enable
DEBUG in the *host* kernel build (or set CONFIG_DYNAMIC_DEBUG
and enable messages for the vhost_net module),
pr_debug will output some debug messages if a guest
bug is detected.
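Below is a minimal sketch of the CONFIG_DYNAMIC_DEBUG route. It assumes the host kernel has CONFIG_DYNAMIC_DEBUG=y and debugfs mounted at /sys/kernel/debug. The `module vhost_net +p` query is the standard dynamic debug control syntax. The function names and the optional control-file argument are hypothetical conveniences.

```shell
#!/usr/bin/env bash
# Enable pr_debug() output for the vhost_net module via dynamic debug.
# Assumptions: host kernel built with CONFIG_DYNAMIC_DEBUG=y and debugfs mounted
# at /sys/kernel/debug (e.g. "mount -t debugfs none /sys/kernel/debug").
# The control-file path is a parameter so the functions can be pointed elsewhere.
DDEBUG_CONTROL_DEFAULT=/sys/kernel/debug/dynamic_debug/control

enable_vhost_net_debug() {
    local ctl="${1:-$DDEBUG_CONTROL_DEFAULT}"
    # "+p" enables printing for every pr_debug() call site in the module.
    echo 'module vhost_net +p' > "$ctl"
}

# Count vhost_net call sites currently enabled. The control file lists roughly
# one site per line, "file:line [module]function =flags \"format\"",
# with "p" among the flags when printing is enabled.
count_enabled_vhost_net_sites() {
    local ctl="${1:-$DDEBUG_CONTROL_DEFAULT}"
    grep -c -E '\[vhost_net\][^ ]* =[a-z_]*p' "$ctl" || true
}
```

Run this as root on the host, for example from `sudo bash`. `sudo echo ... > file` would not work, because the redirection would still run as the unprivileged user. After enabling, reproduce the lockup and check the kernel log with dmesg.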

Can you try this?

-- 
MST


end of thread, other threads:[~2012-08-19  9:00 UTC | newest]

Thread overview: 13+ messages
2012-01-11 15:24 [regression] virtio net locks up Bernd Schubert
2012-01-11 15:39 ` Bernd Schubert
2012-01-11 16:04 ` Stefan Hajnoczi
2012-01-11 16:18   ` Bernd Schubert
2012-01-11 17:09     ` Stefan Hajnoczi
2012-07-30 17:33       ` Bernd Schubert
2012-07-30 18:08         ` Bernd Schubert
2012-07-31 10:23           ` Stefan Hajnoczi
2012-08-01 17:05             ` Bernd Schubert
2012-08-02 10:59               ` Stefan Hajnoczi
2012-08-12 11:45           ` Michael S. Tsirkin
2012-08-17 23:25             ` Bernd Schubert
2012-08-19  9:01               ` Michael S. Tsirkin
