[Qemu-devel] How to lock-up your tap-based VM network

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] How to lock-up your tap-based VM network
@ 2010-04-12 16:43 Jan Kiszka
  2010-04-12 20:07 ` Paul Brook
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kiszka @ 2010-04-12 16:43 UTC (permalink / raw)
  To: qemu-devel

Hi,

we found an ugly issue of the (pseudo) flow-control mechanism in
tap-based networks:

In recent Linux kernels (>= 2.6.30), the tun driver does TX queue length
accounting and stops sending packets if any local receiver does not
return enough of them. This aims at throttling the TX side when the RX
side is temporarily not able to run (e.g. because of CPU
overcommitment). Before that, there was the risk of dropping packets in
this scenario. Unfortunately this approach is fragile and even
counterproductive in some scenarios.

It is fragile as accounting is done based on skb->truesize on sender
side while its purely packet counting on the receiver side.
net/tap-linux.c claimes:

> /* sndbuf should be set to a value lower than the tx queue
>  * capacity of any destination network interface.
>  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
>  * a good default, given a 1500 byte MTU.
>  */
> #define TAP_DEFAULT_SNDBUF 1024*1024

This works for maximum-sized packets, but fails for minimum-sized ones.

But things get worse: Consider a local bridge with two VMs attached via
taps, and maybe a third interface used to connect to the world. If one
VM decides to shutdown its interface, it will queue packets directed to
it or sent as multicast to the bridge - 500 by default until it overruns
and finally starts dropping. If most of those packets came from the
other VM, that one will ran out of resources before that point! Simple
test: ifdown on the one side, ping -b -s 1472 on the other, and you will
lock out the second VM. This has happened in the field, creating some
unhappy customer. I see the point in avoiding packet drops, but this can
only work as best effort and must not cause such deadlocks.

A major reason for this deadlock could likely be removed by shutting
down the tap (if peered) or dropping packets in user space (in case of
vlan) when a NIC is stopped or otherwise shut down. Currently most (if
not all) NIC models seem to signal both "queue full" and "RX disabled"
via !can_receive(). This should be changed, probably by returning a
reason for "can't receive" so that the network layer can decide what to do.

Opinions? Better suggestions?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 16:43 [Qemu-devel] How to lock-up your tap-based VM network Jan Kiszka
@ 2010-04-12 20:07 ` Paul Brook
  2010-04-12 21:49   ` Jamie Lokier
                     ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Paul Brook @ 2010-04-12 20:07 UTC (permalink / raw)
  To: qemu-devel; +Cc: Jan Kiszka

> A major reason for this deadlock could likely be removed by shutting
> down the tap (if peered) or dropping packets in user space (in case of
> vlan) when a NIC is stopped or otherwise shut down. Currently most (if
> not all) NIC models seem to signal both "queue full" and "RX disabled"
> via !can_receive().

No. A disabled device should return true from can_recieve, then discard the 
packets in its receive callback. Failure to do so is a bug in the device. It 
looks like the virtio-net device may be buggy.

Paul.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 20:07 ` Paul Brook
@ 2010-04-12 21:49   ` Jamie Lokier
  2010-04-12 23:20     ` Paul Brook
  2010-04-13 12:22     ` Jan Kiszka
  2010-04-13 12:19   ` Jan Kiszka
  2010-04-13 18:48   ` Blue Swirl
  2 siblings, 2 replies; 12+ messages in thread
From: Jamie Lokier @ 2010-04-12 21:49 UTC (permalink / raw)
  To: Paul Brook; +Cc: Jan Kiszka, qemu-devel

Paul Brook wrote:
> > A major reason for this deadlock could likely be removed by shutting
> > down the tap (if peered) or dropping packets in user space (in case of
> > vlan) when a NIC is stopped or otherwise shut down. Currently most (if
> > not all) NIC models seem to signal both "queue full" and "RX disabled"
> > via !can_receive().
> 
> No. A disabled device should return true from can_recieve, then discard the 
> packets in its receive callback. Failure to do so is a bug in the device. It 
> looks like the virtio-net device may be buggy.

I agree - or alternatively signal that there's no point sending it
packets and they should be dropped without bothering to construct them.

But anyway, this flow control mechanism is buggy - what if instead of
an interface down, you just have a *slow* guest?  That should not push
back so much that it makes other guests networking with each other
slow down.

-- Jamie

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 21:49   ` Jamie Lokier
@ 2010-04-12 23:20     ` Paul Brook
  2010-04-13 12:30       ` Jan Kiszka
  2010-04-13 12:22     ` Jan Kiszka
  1 sibling, 1 reply; 12+ messages in thread
From: Paul Brook @ 2010-04-12 23:20 UTC (permalink / raw)
  To: qemu-devel; +Cc: Jan Kiszka

> But anyway, this flow control mechanism is buggy - what if instead of
> an interface down, you just have a *slow* guest?  That should not push
> back so much that it makes other guests networking with each other
> slow down.

The OP described multiple guests connected via a host bridge. In this case it 
is entirely the host's responsibility to arbitrate between multiple guests. If 
one interface can block the bridge simply by failing to respond in a timely 
manner then this is a serious bug or misconfiguration of your host bridge.

The reason tap_send exists is because slirp does not implement TCP flow 
control properly.  Instead it relies on the can_send hook to only avoid 
dropped packets.

Using this in the tap code is debatable. My guess is that this is a hack to 
workaround guest network stacks that expect throughput to be limited by line 
speed (which is a virtual environment is effectively infinite), have poor 
higher level flow control, and react badly to packet loss.

Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 23:20     ` Paul Brook
@ 2010-04-13 12:30       ` Jan Kiszka
  2010-04-13 13:02         ` Paul Brook
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Kiszka @ 2010-04-13 12:30 UTC (permalink / raw)
  To: Paul Brook; +Cc: Mark McLoughlin, qemu-devel@nongnu.org, Herbert Xu

Paul Brook wrote:
>> But anyway, this flow control mechanism is buggy - what if instead of
>> an interface down, you just have a *slow* guest?  That should not push
>> back so much that it makes other guests networking with each other
>> slow down.
> 
> The OP described multiple guests connected via a host bridge. In this case it 
> is entirely the host's responsibility to arbitrate between multiple guests. If 
> one interface can block the bridge simply by failing to respond in a timely 
> manner then this is a serious bug or misconfiguration of your host bridge.

It's not the bridge, I think it's limited to the tun driver as bridge
participant and how it is configured/used by QEMU.

> 
> The reason tap_send exists is because slirp does not implement TCP flow 
> control properly.  Instead it relies on the can_send hook to only avoid 
> dropped packets.
> 
> Using this in the tap code is debatable. My guess is that this is a hack to 
> workaround guest network stacks that expect throughput to be limited by line 
> speed (which is a virtual environment is effectively infinite), have poor 
> higher level flow control, and react badly to packet loss.

[ CC'ing some people who should have been involved in establishing the
TX mitigation for tap devices. Full thread under
http://thread.gmane.org/gmane.comp.emulators.qemu/67809 ]

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-13 12:30       ` Jan Kiszka
@ 2010-04-13 13:02         ` Paul Brook
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Brook @ 2010-04-13 13:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: Jan Kiszka, Herbert Xu, Mark McLoughlin

> Paul Brook wrote:
> >> But anyway, this flow control mechanism is buggy - what if instead of
> >> an interface down, you just have a *slow* guest?  That should not push
> >> back so much that it makes other guests networking with each other
> >> slow down.
> >
> > The OP described multiple guests connected via a host bridge. In this
> > case it is entirely the host's responsibility to arbitrate between
> > multiple guests. If one interface can block the bridge simply by failing
> > to respond in a timely manner then this is a serious bug or
> > misconfiguration of your host bridge.
> 
> It's not the bridge, I think it's limited to the tun driver as bridge
> participant and how it is configured/used by QEMU.

If one tap interface can prevent correct operation of another by failing to 
read data, then this is clearly a host kernel bug. 
I've no idea whether it's a bug in the TAP implementation (e.g. shared queue 
between independent devices), a bug in the bridging implementation (drops 
unrelated packets when one port is slow), or some interaction between the two, 
but it's definitely not a qemu bug.    

I'm sure there are plenty of ways to DoS a qemu instance, and prevent it 
reading data from its tap interface.  How/whether this effects other unrelated 
processes on the host machine is not something qemu can or should know about.

Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 21:49   ` Jamie Lokier
  2010-04-12 23:20     ` Paul Brook
@ 2010-04-13 12:22     ` Jan Kiszka
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Kiszka @ 2010-04-13 12:22 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: Paul Brook, qemu-devel@nongnu.org

Jamie Lokier wrote:
> Paul Brook wrote:
>>> A major reason for this deadlock could likely be removed by shutting
>>> down the tap (if peered) or dropping packets in user space (in case of
>>> vlan) when a NIC is stopped or otherwise shut down. Currently most (if
>>> not all) NIC models seem to signal both "queue full" and "RX disabled"
>>> via !can_receive().
>> No. A disabled device should return true from can_recieve, then discard the 
>> packets in its receive callback. Failure to do so is a bug in the device. It 
>> looks like the virtio-net device may be buggy.
> 
> I agree - or alternatively signal that there's no point sending it
> packets and they should be dropped without bothering to construct them.
> 
> But anyway, this flow control mechanism is buggy - what if instead of
> an interface down, you just have a *slow* guest?  That should not push
> back so much that it makes other guests networking with each other
> slow down.

Indeed. So, instead of the current scheme that tries to stop the sender
when some receiver overflows, we must turn it up side down so that this
stop can _never_ happen. We may keep a sufficiently long queue to reduce
the risk of packet drops, but we can't prevent this for all cases anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 20:07 ` Paul Brook
  2010-04-12 21:49   ` Jamie Lokier
@ 2010-04-13 12:19   ` Jan Kiszka
  2010-04-13 13:03     ` Paul Brook
  2010-04-13 18:48   ` Blue Swirl
  2 siblings, 1 reply; 12+ messages in thread
From: Jan Kiszka @ 2010-04-13 12:19 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel@nongnu.org

Paul Brook wrote:
>> A major reason for this deadlock could likely be removed by shutting
>> down the tap (if peered) or dropping packets in user space (in case of
>> vlan) when a NIC is stopped or otherwise shut down. Currently most (if
>> not all) NIC models seem to signal both "queue full" and "RX disabled"
>> via !can_receive().
> 
> No. A disabled device should return true from can_recieve, then discard the 
> packets in its receive callback. Failure to do so is a bug in the device. It 
> looks like the virtio-net device may be buggy.

That's not a virtio-only issue. In fact, we ran into this over pcnet,
and a quick check of other popular PCI NIC models (except for rtl8139)
revealed the same picture: They only report can_receive if their
receiver unit is up and ready (some also include the queue state, but
that's an "add-on").

I think it's clear why: "can_receive" strongly suggests that a suspended
receiver should make the model return false. If we want to keep this
handler, it should be refactored to something like "queue_full".

But before starting any refactoring endeavor: Do we have a consensus on
the direction? Refactor can_receive to queue_full? Or even drop it?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-13 12:19   ` Jan Kiszka
@ 2010-04-13 13:03     ` Paul Brook
  2010-04-13 13:15       ` Jan Kiszka
  0 siblings, 1 reply; 12+ messages in thread
From: Paul Brook @ 2010-04-13 13:03 UTC (permalink / raw)
  To: qemu-devel; +Cc: Jan Kiszka

> Paul Brook wrote:
> >> A major reason for this deadlock could likely be removed by shutting
> >> down the tap (if peered) or dropping packets in user space (in case of
> >> vlan) when a NIC is stopped or otherwise shut down. Currently most (if
> >> not all) NIC models seem to signal both "queue full" and "RX disabled"
> >> via !can_receive().
> >
> > No. A disabled device should return true from can_recieve, then discard
> > the packets in its receive callback. Failure to do so is a bug in the
> > device. It looks like the virtio-net device may be buggy.
> 
> That's not a virtio-only issue. In fact, we ran into this over pcnet,
> and a quick check of other popular PCI NIC models (except for rtl8139)
> revealed the same picture: They only report can_receive if their
> receiver unit is up and ready (some also include the queue state, but
> that's an "add-on").

If so these are also buggy.

> I think it's clear why: "can_receive" strongly suggests that a suspended
> receiver should make the model return false. If we want to keep this
> handler, it should be refactored to something like "queue_full".

I don't see a need to refactor anything.  You just need to fix the devices 
that incorrectly return false when their RX engine is disabled.

Paul

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-13 13:03     ` Paul Brook
@ 2010-04-13 13:15       ` Jan Kiszka
  0 siblings, 0 replies; 12+ messages in thread
From: Jan Kiszka @ 2010-04-13 13:15 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel@nongnu.org

Paul Brook wrote:
>> Paul Brook wrote:
>>>> A major reason for this deadlock could likely be removed by shutting
>>>> down the tap (if peered) or dropping packets in user space (in case of
>>>> vlan) when a NIC is stopped or otherwise shut down. Currently most (if
>>>> not all) NIC models seem to signal both "queue full" and "RX disabled"
>>>> via !can_receive().
>>> No. A disabled device should return true from can_recieve, then discard
>>> the packets in its receive callback. Failure to do so is a bug in the
>>> device. It looks like the virtio-net device may be buggy.
>> That's not a virtio-only issue. In fact, we ran into this over pcnet,
>> and a quick check of other popular PCI NIC models (except for rtl8139)
>> revealed the same picture: They only report can_receive if their
>> receiver unit is up and ready (some also include the queue state, but
>> that's an "add-on").
> 
> If so these are also buggy.
> 
>> I think it's clear why: "can_receive" strongly suggests that a suspended
>> receiver should make the model return false. If we want to keep this
>> handler, it should be refactored to something like "queue_full".
> 
> I don't see a need to refactor anything.  You just need to fix the devices 
> that incorrectly return false when their RX engine is disabled.

"can_receive" is a misleading name. NIC model developers will continue
to make mistakes when implementing this function. And I don't think we
want such implementations all around:

static int rtl8139_can_receive(VLANClientState *nc)
{
    RTL8139State *s = DO_UPCAST(NICState, nc, nc)->opaque;
    int avail;

    /* Receive (drop) packets if card is disabled.  */
    if (!s->clock_enabled)
      return 1;
    if (!rtl8139_receiver_enabled(s))
      return 1;

    if (rtl8139_cp_receiver_enabled(s)) {
        /* ??? Flow control not implemented in c+ mode.
           This is a hack to work around slirp deficiencies anyway.  */
        return 1;
    } else {
        avail = MOD2(s->RxBufferSize + s->RxBufPtr - s->RxBufAddr,
                     s->RxBufferSize);
        return (avail == 0 || avail >= 1514);
    }
}

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-12 20:07 ` Paul Brook
  2010-04-12 21:49   ` Jamie Lokier
  2010-04-13 12:19   ` Jan Kiszka
@ 2010-04-13 18:48   ` Blue Swirl
  2010-04-13 19:13     ` Blue Swirl
  2 siblings, 1 reply; 12+ messages in thread
From: Blue Swirl @ 2010-04-13 18:48 UTC (permalink / raw)
  To: Paul Brook; +Cc: Jan Kiszka, qemu-devel

On 4/12/10, Paul Brook <paul@codesourcery.com> wrote:
> > A major reason for this deadlock could likely be removed by shutting
>  > down the tap (if peered) or dropping packets in user space (in case of
>  > vlan) when a NIC is stopped or otherwise shut down. Currently most (if
>  > not all) NIC models seem to signal both "queue full" and "RX disabled"
>  > via !can_receive().
>
>
> No. A disabled device should return true from can_recieve, then discard the
>  packets in its receive callback. Failure to do so is a bug in the device. It
>  looks like the virtio-net device may be buggy.

Awesome, it looks like a longstanding bug with pcnet/lance has is
fixed by this change! OpenBSD installer would hang when receiving
packages, now it works!

diff --git a/hw/pcnet.c b/hw/pcnet.c
index 5e63eb5..a04ad09 100644
--- a/hw/pcnet.c
+++ b/hw/pcnet.c
@@ -1031,8 +1031,6 @@ static int pcnet_tdte_poll(PCNetState *s)
 int pcnet_can_receive(VLANClientState *nc)
 {
     PCNetState *s = DO_UPCAST(NICState, nc, nc)->opaque;
-    if (CSR_STOP(s) || CSR_SPND(s))
-        return 0;

     return sizeof(s->buffer)-16;
 }

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] How to lock-up your tap-based VM network
  2010-04-13 18:48   ` Blue Swirl
@ 2010-04-13 19:13     ` Blue Swirl
  0 siblings, 0 replies; 12+ messages in thread
From: Blue Swirl @ 2010-04-13 19:13 UTC (permalink / raw)
  To: Paul Brook; +Cc: Jan Kiszka, qemu-devel

On 4/13/10, Blue Swirl <blauwirbel@gmail.com> wrote:
> On 4/12/10, Paul Brook <paul@codesourcery.com> wrote:
>  > > A major reason for this deadlock could likely be removed by shutting
>  >  > down the tap (if peered) or dropping packets in user space (in case of
>  >  > vlan) when a NIC is stopped or otherwise shut down. Currently most (if
>  >  > not all) NIC models seem to signal both "queue full" and "RX disabled"
>  >  > via !can_receive().
>  >
>  >
>  > No. A disabled device should return true from can_recieve, then discard the
>  >  packets in its receive callback. Failure to do so is a bug in the device. It
>  >  looks like the virtio-net device may be buggy.
>
>
> Awesome, it looks like a longstanding bug with pcnet/lance has is
>  fixed by this change! OpenBSD installer would hang when receiving
>  packages, now it works!

I spoke too soon, networking works also without the patch now.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2010-04-13 19:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-04-12 16:43 [Qemu-devel] How to lock-up your tap-based VM network Jan Kiszka
2010-04-12 20:07 ` Paul Brook
2010-04-12 21:49   ` Jamie Lokier
2010-04-12 23:20     ` Paul Brook
2010-04-13 12:30       ` Jan Kiszka
2010-04-13 13:02         ` Paul Brook
2010-04-13 12:22     ` Jan Kiszka
2010-04-13 12:19   ` Jan Kiszka
2010-04-13 13:03     ` Paul Brook
2010-04-13 13:15       ` Jan Kiszka
2010-04-13 18:48   ` Blue Swirl
2010-04-13 19:13     ` Blue Swirl

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).