[Qemu-devel] How to lock-up your tap-based VM network

From: Jan Kiszka <jan.kiszka@siemens.com>
To: qemu-devel <qemu-devel@nongnu.org>
Subject: [Qemu-devel] How to lock-up your tap-based VM network
Date: Mon, 12 Apr 2010 18:43:01 +0200
Message-ID: <4BC34D95.7050804@siemens.com>

Hi,

we found an ugly issue with the (pseudo) flow-control mechanism in
tap-based networks:

In recent Linux kernels (>= 2.6.30), the tun driver does TX queue length
accounting and stops sending packets if any local receiver does not
return enough of them. This aims at throttling the TX side when the RX
side is temporarily not able to run (e.g. because of CPU
overcommitment). Before that, there was the risk of dropping packets in
this scenario. Unfortunately this approach is fragile and even
counterproductive in some scenarios.

It is fragile because accounting is based on skb->truesize on the sender
side, while it is purely packet counting on the receiver side.
net/tap-linux.c claims:

> /* sndbuf should be set to a value lower than the tx queue
>  * capacity of any destination network interface.
>  * Ethernet NICs generally have txqueuelen=1000, so 1Mb is
>  * a good default, given a 1500 byte MTU.
>  */
> #define TAP_DEFAULT_SNDBUF 1024*1024

This works for maximum-sized packets, but fails for minimum-sized ones.

But things get worse: Consider a local bridge with two VMs attached via
taps, and maybe a third interface used to connect to the world. If one
VM decides to shut down its interface, packets directed to it or sent as
multicast to the bridge will queue up on its tap - 500 by default - until
the queue overruns and finally starts dropping. If most of those packets
came from the other VM, that one will run out of resources before that
point! Simple
test: ifdown on the one side, ping -b -s 1472 on the other, and you will
lock out the second VM. This has happened in the field, creating some
unhappy customer. I see the point in avoiding packet drops, but this can
only work as best effort and must not cause such deadlocks.

A major reason for this deadlock could likely be removed by shutting
down the tap (if peered) or dropping packets in user space (in case of
vlan) when a NIC is stopped or otherwise shut down. Currently most (if
not all) NIC models seem to signal both "queue full" and "RX disabled"
via !can_receive(). This should be changed, probably by returning a
reason for "can't receive" so that the network layer can decide what to do.

Opinions? Better suggestions?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
