Subject: [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny
Date: 2014-05-09 07:03 UTC
To: xen-devel
Cc: Mariusz Mazur

Hi,

[third attempt to send this to the xen-devel list]

To prevent a single domU from saturating our bandwidth we use the
'rate=...' option in the VIF configuration of our domains' xl.cfg files.
This used to work without serious issues, but recently, after upgrading
to Xen 4.4.0, something went wrong.

When one domU on our semi-production system generated excessive network
traffic, the network stopped for all other domUs on that host. dom0 was
still reachable and responsive, even over the network, but no domU other
than the one generating the traffic could even be pinged. Strange, but
reproducible on that system with that exact workload; I could not
continue testing there to prepare a proper bug report, though.

When I started reproducing the problem on our test host the problems
were even worse. The network would completely stop in dom0 too, pulling
in other problems, including fencing by a remote host (this host was part
of a cluster). I managed to get all the complicating factors out of the
way (the cluster and all other domUs), and this is what I got:


Xen version: 4.4.0 plus the following patches from the Xen Git tree:

babcef x86: enforce preemption in HVM_set_mem_access / p2m_set_mem_access()
3a148e x86: call pit_init for pvh also
1e83fa x86/pvh: disallow PHYSDEVOP_pirq_eoi_gmfn_v2/v1

dom0 Linux kernel: 3.13.6
domU Linux kernel: 3.14.0


domU xl config:

memory = 256
vcpus = 1
name = "ratelimittest1"
vif = [ 'mac=02:00:0f:ff:00:1f, bridge=xenbr0, rate=128Kb/s']
#vif = [ 'mac=02:00:0f:ff:00:1f, bridge=xenbr0']
disk = [ 'phy:/dev/vg/ratetest1,hda,w' ]
bootloader = 'pygrub'
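
A side note, in case it matters for reproducing this: according to the xl
network configuration documentation, the rate setting also accepts an
explicit credit replenishment interval as 'RATE@INTERVAL' (50ms being the
default window). We only use the plain form above; the line below is just
an illustration of that syntax, not a configuration we have tested:

vif = [ 'mac=02:00:0f:ff:00:1f, bridge=xenbr0, rate=128Kb/s@20ms' ]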


When the 'rate=128Kb/s' option is there, the VM boots and its network
works properly. I can stress it with:

ping -q -f 10.28.45.27

(10.28.45.27 is the IP address of dom0 on the bridge the VIF is
connected to.)

It brings the network bandwidth usage close to the 128Kb/s limit, but
doesn't affect any other domains in any significant way.

When I increase the packet size with:

ping -q -f 10.28.45.27 -s 1000

things start to go wrong:

Network in dom0 stops.

'top' in dom0 shows 100% CPU#0 usage in 'software interrupts':

%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,100.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

'xl top' shows that dom0 uses ~102% CPU, and the domU ~1.7% CPU.

'perf top' shows this activity:
 48.23%  [kernel]                        [k] xen_hypercall_sched_op
 19.35%  [kernel]                        [k] xenvif_tx_build_gops
  6.70%  [kernel]                        [k] xen_restore_fl_direct
  5.99%  [kernel]                        [k] xen_hypercall_xen_version

dom0 is still perfectly usable through the serial console (the other CPU
core pinned to dom0 is mostly idle), but the network interfaces are
unusable.

The ping traffic keeps flowing, at a rate a bit over the configured
128Kbit/s.

Problems stop when I stop the 'ping' in the domU.

The higher the limit, the harder it is to trigger the problem (sometimes
I needed two parallel 'ping -f' instances to trigger it with
rate=1024Kb/s), and the lock-up seems a bit milder (a huge network lag
instead of the network not working at all).
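
A rough back-of-the-envelope calculation (assuming netback's default 50ms
credit window, which I have not verified for this setup) seems consistent
with that:

  128 Kb/s  = ~16000 bytes/s  -> ~800 bytes of credit per 50ms window
  1024 Kb/s = ~128000 bytes/s -> ~6400 bytes of credit per 50ms window

A 'ping -s 1000' frame is roughly 1042 bytes on the wire (1000 bytes of
payload plus ICMP/IP/Ethernet headers), so at 128Kb/s a single packet
exceeds one window's worth of credit, while at 1024Kb/s it fits easily
and only bursts push over the limit.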

There is no such problem when there is no 'rate=128Kb/s' in the VIF
configuration. Without the rate limit the same ping flood reaches over
10Mbit/s in each direction without affecting dom0 in a significant way.

dom0 'top' then shows:

%Cpu0  :  0.0 us,  0.7 sy,  0.0 ni, 59.3 id,  0.0 wa,  0.0 hi, 36.7 si,  3.3 st
%Cpu1  :  0.0 us,  3.4 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st

'perf top':

 76.05%  [kernel]                        [k] xen_hypercall_sched_op
  5.14%  [kernel]                        [k] xen_hypercall_xen_version
  2.35%  [kernel]                        [k] xen_hypercall_grant_table_op
  1.66%  [kernel]                        [k] xen_hypercall_event_channel_op

and 'xl top': dom0 50%, domU 73%

If any more information is needed, please let me know.

Greets
Jacek Konieczny
