* [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny @ 2014-05-09  7:03 UTC
To: xen-devel; +Cc: Mariusz Mazur

Hi,

[third attempt to send this to the xen-devel list]

To prevent a single domU from saturating our bandwidth we use the
'rate=...' option in the VIF configuration of our domains' xl.cfg files.
This used to work without serious issues, but recently, after upgrading
to Xen 4.4.0, something went wrong.

When one domU on our semi-production system generated excessive network
traffic, the network stopped for all other domUs on that host. dom0 was
still reachable and responsive, even over the network, but no domU other
than the one generating the traffic could even be pinged. Strange, but
reproducible on this system with this exact workload; I could not
continue testing there to prepare a proper bug report, though.

When I started reproducing the problem on our test host, the problems
were even worse: the network would completely stop in dom0 too, pulling
in other problems, including fencing by a remote host (this machine was
part of a cluster). I managed to get all the complicating factors out of
the way (the cluster and all other domUs), and this is what I got:

Xen version: 4.4.0 plus the following patches from Git:

  babcef x86: enforce preemption in HVM_set_mem_access / p2m_set_mem_access()
  3a148e x86: call pit_init for pvh also
  1e83fa x86/pvh: disallow PHYSDEVOP_pirq_eoi_gmfn_v2/v1

dom0 Linux kernel: 3.13.6
domU Linux kernel: 3.14.0

domU xl config:

  memory = 256
  vcpus = 1
  name = "ratelimittest1"
  vif = [ 'mac=02:00:0f:ff:00:1f, bridge=xenbr0, rate=128Kb/s' ]
  #vif = [ 'mac=02:00:0f:ff:00:1f, bridge=xenbr0' ]
  disk = [ 'phy:/dev/vg/ratetest1,hda,w' ]
  bootloader = 'pygrub'

When the 'rate=128Kb/s' option is there, the VM boots properly and its
network works properly. I can stress it with:

  ping -q -f 10.28.45.27

(10.28.45.27 is the IP address of dom0 on the bridge the VIF is
connected to.) This brings the network bandwidth usage close to the
128Kb/s limit, but doesn't affect any other domain in any significant
way. When I increase the packet size with:

  ping -q -f 10.28.45.27 -s 1000

things start to go wrong: the network in dom0 stops. 'top' in dom0 shows
100% CPU#0 usage in software interrupts:

  %Cpu0 :  0.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 100.0 si,  0.0 st
  %Cpu1 :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,   0.0 si,  0.0 st

'xl top' shows that dom0 uses ~102% CPU and the domU ~1.7% CPU.
'perf top' shows this activity:

  48.23%  [kernel]  [k] xen_hypercall_sched_op
  19.35%  [kernel]  [k] xenvif_tx_build_gops
   6.70%  [kernel]  [k] xen_restore_fl_direct
   5.99%  [kernel]  [k] xen_hypercall_xen_version

dom0 is still perfectly usable through the serial console (the other CPU
core pinned to dom0 is mostly idle), but the network interfaces are
unusable. The ping traffic keeps flowing, at a rate slightly above the
configured 128Kbit/s. The problems stop when I stop the 'ping' in the
domU.

The higher the limit, the harder the problem is to trigger (sometimes I
needed two parallel 'ping -f' runs to trigger it with rate=1024Kb/s),
and the lock-up seems a bit milder (huge network lag instead of the
network not working at all).

There is no such problem without 'rate=128Kb/s' in the VIF
configuration. Without the rate limit, the same ping flood reaches over
10Mbit/s in each direction without affecting dom0 in a significant way.
dom0 'top' then shows:

  %Cpu0 :  0.0 us,  0.7 sy,  0.0 ni, 59.3 id,  0.0 wa,  0.0 hi, 36.7 si,  3.3 st
  %Cpu1 :  0.0 us,  3.4 sy,  0.0 ni, 95.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.7 st

'perf top':

  76.05%  [kernel]  [k] xen_hypercall_sched_op
   5.14%  [kernel]  [k] xen_hypercall_xen_version
   2.35%  [kernel]  [k] xen_hypercall_grant_table_op
   1.66%  [kernel]  [k] xen_hypercall_event_channel_op

and 'xl top': dom0 ~50%, domU ~73%.

If any more information is needed, please let me know.

Greets,
        Jacek Konieczny
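For context on the numbers in this report: the toolstack turns a
'rate=128Kb/s' string into a bytes-per-window budget (the rate =
"800,50000" xenstore value quoted later in this thread). The following
stand-alone C sketch shows that arithmetic only -- it is an
illustration, not the actual libxl parser, and it assumes the default
50 ms replenish interval:

  /* Illustrative arithmetic only -- not the libxl parser. */
  #include <stdio.h>
  #include <stdint.h>

  int main(void)
  {
      uint64_t rate_bits_per_sec = 128 * 1000;  /* "rate=128Kb/s", K = 1000 */
      uint64_t interval_usec     = 50000;       /* assumed 50 ms window     */

      uint64_t bytes_per_sec    = rate_bits_per_sec / 8;
      uint64_t bytes_per_window = bytes_per_sec * interval_usec / 1000000;

      /* Prints: rate = "800,50000" -- the value seen in xenstore below. */
      printf("rate = \"%llu,%llu\"\n",
             (unsigned long long)bytes_per_window,
             (unsigned long long)interval_usec);
      return 0;
  }

Note that an 800-byte budget per 50 ms window is smaller than a single
'ping -s 1000' packet, which lines up with the observation above that
only the larger packets trigger the lock-up.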
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Ian Campbell @ 2014-05-09  9:18 UTC
To: Jacek Konieczny; +Cc: Mariusz Mazur, xen-devel

On Fri, 2014-05-09 at 09:03 +0200, Jacek Konieczny wrote:
> Hi,
>
> [third attempt to send this to the xen-devel list]
>
> To prevent a single domU from saturating our bandwidth we use the
> 'rate=...' option in the VIF configuration of our domains' xl.cfg files.
> This used to work without serious issues, but recently, after upgrading
> to Xen 4.4.0, something went wrong.

Just to be clear, you changed Xen but not the dom0 or domU kernel, is
that right? Which version of Xen were you using before?

Did you also change the toolstack (e.g. from xm to xl) over the upgrade?

The main (only, I think) effect of rate= from the toolstack's point of
view is to write some additional keys to the vif backend directory; you
should be able to see them in "xenstore-ls -fp" output. Do they perhaps
differ between the working and non-working case (despite the input
configuration being the same)?

Those keys then affect netback's behaviour, which is why I am interested
in whether the kernel version has changed.

Ian.
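As background to the netback behaviour mentioned above: netback consumes
those keys as a credit budget -- each window grants a fixed number of
bytes, and packets that do not fit must wait for the window to be
replenished. The following stand-alone C model shows only that
replenish-then-check shape; all names are hypothetical, and the real
code in drivers/net/xen-netback additionally handles burst accumulation
and arms a timer when credit runs out:

  /* Simplified model of a netback-style credit limiter (hypothetical
   * names, not kernel code). */
  #include <stdio.h>
  #include <stdint.h>
  #include <stdbool.h>

  struct vif_credit {
      uint64_t credit_bytes;  /* bytes granted per window ("800")      */
      uint64_t credit_usec;   /* window length in usec ("50000")       */
      uint64_t remaining;     /* credit left in the current window     */
      uint64_t window_start;  /* time (usec) the current window opened */
  };

  /* May a packet of 'size' bytes go out at time 'now_usec'? */
  static bool credit_allows(struct vif_credit *c, uint64_t now_usec,
                            uint64_t size)
  {
      if (now_usec - c->window_start >= c->credit_usec) {
          c->window_start = now_usec;      /* window elapsed: replenish */
          c->remaining = c->credit_bytes;
      }
      if (size > c->remaining)
          return false;                    /* must wait for more credit */
      c->remaining -= size;
      return true;
  }

  int main(void)
  {
      struct vif_credit c = { 800, 50000, 800, 0 };
      /* In this naive model a packet larger than the whole window
       * budget is deferred forever. */
      printf("1028-byte packet allowed: %d\n", credit_allows(&c, 0, 1028));
      printf(" 100-byte packet allowed: %d\n", credit_allows(&c, 0, 100));
      return 0;
  }

If the exhausted-credit path in the real driver were re-polled
immediately rather than deferred until the next window, dom0 would spin
in softirq context -- which would be consistent with the 100% si and the
xenvif_tx_build_gops samples in the report, though the thread itself
does not establish the root cause.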
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny @ 2014-05-09 10:25 UTC
To: Ian Campbell; +Cc: Mariusz Mazur, xen-devel

On 05/09/14 11:18, Ian Campbell wrote:
> Just to be clear, you changed Xen but not the dom0 or domU kernel, is
> that right?

Unfortunately not, I upgraded all of them.

> Which version of Xen were you using before?

4.3.1

> Did you also change the toolstack (e.g. from xm to xl) over the upgrade?

No, I have been using 'xl' for a long time now.

> The main (only, I think) effect of rate= from the toolstack's point of
> view is to write some additional keys to the vif backend directory; you
> should be able to see them in "xenstore-ls -fp" output.

I could have guessed that.

> Do they perhaps differ between the working and non-working case
> (despite the input configuration being the same)?

I will check that. I think this can be safely done even on a production
server still running the old Xen and kernel.

> Those keys then affect netback's behaviour, which is why I am interested
> in whether the kernel version has changed.

The kernel version has changed too, so I should focus on that instead of
the Xen version change. I will try other kernels and get more
information.

The other thing is that I am not 100% sure my earlier setups were not
buggy – it might be that I just never hit the limit for long enough to
notice.

Greets,
        Jacek
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Ian Campbell @ 2014-05-09 10:32 UTC
To: Jacek Konieczny; +Cc: Mariusz Mazur, xen-devel

On Fri, 2014-05-09 at 12:25 +0200, Jacek Konieczny wrote:
>> Do they perhaps differ between the working and non-working case
>> (despite the input configuration being the same)?
>
> I will check that. I think this can be safely done even on a production
> server still running the old Xen and kernel.

Just to be clear, I meant working with rate= (on the old setup) and not
working with rate= (on the new setup). Working without rate= won't tell
us much, since those keys simply won't be present.

>> Those keys then affect netback's behaviour, which is why I am interested
>> in whether the kernel version has changed.
>
> The kernel version has changed too, so I should focus on that instead of
> the Xen version change.

I think that, once you have confirmed the xenstore keys are unchanged,
the kernel side should be the focus.

> I will try other kernels and get more information.

Thanks.

> The other thing is that I am not 100% sure my earlier setups were not
> buggy – it might be that I just never hit the limit for long enough to
> notice.

Ian
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny @ 2014-05-09 11:44 UTC
To: Ian Campbell; +Cc: Mariusz Mazur, xen-devel

On 05/09/14 12:32, Ian Campbell wrote:
> Just to be clear, I meant working with rate= (on the old setup) and not
> working with rate= (on the new setup). Working without rate= won't tell
> us much, since those keys simply won't be present.

Yes, I understood that. I used the same xl configuration file on two
hosts.

Xen 4.4.0, Linux 3.13.6 (the non-working setup):

/local/domain/0/backend/vif/24/0/frontend = "/local/domain/24/device/vif/0" (n0,r24)
/local/domain/0/backend/vif/24/0/frontend-id = "24" (n0,r24)
/local/domain/0/backend/vif/24/0/online = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/state = "4" (n0,r24)
/local/domain/0/backend/vif/24/0/script = "/etc/xen/scripts/vif-bridge" (n0,r24)
/local/domain/0/backend/vif/24/0/mac = "02:00:0f:ff:00:1f" (n0,r24)
/local/domain/0/backend/vif/24/0/rate = "800,50000" (n0,r24)
/local/domain/0/backend/vif/24/0/bridge = "xenbr0" (n0,r24)
/local/domain/0/backend/vif/24/0/handle = "0" (n0,r24)
/local/domain/0/backend/vif/24/0/type = "vif" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-sg = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-gso-tcpv4 = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-gso-tcpv6 = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-ipv6-csum-offload = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-rx-copy = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-rx-flip = "0" (n0,r24)
/local/domain/0/backend/vif/24/0/feature-split-event-channels = "1" (n0,r24)
/local/domain/0/backend/vif/24/0/hotplug-status = "connected" (n0,r24)

Xen 4.2.1, kernel 3.7.1 (the old working setup; I don't have a 4.3 host
with a newer kernel handy):

/local/domain/0/backend/vif/20/0/frontend = "/local/domain/20/device/vif/0" (n0,r20)
/local/domain/0/backend/vif/20/0/frontend-id = "20" (n0,r20)
/local/domain/0/backend/vif/20/0/online = "1" (n0,r20)
/local/domain/0/backend/vif/20/0/state = "4" (n0,r20)
/local/domain/0/backend/vif/20/0/script = "/etc/xen/scripts/vif-bridge" (n0,r20)
/local/domain/0/backend/vif/20/0/mac = "02:00:0d:ff:00:1f" (n0,r20)
/local/domain/0/backend/vif/20/0/rate = "800,50000" (n0,r20)
/local/domain/0/backend/vif/20/0/bridge = "br1" (n0,r20)
/local/domain/0/backend/vif/20/0/handle = "0" (n0,r20)
/local/domain/0/backend/vif/20/0/type = "vif" (n0,r20)
/local/domain/0/backend/vif/20/0/feature-sg = "1" (n0,r20)
/local/domain/0/backend/vif/20/0/feature-gso-tcpv4 = "1" (n0,r20)
/local/domain/0/backend/vif/20/0/feature-rx-copy = "1" (n0,r20)
/local/domain/0/backend/vif/20/0/feature-rx-flip = "0" (n0,r20)
/local/domain/0/backend/vif/20/0/hotplug-status = "connected" (n0,r20)

No change in the 'rate' value here, but the 'feature-*' keys differ.

> I think that, once you have confirmed the xenstore keys are unchanged,
> the kernel side should be the focus.

I have now booted the Xen 4.4.0 host with an older kernel, 3.7.10, and
the system does not lock up any more. Xenstore variables for the vif
frontend:

/local/domain/1/device/vif/0/backend = "/local/domain/0/backend/vif/1/0" (n1,r0)
/local/domain/1/device/vif/0/backend-id = "0" (n1,r0)
/local/domain/1/device/vif/0/state = "4" (n1,r0)
/local/domain/1/device/vif/0/handle = "0" (n1,r0)
/local/domain/1/device/vif/0/mac = "02:00:0f:ff:00:1f" (n1,r0)
/local/domain/1/device/vif/0/tx-ring-ref = "9" (n1,r0)
/local/domain/1/device/vif/0/rx-ring-ref = "768" (n1,r0)
/local/domain/1/device/vif/0/event-channel = "11" (n1,r0)
/local/domain/1/device/vif/0/request-rx-copy = "1" (n1,r0)
/local/domain/1/device/vif/0/feature-rx-notify = "1" (n1,r0)
/local/domain/1/device/vif/0/feature-sg = "1" (n1,r0)
/local/domain/1/device/vif/0/feature-gso-tcpv4 = "1" (n1,r0)
/local/domain/1/device/vif/0/feature-gso-tcpv6 = "1" (n1,r0)
/local/domain/1/device/vif/0/feature-ipv6-csum-offload = "1" (n1,r0)

Is it possible that one of the features introduced by the 3.13 kernel
is faulty (e.g. 'feature-split-event-channels')? Is there a way to
selectively enable/disable those features without changing the kernel?

I will also try the 3.14.3 kernel, but I need to prepare it first.

Greets,
        Jacek
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Ian Campbell @ 2014-05-09 11:55 UTC
To: Jacek Konieczny; +Cc: Wei Liu, Mariusz Mazur, xen-devel

On Fri, 2014-05-09 at 13:44 +0200, Jacek Konieczny wrote:
> No change in the 'rate' value here,

OK, that rules that out. Good.

> but the 'feature-*' keys differ.

Indeed. It looks like gso-tcpv6, ipv6-csum-offload and split event
channels are new. The IPv6 ones seem unlikely to be related to your
issues, since you appear to be using v4. split-event-channels could, I
suppose, be involved somehow.

[...]

> I have now booted the Xen 4.4.0 host with an older kernel, 3.7.10, and
> the system does not lock up any more.

OK, so I think that makes it pretty certainly a kernel issue.

> Xenstore variables for the vif frontend:
>
> /local/domain/1/device/vif/0/backend = "/local/domain/0/backend/vif/1/0" (n1,r0)
> /local/domain/1/device/vif/0/backend-id = "0" (n1,r0)
> /local/domain/1/device/vif/0/state = "4" (n1,r0)
> /local/domain/1/device/vif/0/handle = "0" (n1,r0)
> /local/domain/1/device/vif/0/mac = "02:00:0f:ff:00:1f" (n1,r0)
> /local/domain/1/device/vif/0/tx-ring-ref = "9" (n1,r0)
> /local/domain/1/device/vif/0/rx-ring-ref = "768" (n1,r0)
> /local/domain/1/device/vif/0/event-channel = "11" (n1,r0)
> /local/domain/1/device/vif/0/request-rx-copy = "1" (n1,r0)
> /local/domain/1/device/vif/0/feature-rx-notify = "1" (n1,r0)
> /local/domain/1/device/vif/0/feature-sg = "1" (n1,r0)
> /local/domain/1/device/vif/0/feature-gso-tcpv4 = "1" (n1,r0)
> /local/domain/1/device/vif/0/feature-gso-tcpv6 = "1" (n1,r0)
> /local/domain/1/device/vif/0/feature-ipv6-csum-offload = "1" (n1,r0)

Interestingly, a different set of features to the 3.7.1 case. I am
guessing that the guest kernel differs on this other system; likely not
worth pursuing that.

> Is it possible that one of the features introduced by the 3.13 kernel
> is faulty (e.g. 'feature-split-event-channels')?

The guilty change may or may not be related to the new features, but it
could be.

> Is there a way to selectively enable/disable those features without
> changing the kernel?

Unfortunately I don't think so.

It *might* be possible to start the guest paused and then mess with the
feature advertisements in the backend's xenstore directory. Or that
might cause things to explode ;-). It's worth trying -- I think it will
be obvious if it hasn't worked, rather than being a subtle issue which
would invalidate the testing...

> I will also try the 3.14.3 kernel, but I need to prepare it first.

Sounds good.

Is there any chance you could bisect the releases between 3.7 and 3.14
to narrow down the range? I'd probably test the actual v3.X
tags/releases rather than using git bisect at this stage.

Ian.
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny @ 2014-05-09 12:44 UTC
To: Ian Campbell; +Cc: Wei Liu, Mariusz Mazur, xen-devel

On 05/09/14 13:55, Ian Campbell wrote:
>> Is there a way to selectively enable/disable those features without
>> changing the kernel?
>
> Unfortunately I don't think so.
>
> It *might* be possible to start the guest paused and then mess with the
> feature advertisements in the backend's xenstore directory. Or that
> might cause things to explode ;-). It's worth trying -- I think it will
> be obvious if it hasn't worked, rather than being a subtle issue which
> would invalidate the testing...

I have tried that:

  xl create -p ratelimittest1.cfg
  xenstore-rm /local/domain/0/backend/vif/3/0/feature-gso-tcpv6
  xenstore-rm /local/domain/0/backend/vif/3/0/feature-ipv6-csum-offload
  xenstore-rm /local/domain/0/backend/vif/3/0/feature-split-event-channels

It didn't help. Nothing exploded, though.

>> I will also try the 3.14.3 kernel, but I need to prepare it first.
>
> Sounds good.

That didn't help either: the problem still occurs after upgrading dom0
to 3.14.3.

> Is there any chance you could bisect the releases between 3.7 and 3.14
> to narrow down the range? I'd probably test the actual v3.X
> tags/releases rather than using git bisect at this stage.

I am afraid I cannot spend much more time investigating this issue
right now.

Greets,
        Jacek
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Ian Campbell @ 2014-05-09 13:01 UTC
To: Jacek Konieczny; +Cc: Wei Liu, Mariusz Mazur, xen-devel

On Fri, 2014-05-09 at 14:44 +0200, Jacek Konieczny wrote:
> I have tried that:
>
>   xl create -p ratelimittest1.cfg
>   xenstore-rm /local/domain/0/backend/vif/3/0/feature-gso-tcpv6
>   xenstore-rm /local/domain/0/backend/vif/3/0/feature-ipv6-csum-offload
>   xenstore-rm /local/domain/0/backend/vif/3/0/feature-split-event-channels
>
> It didn't help. Nothing exploded, though.

Can you tell whether those features were actually enabled or not
afterwards? ethtool in the guest and on the dom0 vifX.Y device should
have changed, or possibly it would be reflected in dmesg. For split
event channels I think there will be only one evtchn key in xenstore
instead of two.

Ian.
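A scripted way to do the xenstore half of that check is to list the vif
frontend directory and print the event-channel keys: a single
"event-channel" key indicates a unified channel, while separate
"event-channel-tx"/"event-channel-rx" keys indicate split channels. The
sketch below assumes libxenstore is available (link with -lxenstore);
the domid/vif numbers are placeholders for the domain under test:

  /* Sketch: print a vif frontend's event-channel keys via libxenstore. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <xenstore.h>

  int main(void)
  {
      struct xs_handle *xsh = xs_open(0);
      if (!xsh) {
          perror("xs_open");
          return 1;
      }

      /* Assumed path: domid 1, vif 0 -- adjust for the test domain. */
      const char *dir = "/local/domain/1/device/vif/0";
      unsigned int num = 0, i;
      char **ents = xs_directory(xsh, XBT_NULL, dir, &num);

      for (i = 0; ents && i < num; i++)
          if (strstr(ents[i], "event-channel"))
              printf("%s\n", ents[i]);  /* one key => unified channel */

      free(ents);
      xs_close(xsh);
      return 0;
  }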
* Re: [BUG] VIF rate limiting locks up network in the whole system
From: Jacek Konieczny @ 2014-05-09 13:43 UTC
To: Ian Campbell; +Cc: Wei Liu, Mariusz Mazur, xen-devel

On 05/09/14 15:01, Ian Campbell wrote:
> Can you tell whether those features were actually enabled or not
> afterwards? ethtool in the guest and on the dom0 vifX.Y device should
> have changed, or possibly it would be reflected in dmesg.

ethtool in the guest changes when I change the offload settings, though
my ethtool doesn't seem to show the IPv6 offloading settings. In dom0
the 'ethtool -k' output doesn't change, as it seems to be driven by the
guest features, which are written to xenstore only after the domain is
unpaused and its kernel boots.

> For split event channels I think there will be only one evtchn key in
> xenstore instead of two.

Yes, and there is only one 'event-channel' key instead of separate RX
and TX channels. So it seems the 'xenstore-rm' does the trick, at least
partially.

Greets,
        Jacek