From: Ingo Krabbe
Date: Fri, 6 May 2016 06:34:33 +0200
Subject: Re: [Qemu-devel] TCP Segmentation Offloading
To: qemu-devel@nongnu.org
Cc: mst@redhat.com, jasowang@redhat.com
In-Reply-To: <20160505174203.GC14181@stefanha-x1.localdomain>

> On Sun, May 01, 2016 at 02:31:57PM +0200, Ingo Krabbe wrote:
>> Good Mayday Qemu Developers,
>>
>> Today I tried to find a reference to a networking problem that seems
>> to be of a quite general nature: TCP Segmentation Offloading (TSO) in
>> virtual environments.
>>
>> When I set up a TAP network adapter for a virtual machine and put it
>> into a host bridge, the known best practice is to manually set
>> "tso off gso off" with ethtool: for the guest driver if I use a
>> hardware emulation such as e1000, and/or for the host driver and/or
>> the bridge adapter if I use the virtio driver. Otherwise you
>> (sometimes?) experience performance problems or even lost packets.
>
> I can't parse this sentence. In what cases do you think it's a "known
> best practice" to disable tso and gso? Maybe a table would be a clearer
> way to communicate this.
>
> Can you provide a link to the source claiming tso and gso should be
> disabled?

Sorry for that long sentence. The gist is that it seems most stable to
turn off tso and gso both on host bridges and on the adapters inside
the virtual machines.

One of the most comprehensive collections of arguments is this article:

https://kris.io/2015/10/01/kvm-network-performance-tso-and-gso-turn-it-off/

I also found corresponding documentation for CentOS 6:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Host_Configuration_and_Guest_Installation_Guide/ch10s04.html

The topic is discussed in the Ganeti project wiki as well:

https://code.google.com/p/ganeti/wiki/PerformanceTuning

And of course the same advice is given for Xen machines:

http://cloudnull.io/2012/07/xenserver-network-tuning/

As you can see, there are several links on the internet, and my first
question is: why can't I find this discussion in the qemu wiki space?
I think the bug https://bugs.launchpad.net/bugs/1202289 is related.
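Concretely, the advice in those links boils down to something like the
following sketch (br0, tap0 and eth0 are placeholder names for the
actual host bridge, tap device and guest NIC; the feature names printed
by "ethtool -k" vary a bit between ethtool versions):

    # On the host: disable segmentation offloads on the bridge and on
    # the tap device that backs the guest NIC
    ethtool -K br0 tso off gso off
    ethtool -K tap0 tso off gso off

    # Inside the guest, against the emulated (e1000) or virtio NIC
    ethtool -K eth0 tso off gso off

    # Verify: both offloads should now report "off"
    ethtool -k eth0 | grep segmentation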
>> I haven't found a complete analysis of the background of these
>> problems, but there seem to be some effects on MTU-based
>> fragmentation and UDP checksums.
>>
>> There is a TSO-related bug on Launchpad, but the context of that bug
>> is too narrow for the generality of the problem.
>>
>> It also seems that there is a problem in LXC contexts too (I found
>> such a reference, without a detailed description, in a post about a
>> Xen setup).
>>
>> My question now is: is there a bug in the driver code, and shouldn't
>> this be documented somewhere on wiki.qemu.org? Were there
>> developments on this topic in the past, or is there any
>> planned/ongoing work to do on the qemu drivers?
>>
>> Most problem reports found relate to deprecated CentOS 6 qemu-kvm
>> packages.
>>
>> In our company we have similar or even worse problems with CentOS 7
>> hosts and guest machines.
>
> You haven't explained what problem you are experiencing. If you want
> help with your setup please include your QEMU command-line (ps aux |
> grep qemu), the traffic pattern (ideally how to reproduce it with a
> benchmarking tool), and what observation you are making (e.g. netstat
> counters showing dropped packets).
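For reference, this is roughly how we collect that information on our
side (a sketch; ens3 is the guest NIC from the log further down):

    # QEMU command line of the running guest
    # (the [q] avoids matching the grep process itself)
    ps aux | grep [q]emu

    # Per-interface packet, error and drop counters in the guest
    ip -s link show ens3

    # Protocol-level counters (retransmissions, bad segments, ...)
    netstat -s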
I was quite astonished about the many hints concerning virtio drivers,
as we hit this problem with the e1000 driver in a CentOS 7 guest on a
CentOS 6 host:

e1000 0000:00:03.0 ens3: Detected Tx Unit Hang
  Tx Queue <0>
  TDH <42>
  TDT <42>
  next_to_use <2e>
  next_to_clean <42>
buffer_info[next_to_clean]
  time_stamp <104aff1b8>
  next_to_watch <44>
  jiffies <104b00ee9>
  next_to_watch.status <0>
Apr 25 21:08:48 db03 kernel: ------------[ cut here ]------------
Apr 25 21:08:48 db03 kernel: WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x270/0x280()
Apr 25 21:08:48 db03 kernel: NETDEV WATCHDOG: ens3 (e1000): transmit queue 0 timed out
Apr 25 21:08:48 db03 kernel: Modules linked in: binfmt_misc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables btrfs zlib_deflate raid6_pq xor ext4 mbcache jbd2 crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper i2c_piix4 ppdev cryptd pcspkr virtio_balloon parport_pc parport sg nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic ata_generic pata_acpi virtio_scsi cirrus syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm crct10dif_pclmul crct10dif_common ata_piix crc32c_intel virtio_pci e1000 i2c_core virtio_ring libata serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
Apr 25 21:08:48 db03 kernel: CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.10.0-327.13.1.el7.x86_64 #1
Apr 25 21:08:48 db03 kernel: Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
Apr 25 21:08:48 db03 kernel: ffff88126f483d88 685d892e8a452abb ffff88126f483d40 ffffffff8163571c
Apr 25 21:08:48 db03 kernel: ffff88126f483d78 ffffffff8107b200 0000000000000000 ffff881203b9a000
Apr 25 21:08:48 db03 kernel: ffff881201c3e080 0000000000000001 0000000000000002 ffff88126f483de0
Apr 25 21:08:48 db03 kernel: Call Trace:
Apr 25 21:08:48 db03 kernel: [] dump_stack+0x19/0x1b
Apr 25 21:08:48 db03 kernel: [] warn_slowpath_common+0x70/0xb0
Apr 25 21:08:48 db03 kernel: [] warn_slowpath_fmt+0x5c/0x80
Apr 25 21:08:48 db03 kernel: [] dev_watchdog+0x270/0x280
Apr 25 21:08:48 db03 kernel: [] ? dev_graft_qdisc+0x80/0x80
Apr 25 21:08:48 db03 kernel: [] call_timer_fn+0x36/0x110
Apr 25 21:08:48 db03 kernel: [] ? dev_graft_qdisc+0x80/0x80
Apr 25 21:08:48 db03 kernel: [] run_timer_softirq+0x237/0x340
Apr 25 21:08:48 db03 kernel: [] __do_softirq+0xef/0x280
Apr 25 21:08:48 db03 kernel: [] call_softirq+0x1c/0x30
Apr 25 21:08:48 db03 kernel: [] do_softirq+0x65/0xa0
Apr 25 21:08:48 db03 kernel: [] irq_exit+0x115/0x120
Apr 25 21:08:48 db03 kernel: [] smp_apic_timer_interrupt+0x45/0x60
Apr 25 21:08:48 db03 kernel: [] apic_timer_interrupt+0x6d/0x80
Apr 25 21:08:48 db03 kernel: [] ? native_safe_halt+0x6/0x10
Apr 25 21:08:48 db03 kernel: [] default_idle+0x1f/0xc0
Apr 25 21:08:48 db03 kernel: [] arch_cpu_idle+0x26/0x30
Apr 25 21:08:48 db03 kernel: [] cpu_startup_entry+0x245/0x290
Apr 25 21:08:48 db03 kernel: [] start_secondary+0x1ba/0x230
Apr 25 21:08:48 db03 kernel: ---[ end trace 71ac4360272e207e ]---
Apr 25 21:08:48 db03 kernel: e1000 0000:00:03.0 ens3: Reset adapter

I'm still not sure why this happens on host "db03" while db02 and db01
are not affected. All guests are running on different hosts and the
network is controlled by an openvswitch.
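A quick way to check whether the offload settings actually differ
between the three hosts would be something like this (a sketch; it
assumes eth0 is the uplink attached to the openvswitch bridge on each
host):

    # Compare segmentation offload state across the three hosts
    for h in db01 db02 db03; do
        echo "== $h =="
        ssh "$h" "ethtool -k eth0 | grep segmentation"
    done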