From: Paolo Bonzini
Date: Wed, 16 Dec 2015 16:46:05 +0100
Message-ID: <5671873D.6010302@redhat.com>
Subject: Re: [Qemu-devel] [PATCH v2 0/3] virtio: proposal to optimize accesses to VQs
To: Vincenzo Maffione
Cc: "Michael S. Tsirkin", Jason Wang, Markus Armbruster, qemu-devel,
 Giuseppe Lettieri, Luigi Rizzo

On 16/12/2015 15:25, Vincenzo Maffione wrote:
>> vhost-net actually had better performance, so virtio-net dataplane
>> was never committed. As Michael mentioned, in practice on Linux you
>> use vhost, and non-Linux hypervisors you do not use QEMU. :)
>
> Yes, I understand. However, another possible use-case would be using QEMU
> + virtio-net + netmap backend + Linux (e.g. for QEMU-sandboxed packet
> generators or packet processors, where very high packet rates are
> common), where it is not possible to use vhost.

Yes, of course. That was tongue in cheek. Another possibility for your
use case is to interface with netmap through vhost-user, but I'm happy
if you choose to improve virtio.c instead!

>> The main optimization that vring.c has is to cache the translation of
>> the rings. Using address_space_map/unmap for rings in virtio.c would be
>> a noticeable improvement, as your numbers for patch 3 show. However, by
>> caching translations you also conveniently "forget" to promptly mark the
>> pages as dirty. As you pointed out this is obviously an issue for
>> migration. You can then add a notifier for runstate changes. When
>> entering RUN_STATE_FINISH_MIGRATE or RUN_STATE_SAVE_VM the rings would
>> be unmapped, and then remapped the next time the VM starts running again.
>
> Ok so it seems feasible with a bit of care. The numbers we've been
> seeing in various experiments have always shown that this optimization
> could easily double the 2 Mpps packet rate bottleneck.

Cool. Bonus points for nicely abstracting it so that virtio.c is just a
user.

>> You also guessed right that there are consistency issues; for these you
>> can add a MemoryListener that invalidates all mappings.
>
> Yeah, but I don't know exactly what kind of inconsistencies there can
> be. Maybe the memory we are mapping may be hot-unplugged?

Yes. Just blow away all mappings in the MemoryListener commit callback.

>> That said, I'm wondering where the cost of address translation lies---is
>> it cache-unfriendly data structures, locked operations, or simply too
>> much code to execute? It was quite surprising to me that on virtio-blk
>> benchmarks we were spending 5% of the time doing memcpy! (I have just
>> extracted from my branch the patches to remove that, and sent them to
>> qemu-devel).
>
> I feel it's just too much code (but I may be wrong).

That is likely to be a good guess, but notice that the fast path doesn't
actually have _that much_ code, because a lot of the "if"s are almost
always false. Looking at a profile would be useful. Is it flat, or does
something (e.g. address_space_translate) actually stand out?

> I'm not sure whether you are thinking that 5% is too much or too little.
> To me it's too little, showing that most of the overhead is
> somewhere else (e.g. memory translation, or backend processing). In an
> ideal transmission system, most of the overhead should be spent on
> copying, because it means that you successfully managed to suppress
> notifications and translation overhead.

On copying data, though---not on copying virtio descriptors. 5% for
those is entirely wasted time. Also, note that I'm looking at disk I/O
rather than networking, where there should be no copies at all.

Paolo

>> Examples of missing optimizations in exec.c include:
>>
>> * caching enough information in RAM MemoryRegions to avoid the calls to
>> qemu_get_ram_block (e.g. replace mr->ram_addr with a RAMBlock pointer);
>>
>> * adding a MRU cache to address_space_lookup_region.
>>
>> In particular, the former should be easy if you want to give it a
>> try---easier than caching ring translations in virtio.c.
>
> Thank you so much for the insights :)
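
To make the ring-caching scheme discussed in the thread above concrete, here
is a minimal sketch of a cached vring mapping built on
address_space_map()/address_space_unmap(), dropped on migration-related
runstate changes and invalidated from a MemoryListener commit callback. All
names and the overall structure are hypothetical, not the actual virtio.c or
vring.c code; locking, error handling and the exact API details of the QEMU
version under discussion should be double-checked.

/* Rough, hypothetical sketch (not actual QEMU code): cache the host
 * mapping of a virtqueue ring via address_space_map(), and throw the
 * mapping away whenever it could become stale or would hide dirty
 * pages from migration.  Locking and error handling are omitted.
 */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "sysemu/sysemu.h"

typedef struct VRingCache {
    AddressSpace *as;
    hwaddr ring_pa;        /* guest-physical address of the ring */
    hwaddr ring_len;
    void *ring_hva;        /* cached host mapping, NULL when invalid */
    MemoryListener listener;
} VRingCache;

static void vring_cache_unmap(VRingCache *c)
{
    if (c->ring_hva) {
        /* Unmapping marks the pages dirty, so migration sees writes
         * that went through the cached pointer. */
        address_space_unmap(c->as, c->ring_hva, c->ring_len,
                            true, c->ring_len);
        c->ring_hva = NULL;
    }
}

static void *vring_cache_map(VRingCache *c)
{
    if (!c->ring_hva) {
        hwaddr len = c->ring_len;
        void *p = address_space_map(c->as, c->ring_pa, &len, true);
        if (!p) {
            return NULL;
        }
        if (len < c->ring_len) {
            /* Ring not contiguous in a single region: fall back to the
             * slow path instead of caching a partial mapping. */
            address_space_unmap(c->as, p, len, false, 0);
            return NULL;
        }
        c->ring_hva = p;
    }
    return c->ring_hva;
}

/* MemoryListener commit callback: the guest memory map changed (e.g.
 * memory hot-unplug), so blow away the cached mapping. */
static void vring_cache_commit(MemoryListener *listener)
{
    VRingCache *c = container_of(listener, VRingCache, listener);
    vring_cache_unmap(c);
}

/* Runstate notifier: unmap before RAM is serialized so that the dirty
 * bitmap is up to date for migration or savevm. */
static void vring_cache_vm_state_change(void *opaque, int running,
                                        RunState state)
{
    VRingCache *c = opaque;
    if (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_SAVE_VM) {
        vring_cache_unmap(c);
    }
}

static void vring_cache_init(VRingCache *c, AddressSpace *as,
                             hwaddr ring_pa, hwaddr ring_len)
{
    c->as = as;
    c->ring_pa = ring_pa;
    c->ring_len = ring_len;
    c->ring_hva = NULL;
    c->listener = (MemoryListener) {
        .commit = vring_cache_commit,
    };
    memory_listener_register(&c->listener, as);
    qemu_add_vm_change_state_handler(vring_cache_vm_state_change, c);
}

A user such as virtio.c would call vring_cache_map() on the fast path and
fall back to the existing address_space_rw-based accessors whenever it
returns NULL.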
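
Similarly, the MRU cache suggested for address_space_lookup_region() could
look roughly like the one-entry cache below, placed in front of the existing
radix-tree walk in exec.c. This is only an illustrative sketch against
exec.c-internal types (AddressSpaceDispatch, MemoryRegionSection); the helper
names are invented, subpage resolution is ignored, and the cached entry would
have to be cleared whenever the dispatch map is rebuilt.

/* Hypothetical sketch: a one-entry most-recently-used cache in front of
 * address_space_lookup_region().  In a real version the cache would live
 * in the AddressSpaceDispatch and be reset when the map is rebuilt. */
typedef struct MRUSectionCache {
    MemoryRegionSection *section;   /* last hit, or NULL */
} MRUSectionCache;

static inline bool section_covers(MemoryRegionSection *s, hwaddr addr)
{
    /* Assumes the section size fits in 64 bits. */
    return s && s->mr &&
           addr >= s->offset_within_address_space &&
           addr - s->offset_within_address_space < int128_get64(s->size);
}

static MemoryRegionSection *
lookup_region_mru(AddressSpaceDispatch *d, MRUSectionCache *cache,
                  hwaddr addr, bool resolve_subpage)
{
    if (!section_covers(cache->section, addr)) {
        /* Miss: fall back to the existing radix-tree walk. */
        cache->section = address_space_lookup_region(d, addr, resolve_subpage);
    }
    return cache->section;
}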