From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:37774) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1a1txX-0002xl-Py for qemu-devel@nongnu.org; Thu, 26 Nov 2015 05:39:28 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1a1txW-0002il-P4 for qemu-devel@nongnu.org; Thu, 26 Nov 2015 05:39:27 -0500
Sender: Paolo Bonzini
References: <1448388091-117282-1-git-send-email-pbonzini@redhat.com> <5656D2B9.3010802@de.ibm.com>
From: Paolo Bonzini
Message-ID: <5656E158.3090505@redhat.com>
Date: Thu, 26 Nov 2015 11:39:20 +0100
MIME-Version: 1.0
In-Reply-To: <5656D2B9.3010802@de.ibm.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH 00/40] Sneak peek of virtio and dataplane changes for 2.6
To: Christian Borntraeger , qemu-devel@nongnu.org, qemu-block@nongnu.org
Cc: mlin@kernel.org, famz@redhat.com, ming.lei@canonical.com, stefanha@redhat.com, mst@redhat.com

On 26/11/2015 10:36, Christian Borntraeger wrote:
> For some unknown reason, this seems to be slightly slower than 2.5-rc1 on my
> old z196. (have not yet tested the z13)
>
> your branch is certainly better regarding malloc, but worse regarding others.

Thanks for taking the time to test this!

This is correct, see the cover letter:

"[Patches 14 to 16 remove] the duplicate dataplane-specific
implementation of virtio in favor of the regular one that is already
used for non-dataplane. While the dataplane implementation is slightly
more optimized, I chose to keep the other one to avoid another "touch
all virtio devices" series. Patch 10 alone mostly brings performance
on par between the two. The remaining 7-8% can be recovered by mostly
getting rid of tiny address_space_* operations, keeping the rings
always mapped.
Note that the rest of this big series does bring a little performance
improvement, and already makes up for the lost performance."

The profile shows that the culprit is the repeated access to the virtio ring:

  3.99%  qemu-system-s39  libc-2.18.so       [.] __memcpy_z196
  2.66%  qemu-system-s39  qemu-system-s390x  [.] address_space_lduw_le
  2.51%  qemu-system-s39  qemu-system-s390x  [.] address_space_map
  2.51%  qemu-system-s39  qemu-system-s390x  [.] phys_page_find
  2.24%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_ptr
  2.18%  qemu-system-s39  qemu-system-s390x  [.] address_space_translate_internal
  1.91%  qemu-system-s39  qemu-system-s390x  [.] qemu_coroutine_switch
  1.66%  qemu-system-s39  qemu-system-s390x  [.] address_space_rw
  1.63%  qemu-system-s39  qemu-system-s390x  [.] address_space_stw_le
  1.57%  qemu-system-s39  qemu-system-s390x  [.] address_space_stl_le
  1.57%  qemu-system-s39  qemu-system-s390x  [.] address_space_translate
  1.45%  qemu-system-s39  qemu-system-s390x  [.] virtqueue_pop
  0.91%  qemu-system-s39  qemu-system-s390x  [.] qemu_ram_block_from_host
  0.79%  qemu-system-s39  qemu-system-s390x  [.] vring_desc_read
  0.76%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_block
 -----------
 28.33%

  3.30%  qemu-system-s39  libc-2.18.so       [.] __memcpy_z196
  2.83%  qemu-system-s39  qemu-system-s390x  [.] memory_region_find_rcu
  2.72%  qemu-system-s39  qemu-system-s390x  [.] vring_pop
  1.37%  qemu-system-s39  qemu-system-s390x  [.] address_space_rw
  1.37%  qemu-system-s39  qemu-system-s390x  [.] qemu_get_ram_ptr
  1.18%  qemu-system-s39  qemu-system-s390x  [.] memory_region_find
  0.92%  qemu-system-s39  qemu-system-s390x  [.] get_desc.isra.11
  0.92%  qemu-system-s39  qemu-system-s390x  [.] qemu_ram_block_from_host
  0.84%  qemu-system-s39  qemu-system-s390x  [.] vring_push
 -----------
 15.45%

I would really prefer to get rid of vring.c as soon as the infrastructure
makes it possible---even if it's faster.
We know what makes virtio.c slower, and it's simpler to fix virtio.c than
to convert all the other models to vring.c _plus_ make vring.c safe for
migration.

Paolo
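
For readers who want to see why "tiny address_space_* operations" add up, here is a minimal toy model of the difference between translating a guest address on every ring access and mapping the ring once. It is only a sketch: ToyAddressSpace, read_ring_unmapped and read_ring_mapped are hypothetical names invented for illustration, not QEMU functions, and the model ignores everything a real implementation must handle (for example, the guest memory layout changing while a mapping is held).

```python
"""Toy model: per-access translation vs. a ring that is mapped once.

All names here are hypothetical; this is not QEMU code.
"""


class ToyAddressSpace:
    """Guest memory split into regions; every lookup is counted."""

    def __init__(self, regions):
        # regions: list of (guest_base, bytearray) pairs, non-overlapping
        self.regions = sorted(regions, key=lambda r: r[0])
        self.translations = 0

    def translate(self, addr):
        """Return (buffer, offset) for a guest address, counting the lookup."""
        self.translations += 1
        for base, buf in self.regions:
            if base <= addr < base + len(buf):
                return buf, addr - base
        raise ValueError(f"unmapped guest address {addr:#x}")

    def lduw_le(self, addr):
        """Load a 16-bit little-endian value, translating on every call."""
        buf, off = self.translate(addr)
        return int.from_bytes(buf[off:off + 2], "little")

    def map(self, addr, length):
        """Translate once and return a view that needs no further lookups."""
        buf, off = self.translate(addr)
        if off + length > len(buf):
            raise ValueError("mapping crosses a region boundary")
        return memoryview(buf)[off:off + length]


def read_ring_unmapped(aspace, ring_addr, count):
    """Read `count` 16-bit entries, paying one translation per entry."""
    return [aspace.lduw_le(ring_addr + 2 * i) for i in range(count)]


def read_ring_mapped(aspace, ring_addr, count):
    """Read the same entries through a single up-front mapping."""
    ring = aspace.map(ring_addr, 2 * count)
    return [int.from_bytes(ring[2 * i:2 * i + 2], "little")
            for i in range(count)]
```

Both helpers return the same values; the only difference is how often translate() runs, once per entry in the first case and once per ring in the second. That per-access lookup is the toy counterpart of the address_space_* and phys_page_find entries in the first profile.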