From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:36213)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1beRaE-0001By-I0 for qemu-devel@nongnu.org;
	Mon, 29 Aug 2016 14:46:59 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1beRaA-0001lQ-Bi for qemu-devel@nongnu.org;
	Mon, 29 Aug 2016 14:46:57 -0400
Received: from mail-qk0-x236.google.com ([2607:f8b0:400d:c09::236]:34906)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1beRaA-0001lG-4x for qemu-devel@nongnu.org;
	Mon, 29 Aug 2016 14:46:54 -0400
Received: by mail-qk0-x236.google.com with SMTP id v123so147381331qkh.2
	for ; Mon, 29 Aug 2016 11:46:53 -0700 (PDT)
Sender: Richard Henderson
From: Richard Henderson
Date: Mon, 29 Aug 2016 11:46:11 -0700
Message-Id: <1472496380-19706-1-git-send-email-rth@twiddle.net>
Subject: [Qemu-devel] [PATCH v3 0/9] Improve buffer_is_zero
To: qemu-devel@nongnu.org
Cc: pbonzini@redhat.com, vijay.kilari@gmail.com

Changes from v2 to v3:

  * Unit testing.  This includes having x86 attempt all versions of
    the accelerator that will run on the hardware.  Thus an avx2 host
    will run the basic test 5 times (1.5sec on my laptop).

  * Drop the ppc and aarch64 specializations.  I have improved the
    basic integer version to the point that those vectorized versions
    are not a win.

    In the case of my aarch64 mustang, the integer version is 4 times
    faster than the neon version that I delete.  With effort I was
    able to rewrite the neon version to come to within a factor of
    1.1, but it remained slower than the integer.  To be fair, gcc6
    makes very good use of ldp, so the integer path is *also* loading
    16 bytes per insn.

    I can forward my standalone aarch64 benchmark if anyone is
    interested.

Note however that at least the avx2 acceleration is still very much a
win, being about 3 times faster on my laptop.
Of course, it's handling 4 times as much data per loop as the integer
version, so one can still see the overhead caused by using vector
insns.

For grins I wrote an avx512 version, if someone has a skylake upon
which to test and benchmark.  That requires additional configure
checks, so I didn't bother to include it here.


r~


Richard Henderson (9):
  cutils: Move buffer_is_zero and subroutines to a new file
  cutils: Remove SPLAT macro
  cutils: Export only buffer_is_zero
  cutils: Rearrange buffer_is_zero acceleration
  cutils: Add test for buffer_is_zero
  cutils: Add generic prefetch
  cutils: Rewrite x86 buffer zero checking
  cutils: Remove aarch64 buffer zero checking
  cutils: Remove ppc buffer zero checking

 configure                 |  21 +--
 include/qemu/cutils.h     |   3 +-
 migration/ram.c           |   2 +-
 migration/rdma.c          |   5 +-
 tests/Makefile.include    |   3 +
 tests/test-bufferiszero.c |  78 +++++++++++
 util/Makefile.objs        |   1 +
 util/bufferiszero.c       | 332 ++++++++++++++++++++++++++++++++++++++++++++++
 util/cutils.c             | 244 ----------------------------------
 9 files changed, 423 insertions(+), 266 deletions(-)
 create mode 100644 tests/test-bufferiszero.c
 create mode 100644 util/bufferiszero.c

-- 
2.7.4
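
For readers without the tree at hand, here is a rough sketch of the
kind of integer-only zero check the cover letter refers to.  It is
not the code from util/bufferiszero.c; the names load_u64 and
sketch_buffer_is_zero are hypothetical, and the choice of 32 bytes per
iteration is an assumption made for illustration.  The idea shown is
just to OR several 64-bit loads together and branch once per group,
then finish the tail bytewise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 8 bytes with no alignment assumption.
 * Compilers normally turn this memcpy into a single load. */
static inline uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Illustrative sketch only: return true if all len bytes at buf
 * are zero.  An empty buffer counts as zero. */
bool sketch_buffer_is_zero(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    size_t i = 0;

    assert(buf != NULL || len == 0);

    /* Main loop: OR four 64-bit words, test once per 32 bytes. */
    for (; i + 32 <= len; i += 32) {
        uint64_t t = load_u64(p + i)
                   | load_u64(p + i + 8)
                   | load_u64(p + i + 16)
                   | load_u64(p + i + 24);
        if (t != 0) {
            return false;
        }
    }

    /* Tail: fewer than 32 bytes remain, check them one at a time. */
    for (; i < len; i++) {
        if (p[i] != 0) {
            return false;
        }
    }
    return true;
}
```

A unit test in the spirit of the one described above would call such a
function on an all-zero buffer, then flip single bytes at the start,
middle and end and expect it to return false each time.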