From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50190)
    by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
    id 1cX7wI-0004vN-Qf
    for qemu-devel@nongnu.org; Fri, 27 Jan 2017 09:55:51 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
    (envelope-from )
    id 1cX7wE-0001oN-8B
    for qemu-devel@nongnu.org; Fri, 27 Jan 2017 09:55:46 -0500
Received: from mail-wm0-x232.google.com ([2a00:1450:400c:c09::232]:38319)
    by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71)
    (envelope-from )
    id 1cX7wD-0001oD-Uz
    for qemu-devel@nongnu.org; Fri, 27 Jan 2017 09:55:42 -0500
Received: by mail-wm0-x232.google.com with SMTP id r144so141801152wme.1
    for ; Fri, 27 Jan 2017 06:55:41 -0800 (PST)
References: <1484644078-21312-1-git-send-email-batuzovk@ispras.ru>
From: Alex Bennée
In-reply-to: <1484644078-21312-1-git-send-email-batuzovk@ispras.ru>
Date: Fri, 27 Jan 2017 14:55:39 +0000
Message-ID: <87r33o8sd0.fsf@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Subject: Re: [Qemu-devel] [PATCH 00/18] Emulate guest vector operations with host vector operations
To: Kirill Batuzov
Cc: qemu-devel@nongnu.org, Peter Maydell, Peter Crosthwaite, Paolo Bonzini, Richard Henderson

Kirill Batuzov writes:

> The goal of this patch series is to set up an infrastructure to emulate
> guest vector operations using host vector operations. Preliminary
> experiments show that simply translating loads and stores increases
> performance of the x264 video codec by 10%. The performance of a
> gcc-vectorized for loop increased 2x.
>
> To be able to emulate guest vector operations using host vector operations,
> several things need to be done.

I see rth has already done a bunch of review so I'll pass on this cycle
but please feel free to add me to the CC list next iteration.

>
> 1. Corresponding vector types should be added to TCG. This series adds
> TCG_v128 and TCG_v64. I've made TCG_v64 a different type than TCG_i64
> because it usually needs to be allocated to different registers and
> supports different operations.
>
> 2. Load/store operations for these new types need to be implemented.
>
> 3. For a seamless transition from the current model to the new one we need
> to handle cases where memory occupied by a global variable can be accessed
> via a pointer to the CPUArchState structure. A very simple conservative
> alias analysis has been added to do it. This analysis tracks memory loads
> and stores that overlap with fields of CPUArchState and provides this
> information to the register allocator. The allocator then spills and
> reloads affected globals when needed.
>
> 4. Allow overlapping globals. For scalar registers this is a rare case, and
> overlapping registers can be handled as a single one (ah, al, ax, eax,
> rax). In ARM every Q-register consists of two D-registers, each consisting
> of two S-registers. Handling 4 S-registers as one because they are parts of
> the same Q-register is way too inefficient.
>
> 5. Add a new memory addressing mode to the MMU code for large accesses and
> create the needed helpers. Only 128-bit vectors have been handled for now.
>
> 6. Create TCG opcodes for vector operations. Only addition has been handled
> in this series. Each operation has a wrapper that checks if the backend
> supports the corresponding operation or not. In one case the vector opcode
> is generated, in the other the operation is emulated with scalar
> operations. The emulation code is generated inline for performance reasons
> (there is a huge performance difference between inline generation
> and calling a helper). As a positive side effect this will eventually allow
> merging similar emulation code for vector instructions from different
> frontends into a target-independent implementation.
>
> 7. Use the new operations in the frontend (ARM was used in this series).
>
> 8. Support the new operations in the backend (x86_64 was used in this
> series).
>
> For experiments I have used an ARM guest on an x86_64 host. I wanted a pair
> of different architectures that both have vector extensions. The ARM and
> x86_64 pair fits well.
>
> Kirill Batuzov (18):
>   tcg: add support for 128bit vector type
>   tcg: add support for 64bit vector type
>   tcg: add ld_v128, ld_v64, st_v128 and st_v64 opcodes
>   tcg: add simple alias analysis
>   tcg: use results of alias analysis in liveness analysis
>   tcg: allow globals to overlap
>   tcg: add vector addition operations
>   target/arm: support access to vector guest registers as globals
>   target/arm: use vector opcode to handle vadd. instruction
>   tcg/i386: add support for vector opcodes
>   tcg/i386: support 64-bit vector operations
>   tcg/i386: support remaining vector addition operations
>   tcg: do not relay on exact values of MO_BSWAP or MO_SIGN in backend
>   tcg: introduce new TCGMemOp - MO_128
>   tcg: introduce qemu_ld_v128 and qemu_st_v128 opcodes
>   softmmu: create helpers for vector loads
>   tcg/i386: add support for qemu_ld_v128/qemu_st_v128 ops
>   target/arm: load two consecutive 64-bits vector regs as a 128-bit
>     vector reg
>
>  cputlb.c                     |   4 +
>  softmmu_template_vector.h    | 266 +++++++++++++++++++++++++++++++++++++++++++
>  target/arm/translate.c       |  89 ++++++++++++++-
>  tcg/aarch64/tcg-target.inc.c |   4 +-
>  tcg/arm/tcg-target.inc.c     |   4 +-
>  tcg/i386/tcg-target.h        |  35 +++++-
>  tcg/i386/tcg-target.inc.c    | 245 ++++++++++++++++++++++++++++++++++++---
>  tcg/mips/tcg-target.inc.c    |   4 +-
>  tcg/optimize.c               | 146 ++++++++++++++++++++++++
>  tcg/ppc/tcg-target.inc.c     |   4 +-
>  tcg/s390/tcg-target.inc.c    |   4 +-
>  tcg/sparc/tcg-target.inc.c   |  12 +-
>  tcg/tcg-op.c                 |  20 +++-
>  tcg/tcg-op.h                 | 262 ++++++++++++++++++++++++++++++++++++++++++
>  tcg/tcg-opc.h                |  34 ++++++
>  tcg/tcg.c                    | 146 ++++++++++++++++++++++++
>  tcg/tcg.h                    | 147 +++++++++++++++++++++++-
>  17 files changed, 1385 insertions(+), 41 deletions(-)
>  create mode 100644 softmmu_template_vector.h

--
Alex Bennée
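
For readers of the archive, point 6 of the quoted cover letter (a wrapper that uses the vector opcode when the backend supports it and otherwise emulates the operation with scalar operations) can be sketched roughly as below. This is only an illustration and is not taken from the patches: Vec128, Backend, add_i32x4_scalar and gen_add_i32x4 are hypothetical names, and the sketch computes values at run time, whereas the real wrappers would emit TCG ops at translation time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 128-bit vector value, viewed as four 32-bit lanes. */
typedef struct {
    uint32_t lane[4];
} Vec128;

/* Hypothetical backend descriptor: reports whether a native 4x32-bit
 * vector add is available and, if so, how to invoke it. */
typedef struct {
    bool has_add_i32x4;
    void (*add_i32x4)(Vec128 *d, const Vec128 *a, const Vec128 *b);
} Backend;

/* Scalar emulation: four independent 32-bit additions.  Unsigned
 * arithmetic wraps modulo 2^32, matching a lane-wise integer add. */
static inline void add_i32x4_scalar(Vec128 *d, const Vec128 *a,
                                    const Vec128 *b)
{
    for (int i = 0; i < 4; i++) {
        d->lane[i] = a->lane[i] + b->lane[i];
    }
}

/* Wrapper: take the native path when the backend advertises it,
 * otherwise fall back to the inline scalar emulation. */
static inline void gen_add_i32x4(const Backend *be, Vec128 *d,
                                 const Vec128 *a, const Vec128 *b)
{
    assert(be && d && a && b);
    if (be->has_add_i32x4 && be->add_i32x4) {
        be->add_i32x4(d, a, b);
    } else {
        add_i32x4_scalar(d, a, b);
    }
}
```

Keeping the fallback as an inline function rather than a call through a pointer mirrors the cover letter's remark that inline generation is much cheaper than calling a helper; a frontend would only ever call the wrapper and never needs to know which path was taken.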