Date: Sun, 29 Jul 2012 00:39:01 +0900
From: Yeongkyoon Lee
In-reply-to: <500FFBE0.70700@twiddle.net>
Message-id: <50140795.5030209@samsung.com>
References: <1343201734-12062-1-git-send-email-yeongkyoon.lee@samsung.com> <1343201734-12062-4-git-send-email-yeongkyoon.lee@samsung.com> <500FFBE0.70700@twiddle.net>
Subject: Re: [Qemu-devel] [RFC][PATCH v4 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
To: Richard Henderson
Cc: blauwirbel@gmail.com, sw@weilnetz.de, laurent.desnogues@gmail.com, qemu-devel@nongnu.org, peter.maydell@linaro.org

On 25 Jul 2012 23:00,
Richard Henderson wrote:
> On 07/25/2012 12:35 AM, Yeongkyoon Lee wrote:
>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>> +/* Macros/structures for qemu_ld/st IR code optimization:
>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>> +#define TCG_MAX_QEMU_LDST 640
> Why statically size this ...

This just follows the existing TCG code style, namely the allocation of
"labels" in "TCGContext" in tcg.c.

>
>> +    /* labels info for qemu_ld/st IRs
>> +       The labels help to generate TLB miss case codes at the end of TB */
>> +    TCGLabelQemuLdst *qemu_ldst_labels;
> ... and then allocate the array dynamically?

Ditto.

>
>> +    /* jne slow_path */
>> +    /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
>> +    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
> You can't, not and maintain the code-generate-until-address-reached
> exception invariant.
>
>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>>  uint8_t __ldb_mmu(target_ulong addr, int mmu_idx);
>>  void __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
>>  uint16_t __ldw_mmu(target_ulong addr, int mmu_idx);
>> @@ -28,6 +30,30 @@ void __stl_cmmu(target_ulong addr, uint32_t val, int mmu_idx);
>>  uint64_t __ldq_cmmu(target_ulong addr, int mmu_idx);
>>  void __stq_cmmu(target_ulong addr, uint64_t val, int mmu_idx);
>>  #else
>> +/* Extended versions of MMU helpers for qemu_ld/st optimization.
>> +   The additional argument is a host code address accessing guest memory */
>> +uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra);
> Don't tie LDST_OPTIMIZATION directly to the extended function calls.
>
> For a host supporting predication, like ARM, the best code sequence
> may look like
>
>     (1) TLB check
>     (2) If hit, load value from memory
>     (3) If miss, call miss case (5)
>     (4) ... next code
>     ...
>     (5) Load call parameters
>     (6) Tail call (aka jump) to MMU helper
>
> so that (a) we need not explicitly load the address of (3) by hand
> for your RA parameter and (b) the mmu helper returns directly to (4).
>
>
> r~

I think the difference between the current HEAD and the code sequence you
describe is code locality.
My LDST_OPTIMIZATION patches improve code locality and also remove one jump.
They show about a 4% CoreMark performance gain on an x86 host which supports
predication like ARM.
The performance gain for AREG0 cases will probably be even larger.
I'm not yet sure where the performance gain comes from, and I'll check it
with some tests later.

In my humble opinion, there is nothing to lose with LDST_OPTIMIZATION except
for implicitly adding one argument to the MMU helpers, which doesn't look so
critical.
What is your opinion?

Thanks.
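
For readers following the thread, here is a minimal sketch of the general idea under discussion: emit only the TLB-hit fast path inline, record a label per guest load, and generate the TLB-miss code once the block body is complete. All names here (Block, LdstLabel, emit_qemu_ld, finish_block, the OP_* kinds) are hypothetical and model instructions as array entries; this is not the patch's code or QEMU's actual API. The slow path here uses a call followed by a jump back, which is an assumption of the sketch and differs from the tail-call sequence quoted above.

```c
/* Toy model: "host code" is an array of ops, jump targets are indices.
   Every identifier below is hypothetical, chosen for illustration only. */
enum { MAX_LDST_LABELS = 8, MAX_OPS = 64 };

typedef enum {
    OP_TLB_CHECK, OP_JNE_SLOW, OP_FAST_LOAD, OP_OTHER,
    OP_LOAD_ARGS, OP_CALL_HELPER, OP_JMP_BACK
} OpKind;

typedef struct {
    OpKind kind;
    int target;              /* jump target index, or -1 if none/unpatched */
} Op;

typedef struct {
    int branch_idx;          /* index of the jne that must be patched */
    int resume_idx;          /* index the slow path jumps back to */
} LdstLabel;

typedef struct {
    Op ops[MAX_OPS];
    int nb_ops;
    LdstLabel labels[MAX_LDST_LABELS];
    int nb_labels;
} Block;

/* Append one op; the caller is responsible for checking capacity. */
static int emit(Block *b, OpKind kind, int target)
{
    b->ops[b->nb_ops].kind = kind;
    b->ops[b->nb_ops].target = target;
    return b->nb_ops++;
}

/* Emit the inline fast path for one guest load and remember where its
   slow path has to be hooked in later. Returns 0, or -1 if out of space. */
static int emit_qemu_ld(Block *b)
{
    if (b->nb_labels >= MAX_LDST_LABELS || b->nb_ops + 3 > MAX_OPS) {
        return -1;
    }
    emit(b, OP_TLB_CHECK, -1);
    int br = emit(b, OP_JNE_SLOW, -1);       /* target patched later */
    emit(b, OP_FAST_LOAD, -1);
    LdstLabel *l = &b->labels[b->nb_labels++];
    l->branch_idx = br;
    l->resume_idx = b->nb_ops;               /* op following the fast path */
    return 0;
}

/* Emit all pending slow paths after the block body and patch each jne.
   The block body must already end with its own terminating op. */
static int finish_block(Block *b)
{
    for (int i = 0; i < b->nb_labels; i++) {
        if (b->nb_ops + 3 > MAX_OPS) {
            return -1;
        }
        LdstLabel *l = &b->labels[i];
        int start = emit(b, OP_LOAD_ARGS, -1);
        emit(b, OP_CALL_HELPER, -1);
        emit(b, OP_JMP_BACK, l->resume_idx);
        b->ops[l->branch_idx].target = start;
    }
    b->nb_labels = 0;
    return 0;
}
```

With two loads, each followed by one other op, the fast paths stay contiguous at the start of the array and the two miss sequences land after them, which is the code-locality effect described above.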