Date: Mon, 27 Aug 2012 16:23:57 +0900
From: Yeongkyoon Lee
To: Richard Henderson
Cc: blauwirbel@gmail.com, sw@weilnetz.de, laurent.desnogues@gmail.com, qemu-devel@nongnu.org, peter.maydell@linaro.org
Subject: Re: [Qemu-devel] [RFC][PATCH v4 3/3] tcg: Optimize qemu_ld/st by generating slow paths at the end of a block
Message-id: <503B208D.2020407@samsung.com>
In-reply-to: <50140795.5030209@samsung.com>
References: <1343201734-12062-1-git-send-email-yeongkyoon.lee@samsung.com> <1343201734-12062-4-git-send-email-yeongkyoon.lee@samsung.com> <500FFBE0.70700@twiddle.net> <50140795.5030209@samsung.com>

On 2012-07-29 00:39, Yeongkyoon Lee wrote:
> On 2012-07-25 23:00, Richard Henderson wrote:
>> On 07/25/2012 12:35 AM, Yeongkyoon Lee wrote:
>>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>>> +/* Macros/structures for qemu_ld/st IR code optimization:
>>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>>> +#define TCG_MAX_QEMU_LDST 640
>> Why statically size this ...
>
> This just followed the other TCG's code style, the allocation of the
> "labels" of "TCGContext" in tcg.c.
>
>>
>>> + /* labels info for qemu_ld/st IRs
>>> +    The labels help to generate TLB miss case codes at the end of TB */
>>> + TCGLabelQemuLdst *qemu_ldst_labels;
>> ... and then allocate the array dynamically?
>
> ditto.
>
>>
>>> + /* jne slow_path */
>>> + /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
>>> + tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
>> You can't, not and maintain the code-generate-until-address-reached
>> exception invariant.
>>
>>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>>> uint8_t __ldb_mmu(target_ulong addr, int mmu_idx);
>>> void __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
>>> uint16_t __ldw_mmu(target_ulong addr, int mmu_idx);
>>> @@ -28,6 +30,30 @@ void __stl_cmmu(target_ulong addr, uint32_t val, int mmu_idx);
>>> uint64_t __ldq_cmmu(target_ulong addr, int mmu_idx);
>>> void __stq_cmmu(target_ulong addr, uint64_t val, int mmu_idx);
>>> #else
>>> +/* Extended versions of MMU helpers for qemu_ld/st optimization.
>>> +   The additional argument is a host code address accessing guest memory */
>>> +uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra);
>> Don't tie LDST_OPTIMIZATION directly to the extended function calls.
>>
>> For a host supporting predication, like ARM, the best code sequence
>> may look like
>>
>> (1) TLB check
>> (2) If hit, load value from memory
>> (3) If miss, call miss case (5)
>> (4) ... next code
>> ...
>> (5) Load call parameters
>> (6) Tail call (aka jump) to MMU helper
>>
>> so that (a) we need not explicitly load the address of (3) by hand
>> for your RA parameter and (b) the mmu helper returns directly to (4).
>>
>>
>> r~
>
> The difference between current HEAD and the code sequence you said is,
> I think, code locality.
> My LDST_OPTIMIZATION patches enhances the code locality and also
> removes one jump.
> It shows about 4% rising of CoreMark performance on x86 host which
> supports predication like ARM.
> Probably, the performance enhancement for AREG0 cases might get more
> larger.
> I'm not sure where the performance enhancement came from now, and I'll
> check it by some tests later.
>
> In my humble opinion, there are no things to lose in LDST_OPTIMIZATION
> except
> for just adding one argument to MMU helper implicitly which doesn't
> look so critical.
> How about your opinion?
>
> Thanks.
>

It's been a long time.

I have measured the effect of the one-jump difference on the fast path
of qemu_ld/st (TLB hit).
Removing that one jump gives a 3.6% CoreMark improvement, with the slow
paths generated at the end of the block in both cases.
This means that removing the jump accounts for most of the performance
gain from LDST_OPTIMIZATION (about 4% in total, as I reported earlier).
Consequently, the extended MMU helper functions are needed to obtain
that gain, and they are used only implicitly.

BTW, who will finally confirm my patches?
I have sent four versions of my patches, applying all the reasonable
feedback from this community.
Currently v4 is the final candidate, though it may need to be merged with
the latest HEAD because it was sent a month ago.

Thanks.
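
For readers following the thread, the layout being discussed (fast path inline, slow paths collected at the end of the block) can be sketched as a toy code generator in plain C. This is only an illustration under stated assumptions, not the code from the patch: the opcodes, the Block and LdstLabel types, and the functions emit(), emit_qemu_ld() and emit_slow_paths() are all hypothetical. Only the limit of 640 labels is taken from TCG_MAX_QEMU_LDST above, and the value passed to the helper merely stands in for the "ra" argument of the extended MMU helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: opcodes, struct fields and function names are hypothetical. */
enum { OP_TLB_CHECK, OP_JNE, OP_LOAD, OP_CALL_HELPER, OP_JMP, OP_OTHER };

#define MAX_LDST_LABELS 640   /* mirrors TCG_MAX_QEMU_LDST from the patch */
#define CODE_BUF_SIZE   4096

typedef struct {
    size_t patch_at;  /* index of the JNE operand to fill in later */
    size_t raddr;     /* index of the fast-path code following the access */
} LdstLabel;

typedef struct {
    size_t code[CODE_BUF_SIZE];   /* toy "instruction" words */
    size_t len;
    LdstLabel labels[MAX_LDST_LABELS];
    size_t nb_labels;
} Block;

/* Append one word to the block's code buffer. */
static void emit(Block *b, size_t word)
{
    assert(b->len < CODE_BUF_SIZE);
    b->code[b->len++] = word;
}

/* Fast path: TLB check, forward "jne slow_path", then the inline load. */
static void emit_qemu_ld(Block *b)
{
    assert(b->nb_labels < MAX_LDST_LABELS);
    LdstLabel *l = &b->labels[b->nb_labels++];

    emit(b, OP_TLB_CHECK);
    emit(b, OP_JNE);
    l->patch_at = b->len;
    emit(b, 0);              /* placeholder target, patched at block end */
    emit(b, OP_LOAD);
    l->raddr = b->len;       /* execution resumes here after a TLB miss */
}

/* Slow paths: emitted once, after all fast-path code of the block. */
static void emit_slow_paths(Block *b)
{
    for (size_t i = 0; i < b->nb_labels; i++) {
        const LdstLabel *l = &b->labels[i];

        b->code[l->patch_at] = b->len;  /* resolve the forward jump */
        emit(b, OP_CALL_HELPER);
        emit(b, l->raddr);              /* stand-in for the extra "ra" argument */
        emit(b, OP_JMP);
        emit(b, l->raddr);              /* jump back into the fast path */
    }
    b->nb_labels = 0;
}
```

A caller would invoke emit_qemu_ld() for each guest load while generating the block and emit_slow_paths() once at the end. In this model the TLB-hit path then contains no helper call and no extra jump, which is the property the measurement above is concerned with.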