From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 27 Aug 2012 16:23:57 +0900
From: Yeongkyoon Lee <yeongkyoon.lee@samsung.com>
In-reply-to: <50140795.5030209@samsung.com>
Message-id: <503B208D.2020407@samsung.com>
MIME-version: 1.0
Content-type: text/plain; charset=UTF-8; format=flowed
Content-transfer-encoding: QUOTED-PRINTABLE
References: <1343201734-12062-1-git-send-email-yeongkyoon.lee@samsung.com>
	<1343201734-12062-4-git-send-email-yeongkyoon.lee@samsung.com>
	<500FFBE0.70700@twiddle.net> <50140795.5030209@samsung.com>
Subject: Re: [Qemu-devel] [RFC][PATCH v4 3/3] tcg: Optimize qemu_ld/st by
 generating slow paths at the end of a block
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Richard Henderson <rth@twiddle.net>
Cc: blauwirbel@gmail.com, sw@weilnetz.de, laurent.desnogues@gmail.com, qemu-devel@nongnu.org, peter.maydell@linaro.org

On 2012-07-29 00:39, Yeongkyoon Lee wrote:
> On 2012-07-25 23:00, Richard Henderson wrote:
>> On 07/25/2012 12:35 AM, Yeongkyoon Lee wrote:
>>> +#if defined(CONFIG_QEMU_LDST_OPTIMIZATION) && defined(CONFIG_SOFTMMU)
>>> +/* Macros/structures for qemu_ld/st IR code optimization:
>>> +   TCG_MAX_HELPER_LABELS is defined as same as OPC_BUF_SIZE in exec-all.h. */
>>> +#define TCG_MAX_QEMU_LDST       640
>> Why statically size this ...
>
> This just followed the other TCG's code style, the allocation of the
> "labels" of "TCGContext" in tcg.c.
>
>
>>
>>> +    /* labels info for qemu_ld/st IRs
>>> +       The labels help to generate TLB miss case codes at the end of TB */
>>> +    TCGLabelQemuLdst *qemu_ldst_labels;
>> ... and then allocate the array dynamically?
>
> ditto.
>
>>
>>> +    /* jne slow_path */
>>> +    /* XXX: How to avoid using OPC_JCC_long for peephole optimization? */
>>> +    tcg_out_opc(s, OPC_JCC_long + JCC_JNE, 0, 0, 0);
>> You can't, not and maintain the code-generate-until-address-reached
>> exception invariant.
>>
>>> +#ifndef CONFIG_QEMU_LDST_OPTIMIZATION
>>>   uint8_t __ldb_mmu(target_ulong addr, int mmu_idx);
>>>   void __stb_mmu(target_ulong addr, uint8_t val, int mmu_idx);
>>>   uint16_t __ldw_mmu(target_ulong addr, int mmu_idx);
>>> @@ -28,6 +30,30 @@ void __stl_cmmu(target_ulong addr, uint32_t val, int mmu_idx);
>>>   uint64_t __ldq_cmmu(target_ulong addr, int mmu_idx);
>>>   void __stq_cmmu(target_ulong addr, uint64_t val, int mmu_idx);
>>>   #else
>>> +/* Extended versions of MMU helpers for qemu_ld/st optimization.
>>> +   The additional argument is a host code address accessing guest memory */
>>> +uint8_t ext_ldb_mmu(target_ulong addr, int mmu_idx, uintptr_t ra);
>> Don't tie LDST_OPTIMIZATION directly to the extended function calls.
>>
>> For a host supporting predication, like ARM, the best code sequence
>> may look like
>>
>>     (1) TLB check
>>     (2) If hit, load value from memory
>>     (3) If miss, call miss case (5)
>>     (4) ... next code
>>     ...
>>     (5) Load call parameters
>>     (6) Tail call (aka jump) to MMU helper
>>
>> so that (a) we need not explicitly load the address of (3) by hand
>> for your RA parameter and (b) the mmu helper returns directly to (4).
>>
>>
>> r~
>
> The difference between current HEAD and the code sequence you said is,
> I think, code locality.
> My LDST_OPTIMIZATION patches enhances the code locality and also
> removes one jump.
> It shows about 4% rising of CoreMark performance on x86 host which
> supports predication like ARM.
> Probably, the performance enhancement for AREG0 cases might get more
> larger.
> I'm not sure where the performance enhancement came from now, and I'll
> check it by some tests later.
>
> In my humble opinion, there are no things to lose in LDST_OPTIMIZATION
> except for just adding one argument to MMU helper implicitly which
> doesn't look so critical.
> How about your opinion?
>
> Thanks.
>

It's been a long time.

I've measured how much the one-jump difference matters on the fast path
of qemu_ld/st (TLB hit).
The result shows a 3.6% CoreMark improvement from removing one jump,
with the slow paths generated at the end of the block in both cases.
That means removing one jump accounts for most of the performance
improvement from LDST_OPTIMIZATION.
As a result, the extended MMU helper functions are needed to attain that
performance gain, and those extended functions are only used implicitly.
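
To make the shape of the sequence concrete, here is a minimal C model of
it. It is only a sketch, not the generated host code and not the patch
itself: the toy TLB, guest_ram, and the names sketch_ldb and
sketch_ext_ldb_mmu are hypothetical stand-ins, and the bounds check is a
placeholder for real fault handling. It shows the hit case staying
inline with no extra jump, and the miss case going out of line to a
helper that receives the extra "ra" argument.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_BITS 12
#define PAGE_MASK (~(uint32_t)((1u << PAGE_BITS) - 1))
#define TLB_SIZE  256

/* Toy software TLB entry (hypothetical layout, not QEMU's CPUTLBEntry). */
typedef struct {
    int      valid;       /* zero-initialized entries never match */
    uint32_t guest_page;  /* page-aligned guest address */
    uint8_t *host_page;   /* host memory backing that guest page */
} TLBEntry;

static TLBEntry  tlb[TLB_SIZE];
static uint8_t   guest_ram[4u << PAGE_BITS];  /* four guest pages */
static unsigned  slow_path_calls;             /* how often the miss path ran */
static uintptr_t last_miss_ra;                /* "ra" seen by the last miss */

/* Slow path, modelled on the extended helper's signature: the extra
   argument identifies the host code location that accessed guest memory. */
static uint8_t sketch_ext_ldb_mmu(uint32_t addr, int mmu_idx, uintptr_t ra)
{
    TLBEntry *e = &tlb[(addr >> PAGE_BITS) % TLB_SIZE];

    (void)mmu_idx;
    slow_path_calls++;
    last_miss_ra = ra;

    if (addr >= sizeof(guest_ram)) {
        return 0;  /* placeholder: a real helper would raise a guest fault */
    }

    /* Refill the TLB entry, then perform the load. */
    e->valid      = 1;
    e->guest_page = addr & PAGE_MASK;
    e->host_page  = &guest_ram[addr & PAGE_MASK];
    return e->host_page[addr & ~PAGE_MASK];
}

/* Fast path: one compare, and on a hit the load falls straight through.
   Only a miss leaves the straight-line code for the out-of-line helper. */
static uint8_t sketch_ldb(uint32_t addr, int mmu_idx, uintptr_t ra)
{
    TLBEntry *e = &tlb[(addr >> PAGE_BITS) % TLB_SIZE];

    if (e->valid && e->guest_page == (addr & PAGE_MASK)) {
        return e->host_page[addr & ~PAGE_MASK];
    }
    return sketch_ext_ldb_mmu(addr, mmu_idx, ra);
}
```

In this model the first load from a page takes the miss path once and
records its "ra"; later loads from the same page hit inline and never
touch the helper.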

BTW, who will finally confirm my patches?
I have sent four versions of my patches, in which I have applied all the
reasonable feedback from this community.
Currently, v4 is the final candidate, though it might need to be merged
with the latest HEAD because it was sent one month ago.

Thanks.