Re: [PATCH] arm64: atomics: Fix the issue on xchg when switch to atomic instruction

From: Shaokun Zhang <zhangshaokun@hisilicon.com>
To: Will Deacon <will@kernel.org>
Cc: Yuqi Jin <jinyuqi@huawei.com>,
	Andrew Murray <amurray@thegoodpenguin.co.uk>,
	linux-arm-kernel@lists.infradead.org,
	Catalin Marinas <catalin.marinas@arm.com>
Subject: Re: [PATCH] arm64: atomics: Fix the issue on xchg when switch to atomic instruction
Date: Mon, 25 May 2020 17:27:30 +0800	[thread overview]
Message-ID: <d1e62e64-9cda-eb70-42f8-f65e43632add@hisilicon.com> (raw)
In-Reply-To: <a44c1318-bace-36ec-457e-c2856ad3e191@hisilicon.com>

Hi Will,

On 2020/5/7 15:54, Shaokun Zhang wrote:
> Hi Will,
> 
> On 2020/5/6 19:30, Shaokun Zhang wrote:
>> Hi Will,
>>
>> On 2020/5/6 18:44, Will Deacon wrote:
>>> On Wed, May 06, 2020 at 06:39:16PM +0800, Shaokun Zhang wrote:
>>>> Apologies for my noise, you are right and it's my mistake.
>>>
>>> No need to apologise, but thanks for letting me know.
>>>
>>>> On 2020/5/6 15:53, Will Deacon wrote:
>>>>> On Wed, May 06, 2020 at 03:00:39PM +0800, Shaokun Zhang wrote:
>>>>>> On 2020/5/5 17:15, Will Deacon wrote:
>>>>>>> On Tue, May 05, 2020 at 05:02:35PM +0800, Shaokun Zhang wrote:
>>>>>>>> From: Yuqi Jin <jinyuqi@huawei.com>
>>>>>>>>
>>>>>>>> Since commit addfc38672c7 ("arm64: atomics: avoid out-of-line ll/sc atomics"),
>>>>>>>> it has provided inline implementations of both LSE and ll/sc and used a static
>>>>>>>> key to select between them, which allows the compiler to generate better
>>>>>>>> atomics code.
>>>>>>>> However, xchg still uses the original method which would fail to switch to
>>>>>>>> the atomic instruction correctly, Let's fix this issue.
>>>>>>>
>>>>>>> Please can you elaborate on the failure mode? The current code looks alright
>>>>>>
>>>>>> When enable CONFIG_ARM64_LSE_ATOMICS, xchg is failed to switch to swp instruction
>>>>>> or dynamic replacement instructions are not seen.
>>>>>>
>>>>>> We do some tests on the copy of xchg_tail,:
>>>>>> u32 xchg_tail_my(struct qspinlock *lock, u32 tail)
>>>>>> {
>>>>>>         return (u32)xchg_relaxed(&lock->tail,
>>>>>>                                  tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
>>>>>> }
>>>>>> and the asm code is as follows:
>>>>>>
>>>>>> ffff80001015b050 <xchg_tail_my>:
>>>>>> ffff80001015b050:       a9be7bfd        stp     x29, x30, [sp, #-32]!
>>>>>> ffff80001015b054:       910003fd        mov     x29, sp
>>>>>> ffff80001015b058:       a90153f3        stp     x19, x20, [sp, #16]
>>>>>> ffff80001015b05c:       2a0103f3        mov     w19, w1
>>>>>> ffff80001015b060:       aa0003f4        mov     x20, x0
>>>>>> ffff80001015b064:       aa1e03e0        mov     x0, x30
>>>>>> ffff80001015b068:       97fd07ee        bl      ffff80001009d020 <_mcount>
>>>>>> ffff80001015b06c:       53107e61        lsr     w1, w19, #16
>>>>>> ffff80001015b070:       91000a83        add     x3, x20, #0x2
>>>>>> ffff80001015b074:       f9800071        prfm    pstl1strm, [x3]
>>>>>> ffff80001015b078:       485f7c60        ldxrh   w0, [x3]
>>>>>> ffff80001015b07c:       48027c61        stxrh   w2, w1, [x3]
>>>>>> ffff80001015b080:       35ffffc2        cbnz    w2, ffff80001015b078 <xchg_tail_my+0x28>
>>>>>> ffff80001015b084:       53103c00        lsl     w0, w0, #16
>>>>>> ffff80001015b088:       a94153f3        ldp     x19, x20, [sp, #16]
>>>>>> ffff80001015b08c:       a8c27bfd        ldp     x29, x30, [sp], #32
>>>>>> ffff80001015b090:       d65f03c0        ret
>>>>>
>>>>> This should get patched at runtime, but you're saying that's not happening?
>>>>>
>>>>
>>>> My mistake, I didn't check the runtime carefully.
>>>
>>> Good to hear there's not a bug, but if you see a performance benefit from
>>> using the static-key for xchg() then I'd obviously be open to changing it
>>
>> Thanks your reply, if I follow the two methods correctly, static-key will
>> not consume '__nops(3)', others are the same.
>>
>> I will run some tests to check the performance  ;-)
>>
> 
> We compare the two methods on Huawei Kunpeng920 and the throughput per second
> as follows:
> 
> one core  |without delay| 200ns delay|
> --------------------------------------
> static-key| 55294942    | 3937156    |
> --------------------------------------
> runtime   | 54706282    | 3918188    |
> --------------------------------------
> 

Are you happy to pick up this patch since it has some benefits for single core?  ;-)

Thanks,
Shaokun

> If we run this test using 32-cores, the result is almost the same.
> Test code is followed:
> if(delay_o) {
> 	while (get_cycles() <= (time_temp + t_cnt)) {
> 		(void)atomic64_xchg(&wk_in->num, 1);
> 		myndelay(delay_o);
> 		(void)atomic64_xchg(&wk_in->num, 2);
> 		myndelay(delay_o);
> 		w_cnt+=2;
> 	}
> } else {
> 	while (get_cycles() <= (time_temp + t_cnt)){
> 		(void)atomic64_xchg(&wk_in->num, 1);
> 		(void)atomic64_xchg(&wk_in->num, 2);
> 		w_cnt+=2;
> 	}
> }
> 
> Thanks,
> Shaokun
> 
>> Thanks,
>> Shaokun
>>
>>
>>> over as well.
>>>
>>> Thanks,
>>>
>>> Will
>>>
>>> .
>>>

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel