From: Punit Agrawal <punitagrawal@gmail.com>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
<linux-kernel@vger.kernel.org>,
Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls
Date: Tue, 14 Sep 2021 08:01:22 +0900 [thread overview]
Message-ID: <87o88w84hp.fsf@stealth> (raw)
In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com> (Zhen Lei's message of "Fri, 3 Sep 2021 20:19:50 +0800")
Hi Zhen Lei,
Zhen Lei <thunder.leizhen@huawei.com> writes:
> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
>
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
>
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
>
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
> ff0: d503245f bti c
> ff4: d503233f paciasp
> ff8: a9bf7bfd stp x29, x30, [sp, #-16]!
> ffc: 910003fd mov x29, sp
> 1000: d5385201 mrs x1, esr_el1
> 1004: 531a7c22 lsr w2, w1, #26
> 1008: f100545f cmp x2, #0x15
> 100c: 540000a1 b.ne 1020 <el0t_64_sync_handler+0x30>
> 1010: 97fffe14 bl 860 <el0_svc>
> 1014: a8c17bfd ldp x29, x30, [sp], #16
> 1018: d50323bf autiasp
> 101c: d65f03c0 ret
> 1020: f100705f cmp x2, #0x1c
>
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
>
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
>
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds
I was curious about the impact of the patch on other
micro-architectures. Following are the results from using the patch
applied to v5.14 on A72 and A53 on a RK399.
For the A72 -
Before:
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4313 microseconds
After:
Simple syscall: 0.4249 microseconds
Simple syscall: 0.4248 microseconds
Simple syscall: 0.4248 microseconds
For the A53 -
Before:
Simple syscall: 0.4130 microseconds
Simple syscall: 0.4128 microseconds
Simple syscall: 0.4124 microseconds
After:
Simple syscall: 0.4031 microseconds
Simple syscall: 0.4078 microseconds
Simple syscall: 0.4030 microseconds
Although there is a small benefit, they are not as big as on your board
/ micro-architecture.
Were you able to see any impact on real workloads?
I imagine that other code paths in the sync handler would also benefit
from similar special casing - did you try any others. Page fault
handling came to mind.
Overall, I feel a little uneasy about the special casing introduced here
but at the same time see that it does benefit certain workloads.
One more comment below.
>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
> arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
> asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC64:
> + if (likely(ec == ESR_ELx_EC_SVC64)) {
> el0_svc(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;
Please include a big fat comment on why SVC (or any other patch) is
being separated out of the switch case - both here and below.
Thanks,
Punit
> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
> asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC32:
> + if (likely(ec == ESR_ELx_EC_SVC32)) {
> el0_svc_compat(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
WARNING: multiple messages have this Message-ID (diff)
From: Punit Agrawal <punitagrawal@gmail.com>
To: Zhen Lei <thunder.leizhen@huawei.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
linux-arm-kernel <linux-arm-kernel@lists.infradead.org>,
<linux-kernel@vger.kernel.org>,
Mark Rutland <mark.rutland@arm.com>
Subject: Re: [PATCH] arm64: entry: Improve the performance of system calls
Date: Tue, 14 Sep 2021 08:01:22 +0900 [thread overview]
Message-ID: <87o88w84hp.fsf@stealth> (raw)
In-Reply-To: <20210903121950.2284-1-thunder.leizhen@huawei.com> (Zhen Lei's message of "Fri, 3 Sep 2021 20:19:50 +0800")
Hi Zhen Lei,
Zhen Lei <thunder.leizhen@huawei.com> writes:
> Commit 582f95835a8f ("arm64: entry: convert el0_sync to C") converted lots
> of functions from assembly to C, this greatly improves readability. But
> el0_svc()/el0_svc_compat() is in response to system call requests from
> user mode and may be in the hot path.
>
> Although the SVC is in the first case of the switch statement in C, the
> compiler optimizes the switch statement as a whole, and does not give SVC
> a small boost.
>
> Use "likely()" to help SVC directly invoke its handler after a simple
> judgment to avoid entering the switch table lookup process.
>
> After:
> 0000000000000ff0 <el0t_64_sync_handler>:
> ff0: d503245f bti c
> ff4: d503233f paciasp
> ff8: a9bf7bfd stp x29, x30, [sp, #-16]!
> ffc: 910003fd mov x29, sp
> 1000: d5385201 mrs x1, esr_el1
> 1004: 531a7c22 lsr w2, w1, #26
> 1008: f100545f cmp x2, #0x15
> 100c: 540000a1 b.ne 1020 <el0t_64_sync_handler+0x30>
> 1010: 97fffe14 bl 860 <el0_svc>
> 1014: a8c17bfd ldp x29, x30, [sp], #16
> 1018: d50323bf autiasp
> 101c: d65f03c0 ret
> 1020: f100705f cmp x2, #0x1c
>
> Execute "./lat_syscall null" on my board (BogoMIPS : 200.00), it can save
> about 10ns.
>
> Before:
> Simple syscall: 0.2365 microseconds
> Simple syscall: 0.2354 microseconds
> Simple syscall: 0.2339 microseconds
>
> After:
> Simple syscall: 0.2255 microseconds
> Simple syscall: 0.2254 microseconds
> Simple syscall: 0.2256 microseconds
I was curious about the impact of the patch on other
micro-architectures. Following are the results from using the patch
applied to v5.14 on A72 and A53 on a RK399.
For the A72 -
Before:
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4311 microseconds
Simple syscall: 0.4313 microseconds
After:
Simple syscall: 0.4249 microseconds
Simple syscall: 0.4248 microseconds
Simple syscall: 0.4248 microseconds
For the A53 -
Before:
Simple syscall: 0.4130 microseconds
Simple syscall: 0.4128 microseconds
Simple syscall: 0.4124 microseconds
After:
Simple syscall: 0.4031 microseconds
Simple syscall: 0.4078 microseconds
Simple syscall: 0.4030 microseconds
Although there is a small benefit, they are not as big as on your board
/ micro-architecture.
Were you able to see any impact on real workloads?
I imagine that other code paths in the sync handler would also benefit
from similar special casing - did you try any others. Page fault
handling came to mind.
Overall, I feel a little uneasy about the special casing introduced here
but at the same time see that it does benefit certain workloads.
One more comment below.
>
> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
> ---
> arch/arm64/kernel/entry-common.c | 18 ++++++++++++------
> 1 file changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
> index 32f9796c4ffe77b..062eb5a895ec6f3 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -607,11 +607,14 @@ static void noinstr el0_fpac(struct pt_regs *regs, unsigned long esr)
> asmlinkage void noinstr el0t_64_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC64:
> + if (likely(ec == ESR_ELx_EC_SVC64)) {
> el0_svc(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;
Please include a big fat comment on why SVC (or any other patch) is
being separated out of the switch case - both here and below.
Thanks,
Punit
> @@ -730,11 +733,14 @@ static void noinstr el0_svc_compat(struct pt_regs *regs)
> asmlinkage void noinstr el0t_32_sync_handler(struct pt_regs *regs)
> {
> unsigned long esr = read_sysreg(esr_el1);
> + unsigned long ec = ESR_ELx_EC(esr);
>
> - switch (ESR_ELx_EC(esr)) {
> - case ESR_ELx_EC_SVC32:
> + if (likely(ec == ESR_ELx_EC_SVC32)) {
> el0_svc_compat(regs);
> - break;
> + return;
> + }
> +
> + switch (ec) {
> case ESR_ELx_EC_DABT_LOW:
> el0_da(regs, esr);
> break;
next prev parent reply other threads:[~2021-09-13 23:03 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-03 12:19 [PATCH] arm64: entry: Improve the performance of system calls Zhen Lei
2021-09-03 12:19 ` Zhen Lei
2021-09-13 23:01 ` Punit Agrawal [this message]
2021-09-13 23:01 ` Punit Agrawal
2021-09-14 2:01 ` Leizhen (ThunderTown)
2021-09-14 2:01 ` Leizhen (ThunderTown)
2021-09-14 9:55 ` Mark Rutland
2021-09-14 9:55 ` Mark Rutland
2021-09-14 11:23 ` Leizhen (ThunderTown)
2021-09-14 11:23 ` Leizhen (ThunderTown)
2021-09-14 11:48 ` Leizhen (ThunderTown)
2021-09-14 11:48 ` Leizhen (ThunderTown)
2021-09-14 15:17 ` Joey Gouly
2021-09-14 15:17 ` Joey Gouly
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87o88w84hp.fsf@stealth \
--to=punitagrawal@gmail.com \
--cc=catalin.marinas@arm.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=thunder.leizhen@huawei.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.