public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
From: Mark Rutland <mark.rutland@arm.com>
To: Liao Chang <liaochang1@huawei.com>
Cc: catalin.marinas@arm.com, will@kernel.org, ast@kernel.org,
	puranjay@kernel.org, andrii@kernel.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
	bpf@vger.kernel.org
Subject: Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
Date: Thu, 10 Oct 2024 11:52:51 +0100	[thread overview]
Message-ID: <ZweyA3tZc1BiBcb6@J2N7QTR9R3> (raw)
In-Reply-To: <20240909071114.1150053-1-liaochang1@huawei.com>

On Mon, Sep 09, 2024 at 07:11:14AM +0000, Liao Chang wrote:
> v2->v1:
> 1. Remove the simuation of STP and the related bits.
> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>    scenario.
> 
> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> counterintuitive result that nop and push variants are much slower than
> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> which excludes 'nop' and 'stp' from the emulatable instructions list.
> This force the kernel returns to userspace and execute them out-of-line,
> then trapping back to kernel for running uprobe callback functions. This
> leads to a significant performance overhead compared to 'ret' variant,
> which is already emulated.
> 
> Typicall uprobe is installed on 'nop' for USDT and on function entry
> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> and fp into stack regardless kernel or userspace binary. In order to
> improve the performance of handling uprobe for common usecases. This
> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> and 'push'. The benchmark results below indicates the performance gain
> of emulation is obvious.
> 
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> 
> xol (1 cpus)
> ------------
> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> 
> emulation (1 cpus)
> -------------------
> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> 
> As shown above, the performance gap between 'nop/push' and 'ret'
> variants has been significantly reduced. Due to the emulation of 'push'
> instruction needs to access userspace memory, it spent more cycles than
> the other.
> 
> As Mark suggested [1], it is painful to emulate the correct atomicity
> and ordering properties of STP, especially when it interacts with MTE,
> POE, etc. So this patch just focus on the simuation of 'nop'. The
> simluation of STP and related changes will be addressed in a separate
> patch.
> 
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> 
> CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> CC: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  6 ++++++
>  arch/arm64/kernel/probes/decode-insn.c   |  9 +++++++++
>  arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  4 files changed, 27 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 8c0a36f72d6f..dd530d5c3d67 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>  	       aarch64_insn_is_prfm_lit(insn);
>  }
>  
> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> +{
> +	return aarch64_insn_is_hint(insn) &&
> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> +}

Can we please make this:

static __always_inline bool aarch64_insn_is_nop(u32 insn)
{
	return insn == aarch64_insn_gen_nop();
}

That way we don't need to duplicate the encoding details, and it's
"obviously correct".

> +
>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> index 968d5fffe233..be54539e309e 100644
> --- a/arch/arm64/kernel/probes/decode-insn.c
> +++ b/arch/arm64/kernel/probes/decode-insn.c
> @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>  enum probe_insn __kprobes
>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>  {
> +	/*
> +	 * While 'nop' instruction can execute in the out-of-line slot,
> +	 * simulating them in breakpoint handling offers better performance.
> +	 */
> +	if (aarch64_insn_is_nop(insn)) {
> +		api->handler = simulate_nop;
> +		return INSN_GOOD_NO_SLOT;
> +	}
> +
>  	/*
>  	 * Instructions reading or modifying the PC won't work from the XOL
>  	 * slot.
> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> index 22d0b3252476..5e4f887a074c 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.c
> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>  
>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>  }
> +
> +void __kprobes
> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +	/*
> +	 * Compared to instruction_pointer_set(), it offers better
> +	 * compatibility with single-stepping and execution in target
> +	 * guarded memory.
> +	 */
> +	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> +}

Can we please delete the comment? i.e. make this:

	void __kprobes
	simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
	{
		arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
	}

With those two changes:

Acked-by: Mark Rutland <mark.rutland@arm.com>

... and I can go chase up fixing the other issues in this file.

Mark.


> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> index e065dc92218e..efb2803ec943 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.h
> +++ b/arch/arm64/kernel/probes/simulate-insn.h
> @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
> +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
>  
>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> -- 
> 2.34.1
> 
> 


  parent reply	other threads:[~2024-10-10 10:54 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-09  7:11 [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Liao Chang
2024-10-09 23:54 ` Andrii Nakryiko
2024-10-10 10:58   ` Mark Rutland
2024-10-10 22:22     ` Andrii Nakryiko
2024-10-10 10:52 ` Mark Rutland [this message]
2024-10-21 10:44   ` Liao, Chang
2024-10-15 18:58 ` Catalin Marinas
2024-10-21 10:45   ` Liao, Chang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZweyA3tZc1BiBcb6@J2N7QTR9R3 \
    --to=mark.rutland@arm.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=liaochang1@huawei.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-kernel@vger.kernel.org \
    --cc=puranjay@kernel.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox