[PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance

public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed

* [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
@ 2024-09-09  7:11 Liao Chang
  2024-10-09 23:54 ` Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Liao Chang @ 2024-09-09  7:11 UTC (permalink / raw)
  To: catalin.marinas, will, ast, puranjay, liaochang1, andrii,
	mark.rutland
  Cc: linux-arm-kernel, linux-kernel, linux-trace-kernel, bpf

v2->v1:
1. Remove the simuation of STP and the related bits.
2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
   scenario.

As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
counterintuitive result that nop and push variants are much slower than
ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
which excludes 'nop' and 'stp' from the emulatable instructions list.
This force the kernel returns to userspace and execute them out-of-line,
then trapping back to kernel for running uprobe callback functions. This
leads to a significant performance overhead compared to 'ret' variant,
which is already emulated.

Typicall uprobe is installed on 'nop' for USDT and on function entry
which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
and fp into stack regardless kernel or userspace binary. In order to
improve the performance of handling uprobe for common usecases. This
patch supports the emulation of Arm64 equvialents instructions of 'nop'
and 'push'. The benchmark results below indicates the performance gain
of emulation is obvious.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)

As shown above, the performance gap between 'nop/push' and 'ret'
variants has been significantly reduced. Due to the emulation of 'push'
instruction needs to access userspace memory, it spent more cycles than
the other.

As Mark suggested [1], it is painful to emulate the correct atomicity
and ordering properties of STP, especially when it interacts with MTE,
POE, etc. So this patch just focus on the simuation of 'nop'. The
simluation of STP and related changes will be addressed in a separate
patch.

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/

CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/include/asm/insn.h            |  6 ++++++
 arch/arm64/kernel/probes/decode-insn.c   |  9 +++++++++
 arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++
 arch/arm64/kernel/probes/simulate-insn.h |  1 +
 4 files changed, 27 insertions(+)

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..dd530d5c3d67 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_nop(u32 insn)
+{
+	return aarch64_insn_is_hint(insn) &&
+	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
index 968d5fffe233..be54539e309e 100644
--- a/arch/arm64/kernel/probes/decode-insn.c
+++ b/arch/arm64/kernel/probes/decode-insn.c
@@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
 enum probe_insn __kprobes
 arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
 {
+	/*
+	 * While 'nop' instruction can execute in the out-of-line slot,
+	 * simulating them in breakpoint handling offers better performance.
+	 */
+	if (aarch64_insn_is_nop(insn)) {
+		api->handler = simulate_nop;
+		return INSN_GOOD_NO_SLOT;
+	}
+
 	/*
 	 * Instructions reading or modifying the PC won't work from the XOL
 	 * slot.
diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
index 22d0b3252476..5e4f887a074c 100644
--- a/arch/arm64/kernel/probes/simulate-insn.c
+++ b/arch/arm64/kernel/probes/simulate-insn.c
@@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
 
 	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
 }
+
+void __kprobes
+simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
+{
+	/*
+	 * Compared to instruction_pointer_set(), it offers better
+	 * compatibility with single-stepping and execution in target
+	 * guarded memory.
+	 */
+	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+}
diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
index e065dc92218e..efb2803ec943 100644
--- a/arch/arm64/kernel/probes/simulate-insn.h
+++ b/arch/arm64/kernel/probes/simulate-insn.h
@@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
 
 #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-09-09  7:11 [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Liao Chang
@ 2024-10-09 23:54 ` Andrii Nakryiko
  2024-10-10 10:58   ` Mark Rutland
  2024-10-10 10:52 ` Mark Rutland
  2024-10-15 18:58 ` Catalin Marinas
  2 siblings, 1 reply; 8+ messages in thread
From: Andrii Nakryiko @ 2024-10-09 23:54 UTC (permalink / raw)
  To: Liao Chang, will, mark.rutland
  Cc: catalin.marinas, ast, puranjay, andrii, linux-arm-kernel,
	linux-kernel, linux-trace-kernel, bpf

On Mon, Sep 9, 2024 at 12:21 AM Liao Chang <liaochang1@huawei.com> wrote:
>
> v2->v1:
> 1. Remove the simuation of STP and the related bits.
> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>    scenario.
>
> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> counterintuitive result that nop and push variants are much slower than
> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> which excludes 'nop' and 'stp' from the emulatable instructions list.
> This force the kernel returns to userspace and execute them out-of-line,
> then trapping back to kernel for running uprobe callback functions. This
> leads to a significant performance overhead compared to 'ret' variant,
> which is already emulated.
>
> Typicall uprobe is installed on 'nop' for USDT and on function entry
> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> and fp into stack regardless kernel or userspace binary. In order to
> improve the performance of handling uprobe for common usecases. This
> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> and 'push'. The benchmark results below indicates the performance gain
> of emulation is obvious.
>
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>
> xol (1 cpus)
> ------------
> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>
> emulation (1 cpus)
> -------------------
> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>
> As shown above, the performance gap between 'nop/push' and 'ret'
> variants has been significantly reduced. Due to the emulation of 'push'
> instruction needs to access userspace memory, it spent more cycles than
> the other.
>
> As Mark suggested [1], it is painful to emulate the correct atomicity
> and ordering properties of STP, especially when it interacts with MTE,
> POE, etc. So this patch just focus on the simuation of 'nop'. The
> simluation of STP and related changes will be addressed in a separate
> patch.
>
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>
> CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> CC: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  6 ++++++
>  arch/arm64/kernel/probes/decode-insn.c   |  9 +++++++++
>  arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  4 files changed, 27 insertions(+)
>

I'm curious what's the status of this patch? It received no comments
so far in the last month. Can someone on the ARM64 side of things
please take a look? (or maybe it was applied to some tree and there
was just no notification?)

This is a very useful performance optimization for uprobe tracing on
ARM64, so would be nice to get it in during current release cycle.
Thank you!

> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 8c0a36f72d6f..dd530d5c3d67 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>                aarch64_insn_is_prfm_lit(insn);
>  }
>
> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> +{
> +       return aarch64_insn_is_hint(insn) &&
> +              ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> +}
> +
>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> index 968d5fffe233..be54539e309e 100644
> --- a/arch/arm64/kernel/probes/decode-insn.c
> +++ b/arch/arm64/kernel/probes/decode-insn.c
> @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>  enum probe_insn __kprobes
>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>  {
> +       /*
> +        * While 'nop' instruction can execute in the out-of-line slot,
> +        * simulating them in breakpoint handling offers better performance.
> +        */
> +       if (aarch64_insn_is_nop(insn)) {
> +               api->handler = simulate_nop;
> +               return INSN_GOOD_NO_SLOT;
> +       }
> +
>         /*
>          * Instructions reading or modifying the PC won't work from the XOL
>          * slot.
> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> index 22d0b3252476..5e4f887a074c 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.c
> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>
>         instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>  }
> +
> +void __kprobes
> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +       /*
> +        * Compared to instruction_pointer_set(), it offers better
> +        * compatibility with single-stepping and execution in target
> +        * guarded memory.
> +        */
> +       arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> +}
> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> index e065dc92218e..efb2803ec943 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.h
> +++ b/arch/arm64/kernel/probes/simulate-insn.h
> @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
> +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
>
>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> --
> 2.34.1
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-10-09 23:54 ` Andrii Nakryiko
@ 2024-10-10 10:58   ` Mark Rutland
  2024-10-10 22:22     ` Andrii Nakryiko
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2024-10-10 10:58 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Liao Chang, will, catalin.marinas, ast, puranjay, andrii,
	linux-arm-kernel, linux-kernel, linux-trace-kernel, bpf

Hi Andrii,

On Wed, Oct 09, 2024 at 04:54:25PM -0700, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 12:21 AM Liao Chang <liaochang1@huawei.com> wrote:

> I'm curious what's the status of this patch? It received no comments
> so far in the last month. Can someone on the ARM64 side of things
> please take a look? (or maybe it was applied to some tree and there
> was just no notification?)
> 
> This is a very useful performance optimization for uprobe tracing on
> ARM64, so would be nice to get it in during current release cycle.
> Thank you!

Sorry, I got busy chasing up a bunch of bugs and hadn't gotten round to
this yet.

I've replied with a couple of minor comments and an ack, and I reckon we
can queue this up this cycle. Usually this sort of thing starts to get
queued around -rc3.

Mark.

> 
> > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > index 8c0a36f72d6f..dd530d5c3d67 100644
> > --- a/arch/arm64/include/asm/insn.h
> > +++ b/arch/arm64/include/asm/insn.h
> > @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> >                aarch64_insn_is_prfm_lit(insn);
> >  }
> >
> > +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> > +{
> > +       return aarch64_insn_is_hint(insn) &&
> > +              ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> > +}
> > +
> >  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> >  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> >  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> > diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> > index 968d5fffe233..be54539e309e 100644
> > --- a/arch/arm64/kernel/probes/decode-insn.c
> > +++ b/arch/arm64/kernel/probes/decode-insn.c
> > @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> >  enum probe_insn __kprobes
> >  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> >  {
> > +       /*
> > +        * While 'nop' instruction can execute in the out-of-line slot,
> > +        * simulating them in breakpoint handling offers better performance.
> > +        */
> > +       if (aarch64_insn_is_nop(insn)) {
> > +               api->handler = simulate_nop;
> > +               return INSN_GOOD_NO_SLOT;
> > +       }
> > +
> >         /*
> >          * Instructions reading or modifying the PC won't work from the XOL
> >          * slot.
> > diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> > index 22d0b3252476..5e4f887a074c 100644
> > --- a/arch/arm64/kernel/probes/simulate-insn.c
> > +++ b/arch/arm64/kernel/probes/simulate-insn.c
> > @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> >
> >         instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >  }
> > +
> > +void __kprobes
> > +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> > +{
> > +       /*
> > +        * Compared to instruction_pointer_set(), it offers better
> > +        * compatibility with single-stepping and execution in target
> > +        * guarded memory.
> > +        */
> > +       arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> > +}
> > diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> > index e065dc92218e..efb2803ec943 100644
> > --- a/arch/arm64/kernel/probes/simulate-insn.h
> > +++ b/arch/arm64/kernel/probes/simulate-insn.h
> > @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
> >  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
> >  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
> >  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
> > +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
> >
> >  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> > --
> > 2.34.1
> >


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-10-10 10:58   ` Mark Rutland
@ 2024-10-10 22:22     ` Andrii Nakryiko
  0 siblings, 0 replies; 8+ messages in thread
From: Andrii Nakryiko @ 2024-10-10 22:22 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Liao Chang, will, catalin.marinas, ast, puranjay, andrii,
	linux-arm-kernel, linux-kernel, linux-trace-kernel, bpf

On Thu, Oct 10, 2024 at 3:58 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Andrii,
>
> On Wed, Oct 09, 2024 at 04:54:25PM -0700, Andrii Nakryiko wrote:
> > On Mon, Sep 9, 2024 at 12:21 AM Liao Chang <liaochang1@huawei.com> wrote:
>
> > I'm curious what's the status of this patch? It received no comments
> > so far in the last month. Can someone on the ARM64 side of things
> > please take a look? (or maybe it was applied to some tree and there
> > was just no notification?)
> >
> > This is a very useful performance optimization for uprobe tracing on
> > ARM64, so would be nice to get it in during current release cycle.
> > Thank you!
>
> Sorry, I got busy chasing up a bunch of bugs and hadn't gotten round to
> this yet.
>
> I've replied with a couple of minor comments and an ack, and I reckon we
> can queue this up this cycle. Usually this sort of thing starts to get
> queued around -rc3.

Thanks Mark! I'm happy to backport it internally before it goes into
official kernel release, as long as it's clear that the patch is in
the final state. So once Liao posts a new version with your ack, I'll
just go ahead and use it internally.

When you get a chance, please also take another look at Liao's second
optimization targeting STP instruction. I know it was more
controversial, but hopefully we can arrive at some maintainable
solution that would still benefit a very common uprobe tracing use
case. Thanks in advance!

  [0] https://lore.kernel.org/linux-trace-kernel/20240910060407.1427716-1-liaochang1@huawei.com/

>
> Mark.
>
> >
> > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > > index 8c0a36f72d6f..dd530d5c3d67 100644
> > > --- a/arch/arm64/include/asm/insn.h
> > > +++ b/arch/arm64/include/asm/insn.h
> > > @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> > >                aarch64_insn_is_prfm_lit(insn);
> > >  }
> > >
> > > +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> > > +{
> > > +       return aarch64_insn_is_hint(insn) &&
> > > +              ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> > > +}
> > > +
> > >  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> > >  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> > >  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> > > diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> > > index 968d5fffe233..be54539e309e 100644
> > > --- a/arch/arm64/kernel/probes/decode-insn.c
> > > +++ b/arch/arm64/kernel/probes/decode-insn.c
> > > @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> > >  enum probe_insn __kprobes
> > >  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> > >  {
> > > +       /*
> > > +        * While 'nop' instruction can execute in the out-of-line slot,
> > > +        * simulating them in breakpoint handling offers better performance.
> > > +        */
> > > +       if (aarch64_insn_is_nop(insn)) {
> > > +               api->handler = simulate_nop;
> > > +               return INSN_GOOD_NO_SLOT;
> > > +       }
> > > +
> > >         /*
> > >          * Instructions reading or modifying the PC won't work from the XOL
> > >          * slot.
> > > diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> > > index 22d0b3252476..5e4f887a074c 100644
> > > --- a/arch/arm64/kernel/probes/simulate-insn.c
> > > +++ b/arch/arm64/kernel/probes/simulate-insn.c
> > > @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> > >
> > >         instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> > >  }
> > > +
> > > +void __kprobes
> > > +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> > > +{
> > > +       /*
> > > +        * Compared to instruction_pointer_set(), it offers better
> > > +        * compatibility with single-stepping and execution in target
> > > +        * guarded memory.
> > > +        */
> > > +       arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> > > +}
> > > diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> > > index e065dc92218e..efb2803ec943 100644
> > > --- a/arch/arm64/kernel/probes/simulate-insn.h
> > > +++ b/arch/arm64/kernel/probes/simulate-insn.h
> > > @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
> > >  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
> > >  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
> > >  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
> > > +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
> > >
> > >  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> > > --
> > > 2.34.1
> > >


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-09-09  7:11 [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Liao Chang
  2024-10-09 23:54 ` Andrii Nakryiko
@ 2024-10-10 10:52 ` Mark Rutland
  2024-10-21 10:44   ` Liao, Chang
  2024-10-15 18:58 ` Catalin Marinas
  2 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2024-10-10 10:52 UTC (permalink / raw)
  To: Liao Chang
  Cc: catalin.marinas, will, ast, puranjay, andrii, linux-arm-kernel,
	linux-kernel, linux-trace-kernel, bpf

On Mon, Sep 09, 2024 at 07:11:14AM +0000, Liao Chang wrote:
> v2->v1:
> 1. Remove the simuation of STP and the related bits.
> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>    scenario.
> 
> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> counterintuitive result that nop and push variants are much slower than
> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> which excludes 'nop' and 'stp' from the emulatable instructions list.
> This force the kernel returns to userspace and execute them out-of-line,
> then trapping back to kernel for running uprobe callback functions. This
> leads to a significant performance overhead compared to 'ret' variant,
> which is already emulated.
> 
> Typicall uprobe is installed on 'nop' for USDT and on function entry
> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> and fp into stack regardless kernel or userspace binary. In order to
> improve the performance of handling uprobe for common usecases. This
> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> and 'push'. The benchmark results below indicates the performance gain
> of emulation is obvious.
> 
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> 
> xol (1 cpus)
> ------------
> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> 
> emulation (1 cpus)
> -------------------
> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> 
> As shown above, the performance gap between 'nop/push' and 'ret'
> variants has been significantly reduced. Due to the emulation of 'push'
> instruction needs to access userspace memory, it spent more cycles than
> the other.
> 
> As Mark suggested [1], it is painful to emulate the correct atomicity
> and ordering properties of STP, especially when it interacts with MTE,
> POE, etc. So this patch just focus on the simuation of 'nop'. The
> simluation of STP and related changes will be addressed in a separate
> patch.
> 
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> 
> CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> CC: Mark Rutland <mark.rutland@arm.com>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  6 ++++++
>  arch/arm64/kernel/probes/decode-insn.c   |  9 +++++++++
>  arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  4 files changed, 27 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 8c0a36f72d6f..dd530d5c3d67 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>  	       aarch64_insn_is_prfm_lit(insn);
>  }
>  
> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> +{
> +	return aarch64_insn_is_hint(insn) &&
> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> +}

Can we please make this:

static __always_inline bool aarch64_insn_is_nop(u32 insn)
{
	return insn == aarch64_insn_gen_nop();
}

That way we don't need to duplicate the encoding details, and it's
"obviously correct".

> +
>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> index 968d5fffe233..be54539e309e 100644
> --- a/arch/arm64/kernel/probes/decode-insn.c
> +++ b/arch/arm64/kernel/probes/decode-insn.c
> @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>  enum probe_insn __kprobes
>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>  {
> +	/*
> +	 * While 'nop' instruction can execute in the out-of-line slot,
> +	 * simulating them in breakpoint handling offers better performance.
> +	 */
> +	if (aarch64_insn_is_nop(insn)) {
> +		api->handler = simulate_nop;
> +		return INSN_GOOD_NO_SLOT;
> +	}
> +
>  	/*
>  	 * Instructions reading or modifying the PC won't work from the XOL
>  	 * slot.
> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> index 22d0b3252476..5e4f887a074c 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.c
> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>  
>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>  }
> +
> +void __kprobes
> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +	/*
> +	 * Compared to instruction_pointer_set(), it offers better
> +	 * compatibility with single-stepping and execution in target
> +	 * guarded memory.
> +	 */
> +	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> +}

Can we please delete the comment? i.e. make this:

	void __kprobes
	simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
	{
		arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
	}

With those two changes:

Acked-by: Mark Rutland <mark.rutland@arm.com>

... and I can go chase up fixing the other issues in this file.

Mark.


> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
> index e065dc92218e..efb2803ec943 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.h
> +++ b/arch/arm64/kernel/probes/simulate-insn.h
> @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
> +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
>  
>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
> -- 
> 2.34.1
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-10-10 10:52 ` Mark Rutland
@ 2024-10-21 10:44   ` Liao, Chang
  0 siblings, 0 replies; 8+ messages in thread
From: Liao, Chang @ 2024-10-21 10:44 UTC (permalink / raw)
  To: Mark Rutland
  Cc: catalin.marinas, will, ast, puranjay, andrii, linux-arm-kernel,
	linux-kernel, linux-trace-kernel, bpf



在 2024/10/10 18:52, Mark Rutland 写道:
> On Mon, Sep 09, 2024 at 07:11:14AM +0000, Liao Chang wrote:
>> v2->v1:
>> 1. Remove the simuation of STP and the related bits.
>> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>>    scenario.
>>
>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>> counterintuitive result that nop and push variants are much slower than
>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>> This force the kernel returns to userspace and execute them out-of-line,
>> then trapping back to kernel for running uprobe callback functions. This
>> leads to a significant performance overhead compared to 'ret' variant,
>> which is already emulated.
>>
>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>> and fp into stack regardless kernel or userspace binary. In order to
>> improve the performance of handling uprobe for common usecases. This
>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>> and 'push'. The benchmark results below indicates the performance gain
>> of emulation is obvious.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> xol (1 cpus)
>> ------------
>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>
>> emulation (1 cpus)
>> -------------------
>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>
>> As shown above, the performance gap between 'nop/push' and 'ret'
>> variants has been significantly reduced. Due to the emulation of 'push'
>> instruction needs to access userspace memory, it spent more cycles than
>> the other.
>>
>> As Mark suggested [1], it is painful to emulate the correct atomicity
>> and ordering properties of STP, especially when it interacts with MTE,
>> POE, etc. So this patch just focus on the simuation of 'nop'. The
>> simluation of STP and related changes will be addressed in a separate
>> patch.
>>
>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
>>
>> CC: Andrii Nakryiko <andrii.nakryiko@gmail.com>
>> CC: Mark Rutland <mark.rutland@arm.com>
>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>> ---
>>  arch/arm64/include/asm/insn.h            |  6 ++++++
>>  arch/arm64/kernel/probes/decode-insn.c   |  9 +++++++++
>>  arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++
>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>>  4 files changed, 27 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index 8c0a36f72d6f..dd530d5c3d67 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>  	       aarch64_insn_is_prfm_lit(insn);
>>  }
>>  
>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>> +{
>> +	return aarch64_insn_is_hint(insn) &&
>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>> +}
> 
> Can we please make this:
> 
> static __always_inline bool aarch64_insn_is_nop(u32 insn)
> {
> 	return insn == aarch64_insn_gen_nop();
> }
> 
> That way we don't need to duplicate the encoding details, and it's
> "obviously correct".

Absolutely agree.

> 
>> +
>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>> index 968d5fffe233..be54539e309e 100644
>> --- a/arch/arm64/kernel/probes/decode-insn.c
>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>> @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>  enum probe_insn __kprobes
>>  arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>  {
>> +	/*
>> +	 * While 'nop' instruction can execute in the out-of-line slot,
>> +	 * simulating them in breakpoint handling offers better performance.
>> +	 */
>> +	if (aarch64_insn_is_nop(insn)) {
>> +		api->handler = simulate_nop;
>> +		return INSN_GOOD_NO_SLOT;
>> +	}
>> +
>>  	/*
>>  	 * Instructions reading or modifying the PC won't work from the XOL
>>  	 * slot.
>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>> index 22d0b3252476..5e4f887a074c 100644
>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>> @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>  
>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>  }
>> +
>> +void __kprobes
>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>> +{
>> +	/*
>> +	 * Compared to instruction_pointer_set(), it offers better
>> +	 * compatibility with single-stepping and execution in target
>> +	 * guarded memory.
>> +	 */
>> +	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>> +}
> 
> Can we please delete the comment? i.e. make this:
> 
> 	void __kprobes
> 	simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> 	{
> 		arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> 	}
> 
> With those two changes:

I wrote this comment to ensure I don't misunderstand the usage of
arm64_skip_faulting_instruction() here. This comment should be removed from
the final patch.

> 
> Acked-by: Mark Rutland <mark.rutland@arm.com>
> 
> ... and I can go chase up fixing the other issues in this file.
> 
> Mark.
> 
> 
>> diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
>> index e065dc92218e..efb2803ec943 100644
>> --- a/arch/arm64/kernel/probes/simulate-insn.h
>> +++ b/arch/arm64/kernel/probes/simulate-insn.h
>> @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
>>  void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
>> +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
>>  
>>  #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
>> -- 
>> 2.34.1
>>
>>
> 

-- 
BR
Liao, Chang



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-09-09  7:11 [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Liao Chang
  2024-10-09 23:54 ` Andrii Nakryiko
  2024-10-10 10:52 ` Mark Rutland
@ 2024-10-15 18:58 ` Catalin Marinas
  2024-10-21 10:45   ` Liao, Chang
  2 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2024-10-15 18:58 UTC (permalink / raw)
  To: will, ast, puranjay, andrii, mark.rutland, Liao Chang
  Cc: linux-arm-kernel, linux-kernel, linux-trace-kernel, bpf

On Mon, 09 Sep 2024 07:11:14 +0000, Liao Chang wrote:
> v2->v1:
> 1. Remove the simuation of STP and the related bits.
> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>    scenario.
> 
> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> counterintuitive result that nop and push variants are much slower than
> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> which excludes 'nop' and 'stp' from the emulatable instructions list.
> This force the kernel returns to userspace and execute them out-of-line,
> then trapping back to kernel for running uprobe callback functions. This
> leads to a significant performance overhead compared to 'ret' variant,
> which is already emulated.
> 
> [...]

Applied to arm64 (for-next/probes), thanks! I fixed it up according to
Mark's comments.

[1/1] arm64: insn: Simulate nop instruction for better uprobe performance
      https://git.kernel.org/arm64/c/ac4ad5c09b34

-- 
Catalin



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance
  2024-10-15 18:58 ` Catalin Marinas
@ 2024-10-21 10:45   ` Liao, Chang
  0 siblings, 0 replies; 8+ messages in thread
From: Liao, Chang @ 2024-10-21 10:45 UTC (permalink / raw)
  To: Catalin Marinas, will, ast, puranjay, andrii, mark.rutland
  Cc: linux-arm-kernel, linux-kernel, linux-trace-kernel, bpf

在 2024/10/16 2:58, Catalin Marinas 写道:
> On Mon, 09 Sep 2024 07:11:14 +0000, Liao Chang wrote:
>> v2->v1:
>> 1. Remove the simuation of STP and the related bits.
>> 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI
>>    scenario.
>>
>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>> counterintuitive result that nop and push variants are much slower than
>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>> This force the kernel returns to userspace and execute them out-of-line,
>> then trapping back to kernel for running uprobe callback functions. This
>> leads to a significant performance overhead compared to 'ret' variant,
>> which is already emulated.
>>
>> [...]
> 
> Applied to arm64 (for-next/probes), thanks! I fixed it up according to
> Mark's comments.
> 
> [1/1] arm64: insn: Simulate nop instruction for better uprobe performance
>       https://git.kernel.org/arm64/c/ac4ad5c09b34
> 

Mark, Catalin and Andrii,

I am just back from a long vacation, thanks for reviewing and involvement for
this patch.

I've sent a patch [1] that simulates STP at function entry, It maps user
stack pages to kernel address space, allowing kernel to use STP directly
to push fp/lr onto stack. Unfortunately, the profiling results below show
reveals this approach increases the uprobe-push throughput by 29.3% (from
0.868M/s/cpu to 1.1238M/s/cpu) and uretprobe-push by 15.9% (from 0.616M/s/cpu
to 0.714M/s/cpu). As Andrii pointed out, this approach is a bit complex and
overkill for STP simluation. So I look forward to more input about this patch,
is it possible to reach a better result? Or should I pause this work for now
and wait for Arm64 to add some instruction for storing pairs of registers to
unprivileged memory in privileged exception level? Thanks.

xol-stp
-------
uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)

simulated-stp
-------------
uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)

[1] https://lore.kernel.org/all/20240910060407.1427716-1-liaochang1@huawei.com/

-- 
BR
Liao, Chang

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-10-21 10:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-09-09  7:11 [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Liao Chang
2024-10-09 23:54 ` Andrii Nakryiko
2024-10-10 10:58   ` Mark Rutland
2024-10-10 22:22     ` Andrii Nakryiko
2024-10-10 10:52 ` Mark Rutland
2024-10-21 10:44   ` Liao, Chang
2024-10-15 18:58 ` Catalin Marinas
2024-10-21 10:45   ` Liao, Chang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox