From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 90D01CF11CD for ; Thu, 10 Oct 2024 10:54:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender:List-Subscribe:List-Help :List-Post:List-Archive:List-Unsubscribe:List-Id:In-Reply-To: Content-Transfer-Encoding:Content-Type:MIME-Version:References:Message-ID: Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Owner; bh=oPbxbvmISrDidUvjpuALcEMm3uiAEHQQSwt8YlkWLzw=; b=lOn1DBQFelL3QKAkpdU61/O8Zv SelIuNgGjZTCLx5MZGn96XtTA3WnemGJy8jof+MESLuYekNTA61AkY87UnDeYgUp1EbfhWsgMCd6v 2nbxdZ5dnMZH8QTSFFUn3JURk5OxwAmYTfMjN/nekXpFiYnQ5K/+GxPUdeKt9lVLNCmskxCdfXO4k Lt67KX/edvS0kZHafxgtTJX+MpobY15Xop2UO1vtwn6YD+fQI+HOpzqUeB7Tjqp5vaqEe8gMFNJtB 8pgxCJ2fFejKDxIsNhHzVj46ckf13SRkCjjh6FXZgqVTb9fSzqtkPPpJpvdwN/hGqS5qo4g/jbzQe /eqfVijw==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux)) id 1syqoU-0000000CTT2-1Ssi; Thu, 10 Oct 2024 10:54:34 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.98 #2 (Red Hat Linux)) id 1syqmu-0000000CT57-3m83 for linux-arm-kernel@lists.infradead.org; Thu, 10 Oct 2024 10:53:00 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id EF394497; Thu, 10 Oct 2024 03:53:25 -0700 (PDT) Received: from J2N7QTR9R3 (usa-sjc-imap-foss1.foss.arm.com [10.121.207.14]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 8F6DE3F58B; Thu, 10 Oct 2024 03:52:54 -0700 (PDT) Date: Thu, 10 Oct 2024 11:52:51 +0100 From: Mark Rutland To: Liao Chang Cc: catalin.marinas@arm.com, will@kernel.org, ast@kernel.org, puranjay@kernel.org, andrii@kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, bpf@vger.kernel.org Subject: Re: [PATCH v2] arm64: insn: Simulate nop instruction for better uprobe performance Message-ID: References: <20240909071114.1150053-1-liaochang1@huawei.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20240909071114.1150053-1-liaochang1@huawei.com> X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20241010_035257_088820_08A31EDC X-CRM114-Status: GOOD ( 30.49 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Mon, Sep 09, 2024 at 07:11:14AM +0000, Liao Chang wrote: > v2->v1: > 1. Remove the simuation of STP and the related bits. > 2. Use arm64_skip_faulting_instruction for single-stepping or FEAT_BTI > scenario. > > As Andrii pointed out, the uprobe/uretprobe selftest bench run into a > counterintuitive result that nop and push variants are much slower than > ret variant [0]. The root cause lies in the arch_probe_analyse_insn(), > which excludes 'nop' and 'stp' from the emulatable instructions list. > This force the kernel returns to userspace and execute them out-of-line, > then trapping back to kernel for running uprobe callback functions. This > leads to a significant performance overhead compared to 'ret' variant, > which is already emulated. > > Typicall uprobe is installed on 'nop' for USDT and on function entry > which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr > and fp into stack regardless kernel or userspace binary. In order to > improve the performance of handling uprobe for common usecases. This > patch supports the emulation of Arm64 equvialents instructions of 'nop' > and 'push'. The benchmark results below indicates the performance gain > of emulation is obvious. > > On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. > > xol (1 cpus) > ------------ > uprobe-nop: 0.916 ± 0.001M/s (0.916M/prod) > uprobe-push: 0.908 ± 0.001M/s (0.908M/prod) > uprobe-ret: 1.855 ± 0.000M/s (1.855M/prod) > uretprobe-nop: 0.640 ± 0.000M/s (0.640M/prod) > uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod) > uretprobe-ret: 0.978 ± 0.003M/s (0.978M/prod) > > emulation (1 cpus) > ------------------- > uprobe-nop: 1.862 ± 0.002M/s (1.862M/prod) > uprobe-push: 1.743 ± 0.006M/s (1.743M/prod) > uprobe-ret: 1.840 ± 0.001M/s (1.840M/prod) > uretprobe-nop: 0.964 ± 0.004M/s (0.964M/prod) > uretprobe-push: 0.936 ± 0.004M/s (0.936M/prod) > uretprobe-ret: 0.940 ± 0.001M/s (0.940M/prod) > > As shown above, the performance gap between 'nop/push' and 'ret' > variants has been significantly reduced. Due to the emulation of 'push' > instruction needs to access userspace memory, it spent more cycles than > the other. > > As Mark suggested [1], it is painful to emulate the correct atomicity > and ordering properties of STP, especially when it interacts with MTE, > POE, etc. So this patch just focus on the simuation of 'nop'. The > simluation of STP and related changes will be addressed in a separate > patch. > > [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ > [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > > CC: Andrii Nakryiko > CC: Mark Rutland > Signed-off-by: Liao Chang > --- > arch/arm64/include/asm/insn.h | 6 ++++++ > arch/arm64/kernel/probes/decode-insn.c | 9 +++++++++ > arch/arm64/kernel/probes/simulate-insn.c | 11 +++++++++++ > arch/arm64/kernel/probes/simulate-insn.h | 1 + > 4 files changed, 27 insertions(+) > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h > index 8c0a36f72d6f..dd530d5c3d67 100644 > --- a/arch/arm64/include/asm/insn.h > +++ b/arch/arm64/include/asm/insn.h > @@ -549,6 +549,12 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn) > aarch64_insn_is_prfm_lit(insn); > } > > +static __always_inline bool aarch64_insn_is_nop(u32 insn) > +{ > + return aarch64_insn_is_hint(insn) && > + ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP); > +} Can we please make this: static __always_inline bool aarch64_insn_is_nop(u32 insn) { return insn == aarch64_insn_gen_nop(); } That way we don't need to duplicate the encoding details, and it's "obviously correct". > + > enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn); > u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn); > u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, > diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c > index 968d5fffe233..be54539e309e 100644 > --- a/arch/arm64/kernel/probes/decode-insn.c > +++ b/arch/arm64/kernel/probes/decode-insn.c > @@ -75,6 +75,15 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn) > enum probe_insn __kprobes > arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api) > { > + /* > + * While 'nop' instruction can execute in the out-of-line slot, > + * simulating them in breakpoint handling offers better performance. > + */ > + if (aarch64_insn_is_nop(insn)) { > + api->handler = simulate_nop; > + return INSN_GOOD_NO_SLOT; > + } > + > /* > * Instructions reading or modifying the PC won't work from the XOL > * slot. > diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c > index 22d0b3252476..5e4f887a074c 100644 > --- a/arch/arm64/kernel/probes/simulate-insn.c > +++ b/arch/arm64/kernel/probes/simulate-insn.c > @@ -200,3 +200,14 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs) > > instruction_pointer_set(regs, instruction_pointer(regs) + 4); > } > + > +void __kprobes > +simulate_nop(u32 opcode, long addr, struct pt_regs *regs) > +{ > + /* > + * Compared to instruction_pointer_set(), it offers better > + * compatibility with single-stepping and execution in target > + * guarded memory. > + */ > + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); > +} Can we please delete the comment? i.e. make this: void __kprobes simulate_nop(u32 opcode, long addr, struct pt_regs *regs) { arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); } With those two changes: Acked-by: Mark Rutland ... and I can go chase up fixing the other issues in this file. Mark. > diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h > index e065dc92218e..efb2803ec943 100644 > --- a/arch/arm64/kernel/probes/simulate-insn.h > +++ b/arch/arm64/kernel/probes/simulate-insn.h > @@ -16,5 +16,6 @@ void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs); > void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs); > void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs); > void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs); > +void simulate_nop(u32 opcode, long addr, struct pt_regs *regs); > > #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */ > -- > 2.34.1 > >