Date: Tue, 12 Mar 2024 13:42:28 +0000
From: Mark Rutland
To: Björn Töpel
Cc: Puranjay Mohan, Paul Walmsley, Palmer Dabbelt, Albert Ou,
    Steven Rostedt, Masami Hiramatsu, Sami Tolvanen, Guo Ren,
    Ley Foon Tan, Deepak Gupta, Sia Jee Heng, Björn Töpel, Song Shuai,
    Clément Léger, Al Viro, Jisheng Zhang,
    linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org,
    linux-trace-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
References: <20240306165904.108141-1-puranjay12@gmail.com>
 <87ttlhdeqb.fsf@all.your.base.are.belong.to.us>
In-Reply-To: <87ttlhdeqb.fsf@all.your.base.are.belong.to.us>

Hi Bjorn (apologies, my corporate mail server has butchered your name
here).

There's a big info dump below; I realise this sounds like a sales pitch
for CALL_OPS, but my intent is more to say "here are some dragons you
may not have spotted".

On Thu, Mar 07, 2024 at 08:27:40PM +0100, Björn Töpel wrote:
> Puranjay!
>
> Puranjay Mohan writes:
>
> > This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
> > This allows each ftrace callsite to provide an ftrace_ops to the common
> > ftrace trampoline, allowing each callsite to invoke distinct tracer
> > functions without the need to fall back to list processing or to
> > allocate custom trampolines for each callsite.
> > This significantly speeds
> > up cases where multiple distinct trace functions are used and callsites
> > are mostly traced by a single tracer.
> >
> > The idea and most of the implementation is taken from the arm64
> > implementation of the same feature. The idea is to place a pointer to
> > the ftrace_ops as a literal at a fixed offset from the function entry
> > point, which can be recovered by the common ftrace trampoline.
>
> Not really a review, but some more background; another rationale (on
> top of the improved per-call performance!) for CALL_OPS was to use it
> to build ftrace direct call support (which BPF uses a lot!). Mark,
> please correct me if I'm lying here!

Yep; it gives us the ability to do a number of per-callsite things,
including direct calls.

> On Arm64, CALL_OPS makes it possible to implement direct calls, while
> only patching one BL instruction -- nice!

The key thing here isn't that we patch a single instruction (since we
have to patch the ops pointer too!); it's that we can safely patch
either of the ops pointer or the BL/NOP at any time while threads are
concurrently executing.

If you have a multi-instruction sequence, then threads can be preempted
mid-sequence, and it's very painful/complex to handle all of the races
that entails.

For example, if your callsites use a sequence:

	AUIPC <tmp>, <funcptr>
	JALR <link>, <funcptr>(<tmp>)

Using stop_machine() won't allow you to patch that safely as some
threads could be stuck mid-sequence, e.g.

	AUIPC <tmp>, <funcptr>

	[ preempted here ]

	JALR <link>, <funcptr>(<tmp>)

... and you can't update the JALR to use a new funcptr immediate until
those threads have completed the sequence.

There are ways around that, but they're complicated and/or expensive,
e.g.

* Use a sequence of multiple patches, starting with replacing the JALR
  with an exception-generating instruction with a fixup handler, which
  is sort-of what x86 does with UD2.
  This may require multiple passes with synchronize_rcu_tasks() to make
  sure all threads have seen the latest instructions, and that cannot be
  done under stop_machine(), so if you need stop_machine() for CMODX
  reasons, you may need to use that several times with intervening calls
  to synchronize_rcu_tasks().

* Have the patching logic manually go over each thread and fix up the
  pt_regs for the interrupted thread. This is pretty horrid since you
  could have nested exceptions and a task could have several pt_regs
  which might require updating.

The CALL_OPS approach is a bit easier to deal with as we can patch the
per-callsite pointer atomically, then we can (possibly) enable/disable
the callsite's branch, then wait once for threads to drain.

As a heads-up, there are some latent/generic issues with DYNAMIC_FTRACE
generally in this area (CALL_OPS happens to side-step those, but
trampoline usage is currently affected):

  https://lore.kernel.org/lkml/Zenx_Q0UiwMbSAdP@FVFF77S0Q05N/

... I'm looking into fixing that at the moment, and it looks like that's
likely to require some per-architecture changes.

> On RISC-V we cannot use the same ideas as Arm64 straight off, because
> the range of jal (compared to BL) is simply too short (+/-1M). So, on
> RISC-V we need to use a full auipc/jalr pair (the text patching story
> is another chapter, but let's leave that aside for now). Since we have
> to patch multiple instructions, the cmodx situation doesn't really
> improve with CALL_OPS.

The branch range thing is annoying, but I think this boils down to the
same problem as arm64 has with needing a "MOV X9, LR" instruction that
we have to patch in once at boot time. You could do the same and patch
in the AUIPC once, e.g. have

|	NOP
|	NOP
| func:
|	AUIPC <tmp>, <ftrace_caller>
|	JALR <link>, <ftrace_caller>(<tmp>)	// patched with NOP

... which'd look very similar to arm64's sequence:

|	NOP
|	NOP
| func:
|	MOV X9, LR
|	BL ftrace_caller	// patched with NOP

... which I think means it *might* be better from a cmodx perspective?
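[Editor's note: to make the ops-literal idea above concrete, here is a
rough userspace C sketch. All names and offsets are hypothetical; in the
kernel the trampoline is written in per-arch assembly and the layout is
fixed by the architecture code, not by a struct like this.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, mirroring the CALL_OPS scheme: the per-callsite
 * ftrace_ops pointer is stored as a literal in the padding immediately
 * before the function's entry point. */
#define OPS_LITERAL_OFFSET sizeof(uintptr_t)

/* Emulate what the common trampoline does: given the address of the
 * function entry, read back the per-callsite ops pointer from the
 * literal at a fixed negative offset. */
static uintptr_t recover_ops(const unsigned char *func_entry)
{
	uintptr_t ops;

	/* memcpy stands in for a single aligned load in real code. */
	memcpy(&ops, func_entry - OPS_LITERAL_OFFSET, sizeof(ops));
	return ops;
}
```

Because the literal is a single naturally-aligned word, retargeting one
callsite to a different tracer is a single store to that word; the
NOP/branch instruction only needs patching to enable or disable the
callsite entirely.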
> Let's say that we continue building on your patch and implement direct
> calls on CALL_OPS for RISC-V as well.
>
> From Florent's commit message for direct calls:
>
> | There are a few cases to distinguish:
> | - If a direct call ops is the only one tracing a function:
> |   - If the direct called trampoline is within the reach of a BL
> |     instruction
> |     -> the ftrace patchsite jumps to the trampoline
> |   - Else
> |     -> the ftrace patchsite jumps to the ftrace_caller trampoline which
> |        reads the ops pointer in the patchsite and jumps to the direct
> |        call address stored in the ops
> | - Else
> |   -> the ftrace patchsite jumps to the ftrace_caller trampoline and its
> |      ops literal points to ftrace_list_ops so it iterates over all
> |      registered ftrace ops, including the direct call ops and calls its
> |      call_direct_funcs handler which stores the direct called
> |      trampoline's address in the ftrace_regs and the ftrace_caller
> |      trampoline will return to that address instead of returning to the
> |      traced function
>
> On RISC-V, where auipc/jalr is used, the direct called trampoline would
> always be reachable, and then the first Else-clause would never be
> entered. This means that the performance for direct calls would be the
> same as the one we have today (i.e. no regression!).
>
> RISC-V does it like x86 does (-ish) -- patch multiple instructions,
> long reach.
>
> Arm64 uses CALL_OPS and patches one BL instruction.
>
> Now, with this background in mind, compared to what we have today,
> CALL_OPS would give us (again assuming we're using it for direct
> calls):
>
> * Better performance for tracer per-call (faster ops lookup)  GOOD
> * Larger text size (function alignment + extra nops)          BAD
> * Same direct call performance                                NEUTRAL
> * Same complicated text patching required                     NEUTRAL

Is your current sequence safe for preemptible kernels (i.e. with
PREEMPT_FULL=y or PREEMPT_DYNAMIC=y + "preempt=full" on the kernel
cmdline)?
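[Editor's note: the hazard behind this question can be sketched in toy
userspace C. The types below are illustrative stand-ins, not kernel
code: the kernel patches instructions and waits with primitives like
synchronize_rcu_tasks(), not C11 atomics. The point is only that a
one-word callsite can be swapped atomically, while a two-word
(auipc/jalr-style) callsite has a window in which a preempted thread
can observe half of an update.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Like AUIPC + JALR: two separately-written words. A reader preempted
 * between them can combine the old `hi` with a new `lo` (or execute a
 * stale second half after the site was repatched). */
struct two_word_site {
	unsigned int hi;	/* upper immediate */
	unsigned int lo;	/* lower immediate */
};

/* Like a CALL_OPS ops literal: one word, updated atomically, so no
 * thread can ever observe a half-updated target. */
struct one_word_site {
	_Atomic(void *) ops;
};

static void write_one_word(struct one_word_site *s, void *ops)
{
	atomic_store_explicit(&s->ops, ops, memory_order_release);
}

static void *read_one_word(struct one_word_site *s)
{
	return atomic_load_explicit(&s->ops, memory_order_acquire);
}
```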
Looking at v6.8, IIUC you have:

	// unpatched		// patched
	NOP			AUIPC
	NOP			JALR

What prevents a thread being preempted mid-sequence such that it
executes:

	NOP

	[ preempted ]

	[ stop_machine() used to patch text ]

	[ restored ]

	JALR

... ?

... or when changing the call:

	AUIPC		// old funcptr

	[ preempted ]

	[ stop_machine() used to patch text ]

	[ restored ]

	JALR		// new funcptr

... ?

I suspect those might both be broken, but it's difficult to hit the race
and so testing hasn't revealed that so far.

> It would be interesting to see how the per-call performance would
> improve on x86 with CALL_OPS! ;-)

Heh. ;)

> I'm trying to wrap my head around whether it makes sense to have it on
> RISC-V, given that we're a bit different from Arm64. Does the scale tip
> to the GOOD side?
>
> Oh, and we really need to see performance numbers on real HW! I have a
> VF2 that I could try this series on.

I'll have to leave those two to you. I suspect the preempt/stop_machine()
issue might just tip that from NEUTRAL to GOOD.

Mark.

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv