From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 76C05C00140 for ; Wed, 10 Aug 2022 15:27:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:In-Reply-To:MIME-Version:References: Message-ID:Subject:Cc:To:From:Date:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=hpIU1uAqoeBY+ZqjGq/jEDBuK0DAIkXQBw2WrHEAWEM=; b=LYQi2MtZ7HRKBB n7uAwUJm91QuhA7df1eXv5P9yFvLAeZhEwhFvu2POFKjDCugMyVq1nODkZdOm9FnU7sZafpOPS9Hf ss3rOP0igUx+lnvDOxAHJhrW70rNTnkgbiYUcZ8XLYHFjqSKaFs8NYbVLnMhSqVu54/Wz4i9YnPNW ucPHxvsEF8bFkW+X+ShXoeve1xA3VAGSZSxvHJ+rVEXo3+TT0qq3RkJPBq7Or8nrAoN6EHinpJ2Sr WuMQ/ux8lz7Le62cGbFzoNbmsGbF0vIFWmGd2D4Xrofse56ZXL/Tu4n0YI9rxp5C4lO1wtsoj6fAf 3dcDU7Dqklm2TOy+cbYg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1oLnbD-00CYtw-8s; Wed, 10 Aug 2022 15:26:23 +0000 Received: from out2.migadu.com ([2001:41d0:2:aacc::]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1oLnay-00CYpa-BV for linux-arm-kernel@lists.infradead.org; Wed, 10 Aug 2022 15:26:21 +0000 Date: Wed, 10 Aug 2022 10:25:56 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1660145161; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=crgVp9NsCb0L7kNNYBvx2bNXUBSIsZMhWwhI7LZsPBI=; b=ku/7SXkeVEh09MrmbC2RQi4olqOTqH+SUijKnoNarWUxYgP4yf72cyXRmLh0rLrGTP/A4/ oTBRIDa1oxNAv59702KMnci4dB6B2d6QjcRc2w6wD6mKO7+ilsRspGsoXS2hN3L9/kNKUX A8H92ZtCKRe3jL9kYuvC8KPZ2lcEqlI= X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Oliver Upton To: Alexandru Elisei Cc: maz@kernel.org, Will Deacon , kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory Message-ID: References: <20220419141012.GB6143@willie-the-truck> <20220801170055.GB26471@willie-the-truck> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: X-Migadu-Flow: FLOW_OUT X-Migadu-Auth-User: linux.dev X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220810_082619_207153_6E3CE21A X-CRM114-Status: GOOD ( 69.79 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Wed, Aug 10, 2022 at 10:37:26AM +0100, Alexandru Elisei wrote: > Hi, > > On Tue, Aug 09, 2022 at 01:43:32PM -0500, Oliver Upton wrote: > > Hi Alex, > > > > On Tue, Aug 09, 2022 at 03:01:36PM +0100, Alexandru Elisei wrote: > > > > [...] > > > > > > > To summarize the approaches we've discussed so far: > > > > > > > > > > 1. Pinning the entire guest memory > > > > > - Heavy handed and not ideal. > > > > > - Tried this approach in v5 of the SPE series [1], patches #2-#12. > > > > > > > > > > 2. Mapping the guest SPE buffer on demand, page by page, as a result of stage 2 > > > > > faults reported by SPE. > > > > > - Not feasible, because the entire contents of the buffer must be discarded is > > > > > PMBSR_EL1.DL is set to 1 when taking the fault. > > > > > - Requires KVM to walk the guest's stage 1 tables, because SPE reports the VA, > > > > > not the IPA. > > > > > > > > > > 3. Pinning the guest SPE buffer when profiling becomes enabled*: > > > > > - There is the corner case described above, when profiling becomes enabled as a > > > > > result of an ERET to EL0. This can happen when the buffer is enabled and > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; > > > > > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > > > > > stage 2 faults when draining the buffer, which is performed with profiling > > > > > disabled. > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > 4. Pin the entire guest SPE buffer after the first stage 2 fault reported by > > > > > SPE. > > > > > - Gets rid of the corner case at 3. > > > > > - Same approach to buffer unpinning as 3. > > > > > - Introduces a blackout window before the first record is written. > > > > > - Also requires KVM to walk the guest's stage 1 tables. > > > > > > > > > > As for the corner case at 3, I proposed either: > > > > > > > > > > a) Mandate that guest operating systems must never modify the buffer > > > > > translation entries if the buffer is enabled and > > > > > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}. > > > > > > > > > > b) Pin the entire buffer as a result of the first stage 2 fault reported by SPE, > > > > > but **only** for this corner case. For all other cases, the buffer is pinned > > > > > when profiling becomes enabled, to eliminate the blackout window. Guest > > > > > operating systems can be modified to not change the translation entries for the > > > > > buffer if this blackout window is not desirable. > > > > > > > > > > Pinning as a result of the **first** stage 2 fault should work, because there > > > > > are no prior records that would have to be discarded if PMSBR_EL1.DL = 1. > > > > > > > > > > I hope I haven't missed anything. Thoughts and suggestions more than welcome. > > > > > > > > Thanks Alex for pulling together all of the context here. > > > > > > > > Unless there's any other strong opinions on the topic, it seems to me > > > > that option #4 (pin on S2 fault) is probably the best approach for > > > > the initial implementation. No amount of tricks in KVM can work around > > > > the fact that SPE has some serious issues w.r.t. virtualization. With > > > > that, we should probably document the behavior of SPE as a known erratum > > > > of KVM. > > > > > > > > If folks complain about EL1 profile blackout, eagerly pinning when > > > > profiling is enabled could layer on top quite easily by treating it as > > > > a synthetic S2 fault and triggering the implementation of #4. Having > > > > > > I'm not sure I follow, I understand what you mean by "treating it as a > > > synthetic S2 fault", would you mind elaborating? > > > > Assuming approach #4 is implemented, we will already have an SPE fault > > handler that walks stage-1 and pins the buffer. At that point, > > implementing approach #3 would be relatively easy. When EL1 sets > > PMSCR_EL1.E1SPE, call the SPE fault handler on the GVA of the buffer. > > I see, that makes sense, thanks, > > > > > > > said that I don't believe it is a hard requirement for enabling some > > > > flavor of SPE for guests. > > > > > > > > Walking guest S1 in KVM doesn't sound too exciting although it'll need to > > > > be done eventually. > > > > > > > > Do you feel like this is an OK route forward, or have I missed > > > > something? > > > > > > I've been giving this some thought, and I prefer approach #3 because with > > > #4, pinning the buffer as a result of a stage 2 fault reported by SPE, it > > > will be impossible to distinguish between a valid stage 2 fault (a fault > > > caused by the guest reprogramming the buffer and enabling profiling) and > > > KVM messing something up when pinning the buffer. I believe this to be > > > important, as experience has shown me that pinning the buffer at stage 2 is > > > not trivial and there isn't a mechanism today in Linux to do that > > > (explanation and examples here [1]). > > > > How does eagerly pinning avoid stage-2 aborts, though? As you note in > > [1], page pinning does not avoid the possibility of the MMU notifiers > > being called on a given range. Want to make sure I'm following, what > > is your suggestion for approach #3 to handle the profile buffer when > > only enabled at EL0? > > > > > With approach #4, it would be impossible to figure out if the results of a > > > profiling operations inside a guest are representative of the workload or > > > not, because those SPE stage 2 faults triggered by a bug in KVM can happen > > > multiple times per profiling session, introducing multiple blackout windows > > > that can skew the results. > > > > > > If you're proposing that the blackout window when the first record is > > > written be documented as an erratum for KVM, then why no got a step further > > > and document as an erratum that changing the buffer translation tables > > > after the buffer has been enabled will lead to an SPE Serror? That will > > > allow us to always pin the buffer when profiling is enabled. > > > > Ah, there are certainly more errata in virtualizing SPE beyond what I > > had said :) Preserving the stage-1 translations while profiling is > > active is a good recommendation, although I'm not sure that we've > > completely eliminated the risk of stage-2 faults. > > > > It seems impossible to blame the guest for all stage-2 faults that happen > > in the middle of a profiling session. In addition to host mm driven changes > > to stage-2, live migration is a busted as well. You'd need to build out > > stage-2 on the target before resuming the guest and guarantee that the > > appropriate pages have been demanded from the source (in case of post-copy). > > > > So, are we going to inject an SError for stage-2 faults outside of guest > > control as well? An external abort reported as an SPE buffer management > > event seems to be gracefully handled by the Linux driver, but that behavior > > is disallowed by SPEv1p3. > > > > To sum up the point I'm getting at: I agree that there are ways to > > reduce the risk of stage-2 faults in the middle of profiling, but I > > don't believe the current architecture allows KVM to virtualize the > > feature to the letter of the specification. > > I believe there's some confusion here: emulating SPE **does not work** if > stage 2 faults are triggered in the middle of a profiling session. Being > able to have a memory range never unmapped from stage 2 is a > **prerequisite** and is **required** for SPE emulation, it's not a nice to > have. > > A stage 2 fault before the first record is written is acceptable because > there are no other records already written which need to be thrown away. > Stage 2 faults after at least one record has been written are unacceptable > because it means that the contents of the buffer needs to thrown away. > > Does that make sense to you? > > I believe it is doable to have addresses always mapped at stage 2 with some > changes to KVM, but that's not what this thread is about. This thread is > about how and when to pin the buffer. Sorry if I've been forcing a tangent, but I believe there is a lot of value in discussing what is to be done for keeping the stage-2 mapping alive. I've been whining about it out of the very concern you highlight: a stage-2 fault in the middle of the profile is game over. Otherwise, optimizations in *when* we pin the buffer seem meaningless as stage-2 faults appear unavoidable. Nonetheless, back to your proposal. Injecting some context from earlier: > 3. Pinning the guest SPE buffer when profiling becomes enabled*: So we are only doing this when enabled for EL1, right? (PMSCR_EL1.{E0SPE,E1SPE} = {x, 1}) > - There is the corner case described above, when profiling becomes enabled as a > result of an ERET to EL0. This can happen when the buffer is enabled and > PMSCR_EL1.{E0SPE,E1SPE} = {1,0}; Is your proposal for the EL0 case to pin on fault or pin when E0SPE is set (outside of the architectures definition of when profiling is enabled)? > - The previous buffer is unpinned when a new buffer is pinned, to avoid SPE > stage 2 faults when draining the buffer, which is performed with profiling > disabled. Sounds reasonable. > As long as we're all agreed that buffer memory needs "pinning" (as in the > IPA are never unmapped from stage 2 until KVM decides otherwise as part of > SPE emulation), I believe that live migration is tangential to figuring out > how and when the buffer should be "pinned". I'm more than happy to start a > separate thread about live migration after we figure out how we should go > about "pinning" the buffer, I think your insight would be most helpful :) Fair enough, let's see how this all shakes out and then figure out LM thereafter :) -- Thanks, Oliver _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel