Date: Wed, 27 Jul 2022 12:18:41 +0100
From: Marc Zyngier <maz@kernel.org>
To: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Oliver Upton <oliver.upton@linux.dev>, Will Deacon <will@kernel.org>,
    kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org
Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
References: <20220419141012.GB6143@willie-the-truck> <875yjiyka4.wl-maz@kernel.org>

On 2022-07-27 11:56, Alexandru Elisei wrote:
> Hi Marc,
>
> On Wed, Jul 27, 2022 at 10:30:59AM +0100, Marc Zyngier wrote:
>> On Tue, 26 Jul 2022 18:51:21 +0100,
>> Oliver Upton wrote:
>> >
>> > Hi Alex,
>> >
>> > On Mon, Jul 25, 2022 at 11:06:24AM +0100, Alexandru Elisei wrote:
>> >
>> > [...]
>> >
>> > > > A funkier approach might be to defer pinning of the buffer until the SPE is
>> > > > enabled and avoid pinning all of VM memory that way, although I can't
>> > > > immediately tell how flexible the architecture is in allowing you to cache
>> > > > the base/limit values.
>> > >
>> > > I was investigating this approach, and Mark raised a concern that I think
>> > > might be a showstopper.
>> > >
>> > > Let's consider this scenario:
>> > >
>> > > Initial conditions: guest at EL1, profiling disabled (PMBLIMITR_EL1.E = 0,
>> > > PMBSR_EL1.S = 0, PMSCR_EL1.{E0SPE,E1SPE} = {0,0}).
>> > >
>> > > 1. Guest programs the buffer and enables it (PMBLIMITR_EL1.E = 1).
>> > > 2. Guest programs SPE to enable profiling at **EL0**
>> > > (PMSCR_EL1.{E0SPE,E1SPE} = {1,0}).
>> > > 3. Guest changes the translation table entries for the buffer. The
>> > > architecture allows this.
>> > > 4. Guest does an ERET to EL0, thus enabling profiling.
>> > >
>> > > Since KVM cannot trap the ERET to EL0, it will be impossible for KVM to pin
>> > > the buffer at stage 2 when profiling gets enabled at EL0.
>> >
>> > Not saying we necessarily should, but this is possible with FGT no?
>>
>> Given how often ERET is used at EL1, I'd really refrain from doing
>> so. NV uses the same mechanism to multiplex vEL2 and vEL1 on the real
>> EL1, and this comes at a serious cost (even an exception return that
>> stays at the same EL gets trapped).
>> Once EL1 runs, we disengage this
>> trap because it is otherwise way too costly.
>>
>> >
>> > > I can see two solutions here:
>> > >
>> > > a. Accept the limitation (and advertise it in the documentation) that if
>> > > someone wants to use SPE when running as a Linux guest, the kernel used by
>> > > the guest must not change the buffer translation table entries after the
>> > > buffer has been enabled (PMBLIMITR_EL1.E = 1). Linux already does that, so
>> > > running a Linux guest should not be a problem. I don't know how other OSes
>> > > do it (but I can find out). We could also phrase it that the buffer
>> > > translation table entries can be changed after enabling the buffer, but
>> > > only if profiling happens at EL1. But that sounds very arbitrary.
>> > >
>> > > b. Pin the buffer after the stage 2 DABT that SPE will report in the
>> > > situation above. This means that there is a blackout window, but it will
>> > > happen only once after each time the guest reprograms the buffer. I don't
>> > > know if this is acceptable. We could say that if this blackout window
>> > > is not acceptable, then the guest kernel shouldn't change the translation
>> > > table entries after enabling the buffer.
>> > >
>> > > Or drop the approach of pinning the buffer and go back to pinning the
>> > > entire memory of the VM.
>> > >
>> > > Any thoughts on this? I would very much prefer to try to pin only the
>> > > buffer.
>> >
>> > Doesn't pinning the buffer also imply pinning the stage 1 tables
>> > responsible for its translation as well? I agree that pinning the buffer
>> > is likely the best way forward as pinning the whole of guest memory is
>> > entirely impractical.
>>
>> How different is this from device assignment, which also relies on
>> full page pinning? The way I look at it, SPE is a device directly
>> assigned to the guest, and isn't capable of generating synchronous
>> exceptions. Not that I'm madly in love with the approach, but this is
>> at least consistent.
>> There were also some concerns around buggy HW that
>> would blow itself up on S2 faults, but I think these implementations
>> are confidential enough that we don't need to worry about them.
>>
>> > I'm also a bit confused on how we would manage to un-pin memory on the
>> > way out with this. The guest is free to muck with the stage 1 and could
>> > cause the SPU to spew a bunch of stage 2 aborts if it wanted to be
>> > annoying. One way to tackle it would be to only allow a single
>> > root-to-target walk to be pinned by a vCPU at a time. Any time a new
>> > stage 2 abort comes from the SPU, we un-pin the old walk and pin the new
>> > one instead.
>>
>> This sounds like a reasonable option. Only one IPA range covering the
>> SPE buffer (as described by the translation of PMBPTR_EL1) is pinned
>> at any given time. Generate an SPE S2 fault outside of this range, and
>> we unpin the region before mapping in the next one. Yes, the guest can
>> play tricks on us and exploit the latency of the interrupt. But at the
>> end of the day, this is its own problem.
>>
>> Of course, this results in larger blind windows. Ideally, we should be
>> able to report these to the guest, either as sideband data or in the
>> actual profiling buffer (but I have no idea whether this is possible).
>
> I believe solution b, pin the buffer when guest enables profiling
> (where by profiling enabled I mean StatisticalProfilingEnabled()
> returns true), and only in the situation that I described pin the
> buffer as a result of a stage 2 fault, would reduce the blackouts
> to a minimum.

In all honesty, I'd rather see everything be done as the result of
an S2 fault for now, and only introduce heuristics to reduce the
blackout window at a later time. And this includes buffer pinning
if that can be avoided.

My hunch is that people wanting zero blackout will always pin
all their memory, one way or another, and that the rest of us
will be happy just to get *something* out of SPE in a VM...

        M.
-- 
Jazz is not dead. It just smells funny...
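
The "only one pinned IPA range per vCPU" policy discussed in this thread can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not KVM code: the names VcpuSpePin, handle_spe_s2_fault and page_range are hypothetical, a 4 KiB page size is assumed, and "pinning" is modeled as simply remembering a range. It shows only the replace-on-fault behaviour: a fault outside the current range unpins the old range before the new one is pinned.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PAGE_SIZE = 4096  # assumption: 4 KiB pages

Range = Tuple[int, int]  # page-aligned [start, end) IPA range


def page_range(base_ipa: int, size: int) -> Range:
    """Return the page-aligned [start, end) range covering the buffer."""
    start = base_ipa & ~(PAGE_SIZE - 1)
    end = (base_ipa + size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    return start, end


@dataclass
class VcpuSpePin:
    """Hypothetical per-vCPU state: at most one pinned IPA range."""
    pinned: Optional[Range] = None
    log: List[Tuple[str, Range]] = field(default_factory=list)

    def covers(self, ipa: int) -> bool:
        """True if the IPA falls inside the currently pinned range."""
        return self.pinned is not None and self.pinned[0] <= ipa < self.pinned[1]

    def handle_spe_s2_fault(self, fault_ipa: int, buf_ipa: int, buf_size: int) -> bool:
        """Model an SPE stage 2 fault; return True if the pinned range changed."""
        if self.covers(fault_ipa):
            return False  # already pinned, nothing to replace
        if self.pinned is not None:
            # Unpin the old range before pinning the next one.
            self.log.append(("unpin", self.pinned))
        self.pinned = page_range(buf_ipa, buf_size)
        self.log.append(("pin", self.pinned))
        return True


if __name__ == "__main__":
    vcpu = VcpuSpePin()
    vcpu.handle_spe_s2_fault(0x10010, 0x10010, 0x2000)  # first fault pins the buffer
    vcpu.handle_spe_s2_fault(0x40000, 0x40000, 0x1000)  # guest moved the buffer
    for action, (start, end) in vcpu.log:
        print(f"{action}: [{start:#x}, {end:#x})")
```

Because the state holds a single range, a guest that keeps moving its buffer can never make the host pin more than one buffer's worth of pages per vCPU; the cost is a blackout window on every move, as noted in the thread.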