Date: Mon, 13 May 2024 13:36:29 -0700
References: <58f39f23-0314-4e34-a8c7-30c3a1ae4777@amazon.co.uk>
Subject: Re: Unmapping KVM Guest Memory from Host Kernel
From: Sean Christopherson
To: James Gowans
Cc: kvm@vger.kernel.org, linux-coco@lists.linux.dev, Nikita Kalyazin,
    rppt@kernel.org, qemu-devel@nongnu.org, Patrick Roy, somlo@cmu.edu,
    vbabka@suse.cz, akpm@linux-foundation.org,
    kirill.shutemov@linux.intel.com, Liam.Howlett@oracle.com,
    David Woodhouse, pbonzini@redhat.com, linux-mm@kvack.org,
    Alexander Graf,
    Derek Manwaring, chao.p.peng@linux.intel.com, lstoakes@gmail.com,
    mst@redhat.com

On Mon, May 13, 2024, James Gowans wrote:
> On Mon, 2024-05-13 at 10:09 -0700, Sean Christopherson wrote:
> > On Mon, May 13, 2024, James Gowans wrote:
> > > On Mon, 2024-05-13 at 08:39 -0700, Sean Christopherson wrote:
> > > > > Sean, you mentioned that you envision guest_memfd also supporting non-CoCo VMs.
> > > > > Do you have some thoughts about how to make the above cases work in the
> > > > > guest_memfd context?
> > > >
> > > > Yes.  The hand-wavy plan is to allow selectively mmap()ing guest_memfd().  There
> > > > is a long thread[*] discussing how exactly we want to do that.  The TL;DR is that
> > > > the basic functionality is also straightforward; the bulk of the discussion is
> > > > around gup(), reclaim, page migration, etc.
> > >
> > > I still need to read this long thread, but just a thought on the word
> > > "restricted" here: for MMIO the instruction can be anywhere and
> > > similarly the load/store MMIO data can be anywhere. Does this mean that
> > > for running unmodified non-CoCo VMs with guest_memfd backend that we'll
> > > always need to have the whole of guest memory mmapped?
> >
> > Not necessarily, e.g. KVM could re-establish the direct map or mremap() on-demand.
> > There are variations on that, e.g. if ASI[*] were to ever make its way upstream,
> > which is a huge if, then we could have guest_memfd mapped into a KVM-only CR3.
>
> Yes, on-demand mapping in of guest RAM pages is definitely an option. It
> sounds quite challenging to need to always go via interfaces which
> demand map/fault memory, and also potentially quite slow needing to
> unmap and flush afterwards.
>
> Not too sure what you have in mind with "guest_memfd mapped into
> KVM-only CR3" - could you expand?

Remove guest_memfd from the kernel's direct map, e.g. so that the kernel at-large
can't touch guest memory, but have a separate set of page tables that have the
direct map, userspace page tables, _and_ kernel mappings for guest_memfd.  On
KVM_RUN (or vcpu_load()?), switch to KVM's CR3 so that KVM's map/unmap operations
are always free (literal nops).

That's an imperfect solution as IRQs and NMIs will run kernel code with KVM's
page tables, i.e. guest memory would still be exposed to the host kernel.  And
of course we'd need to get buy-in from multiple architectures and maintainers,
etc.

> > > I guess the idea is that this use case will still be subject to the
> > > normal restriction rules, but for a non-CoCo non-pKVM VM there will be
> > > no restriction in practice, and userspace will need to mmap everything
> > > always?
> > >
> > > It really seems yucky to need to have all of guest RAM mmapped all the
> > > time just for MMIO to work...
> > > But I suppose there is no way around that
> > > for Intel x86.
> >
> > It's not just MMIO.  Nested virtualization, and more specifically shadowing nested
> > TDP, is also problematic (probably more so than MMIO).  And there are more cases,
> > i.e. we'll need a generic solution for this.  As above, there are a variety of
> > options, it's largely just a matter of doing the work.  I'm not saying it's a
> > trivial amount of work/effort, but it's far from an unsolvable problem.
>
> I didn't even think of nested virt, but that will absolutely be an even
> bigger problem too. MMIO was just the first roadblock which illustrated
> the problem.
> Overall what I'm trying to figure out is whether there is any sane path
> here other than needing to mmap all guest RAM all the time. Trying to
> get nested virt and MMIO and whatever else needs access to guest RAM
> working by doing just-in-time (aka: on-demand) mappings and unmappings
> of guest RAM sounds like a painful game of whack-a-mole, potentially
> really bad for performance too.

It's a whack-a-mole game that KVM already plays, e.g. for dirty tracking, post-copy
demand paging, etc.  There is still plenty of room for improvement, e.g. to reduce
the number of touchpoints and thus the potential for missed cases.  But KVM more
or less needs to solve this basic problem no matter what, so I don't think that
guest_memfd adds much, if any, burden.

> Do you think we should look at doing this on-demand mapping, or, for
> now, simply require that all guest RAM is mmapped all the time and KVM
> be given a valid virtual addr for the memslots?

I don't think "map everything into userspace" is a viable approach, precisely
because it requires reflecting that back into KVM's memslots, which in turn
means guest_memfd needs to allow gup().  And I don't think we want to allow gup(),
because that opens a rather large can of worms (see the long thread I linked).

Hmm, a slightly crazy idea (ok, maybe wildly crazy) would be to support mapping
all of guest_memfd into kernel address space, but as USER=1 mappings.  I.e. don't
require a carve-out from userspace, but do require CLAC/STAC when accessing guest
memory from the kernel.  I think/hope that would provide the speculative execution
mitigation properties you're looking for?

Userspace would still have access to guest memory, but it would take a truly
malicious userspace for that to matter.  And when CPUs that support LASS come
along, userspace would be completely unable to access guest memory through KVM's
magic mapping.

This too would require a decent amount of buy-in from outside of KVM, e.g. to
carve out the virtual address range in the kernel.  But the performance overhead
would be identical to the status quo.  And there could be advantages to being
able to identify accesses to guest memory based purely on kernel virtual address.
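
To make the access rules above concrete, here is a rough userspace toy model
(not kernel code, and not something proposed in this thread).  It assumes the
usual x86 semantics: with SMAP, a supervisor access to a USER=1 page faults
unless EFLAGS.AC is set via STAC, and with LASS, a user-mode access to a
kernel-half address faults regardless of the page tables.  The carve-out range
(GMEM_KVA_START/GMEM_KVA_END), struct cpu_model, and all function names are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical carve-out of kernel virtual address space for guest_memfd. */
#define GMEM_KVA_START 0xffffd00000000000ULL
#define GMEM_KVA_END   0xffffd08000000000ULL /* exclusive */

/* Toy CPU state; only the bits relevant to this discussion are modeled. */
struct cpu_model {
	bool user_mode; /* CPL == 3 */
	bool smap;      /* CR4.SMAP enabled */
	bool ac;        /* EFLAGS.AC: set by STAC, cleared by CLAC */
	bool lass;      /* LASS enabled */
};

/* Identify a guest-memory access purely from the kernel virtual address. */
static inline bool is_guest_kva(uint64_t va)
{
	return va >= GMEM_KVA_START && va < GMEM_KVA_END;
}

/* Would a data access to a USER=1 page mapped at @va succeed in this model? */
static inline bool user_page_access_ok(const struct cpu_model *cpu, uint64_t va)
{
	bool kernel_half = (va >> 63) != 0;

	if (cpu->user_mode)
		/* Paging permits it (USER=1); LASS rejects kernel-half addresses. */
		return !(cpu->lass && kernel_half);

	/* Supervisor access to a USER=1 page faults under SMAP unless AC is set. */
	return !cpu->smap || cpu->ac;
}

/* Walk through the cases discussed above. */
static inline void guest_kva_model_selftest(void)
{
	const uint64_t va = GMEM_KVA_START + 0x1000;
	struct cpu_model cpu = { .user_mode = false, .smap = true };

	assert(is_guest_kva(va));
	assert(!is_guest_kva(GMEM_KVA_END));

	assert(!user_page_access_ok(&cpu, va)); /* kernel without STAC: fault */
	cpu.ac = true;                          /* STAC */
	assert(user_page_access_ok(&cpu, va));  /* kernel inside STAC/CLAC: ok */
	cpu.ac = false;                         /* CLAC */

	cpu.user_mode = true;
	assert(user_page_access_ok(&cpu, va));  /* pre-LASS: userspace can reach it */
	cpu.lass = true;
	assert(!user_page_access_ok(&cpu, va)); /* LASS: userspace is locked out */
}
```

The model only covers addresses inside the kernel-half carve-out; it says
nothing about speculation, which is the part that would actually need to be
validated on real hardware.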