From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2EA4DC4167B for ; Thu, 2 Nov 2023 17:37:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232479AbjKBRhg (ORCPT ); Thu, 2 Nov 2023 13:37:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232330AbjKBRhf (ORCPT ); Thu, 2 Nov 2023 13:37:35 -0400 Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5FF6D198 for ; Thu, 2 Nov 2023 10:37:31 -0700 (PDT) Received: by mail-pl1-x64a.google.com with SMTP id d9443c01a7336-1cc323b2aa3so8913455ad.3 for ; Thu, 02 Nov 2023 10:37:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1698946651; x=1699551451; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=uN1ssLkZHjDTrGZRWby6miGmTEsoKqfVN3YbqYw1LLA=; b=coSNqGSa47J/UZf0Pz9YGwIQ7TBXNp4w0vC+NhbSJdmfM63JiOkcuplyubh+1Bc1Xq wDZIUN+2gb4+E0zeGXXCd85Yf8kSu9+YVBg7mENZa8X2Td3Okm1qsYKwdlv1ZIziaahp phBKhJnDxgx037qgYQBLTXyDIWs04h12RX0vyIVuFVvywhiLTAp2dUtk4xYwj8sT53Kd wak23GwFfRo+wHgAjW8Ifr3xrZf2CDKVvRJ96S0KbOFd3kiUcpPysu6bM4aaMiQbe+Qx hptsOZZLoBiRBJmyrnh5Kp6KtkOEMsgVNRtO/71TejvAhuZF/s+jD0uHNwhko4LLUx/t wMTg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1698946651; x=1699551451; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=uN1ssLkZHjDTrGZRWby6miGmTEsoKqfVN3YbqYw1LLA=; b=SM8z70fhLuQsIalo1t6258tsecZo5sKMu2QIWh2WOS4qroy9s14sxAS0yHYleMTsKC hPWCpsrZVdMOHElZnOuJVQaUpbQ2odfQCa90bOw0MFBsLMJ8Zr2iva0LhNU0Qnqb/03c eq8HQoYRKQvRh20Ynq7FghzQto/O4sWVmkqFHskimYzl0s3BJ6NBwZrX9zD2mkXa8/U7 xgqKZeGyT2Jz9mbVCCFhVj/cwXgXLTLlh+1V+gLh48J6npC7kWzcoTNAkJPlF/iY4XhV uVOlS6oqRdmq6TIFVeJjImHGGLMbPRRJV98ImaLQP/vtDbrQZDI82KcjNfMinFMq+5Lz dXZA== X-Gm-Message-State: AOJu0YysYOv277BY6KTt5JPa6C8vk5yZ4mnfxVDMPgk2hD9j2KPiLWlA Sb++oISeVtlMhOLXj0ZA4k8hqylo6qc= X-Google-Smtp-Source: AGHT+IHO2BmoMwEG0Z4p7dUErvps7DbkHqeEBbdPqNa8vmvQCAafmC1+rihoFOkAVlqXxxzUlxPm8WkwUZU= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:903:25d2:b0:1cc:2ffe:5a27 with SMTP id jc18-20020a17090325d200b001cc2ffe5a27mr287356plb.9.1698946650780; Thu, 02 Nov 2023 10:37:30 -0700 (PDT) Date: Thu, 2 Nov 2023 10:37:29 -0700 In-Reply-To: Mime-Version: 1.0 References: <20231027182217.3615211-1-seanjc@google.com> <20231027182217.3615211-17-seanjc@google.com> <6642c379-1023-4716-904f-4bbf076744c2@redhat.com> Message-ID: Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: David Matlack Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , Alexander Viro , Christian Brauner , "Matthew Wilcox (Oracle)" , Andrew Morton , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li , Xu Yilun , Chao Peng , Fuad Tabba , Jarkko Sakkinen , Anish Moorthy , Yu Zhang , Isaku Yamahata , "=?utf-8?Q?Micka=C3=ABl_Sala=C3=BCn?=" , Vlastimil Babka , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: linux-mips@vger.kernel.org On Thu, Nov 02, 2023, David Matlack wrote: > On Thu, Nov 2, 2023 at 9:03=E2=80=AFAM Sean Christopherson wrote: > > > > On Thu, Nov 02, 2023, Paolo Bonzini wrote: > > > On 10/31/23 23:39, David Matlack wrote: > > > > > > Maybe can you sketch out how you see this proposal being extens= ible to > > > > > > using guest_memfd for shared mappings? > > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is= needed. What's > > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g.= KVM needs to > > > > > ensure there are no outstanding references when converting back t= o private. > > > > > > > > > > For TDX/SNP, assuming we don't find a performant and robust way t= o do in-place > > > > > conversions, a second fd+offset pair would be needed. > > > > Is there a way to support non-in-place conversions within a single = guest_memfd? > > > > > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to = guest > > > memory. The hook would invalidate now-private parts if they have a V= MA, > > > causing a SIGSEGV/EFAULT if the host touches them. > > > > > > It would forbid mappings from multiple gfns to a single offset of the > > > guest_memfd, because then the shared vs. private attribute would be t= ied to > > > the offset. This should not be a problem; for example, in the case o= f SNP, > > > the RMP already requires a single mapping from host physical address = to > > > guest physical address. > > > > I don't see how this can work. It's not a M:1 scenario (where M is mul= tiple gfns), > > it's a 1:N scenario (wheren N is multiple offsets). The *gfn* doesn't = change on > > a conversion, what needs to change to do non-in-place conversion is the= pfn, which > > is effectively the guest_memfd+offset pair. > > > > So yes, we *could* support non-in-place conversions within a single gue= st_memfd, > > but it would require a second offset, >=20 > Why can't KVM free the existing page at guest_memfd+offset and > allocate a new one when doing non-in-place conversions? Oh, I see what you're suggesting. Eww. It's certainly possible, but it would largely defeat the purpose of why we = are adding guest_memfd in the first place. For TDX and SNP, the goal is to provide a simple, robust mechanism for isol= ating guest private memory so that it's all but impossible for the host to access= private memory. As things stand, memory for a given guest_memfd is either private = or shared (assuming we support a second guest_memfd per memslot). I.e. there's no ne= ed to track whether a given page/folio in the guest_memfd is private vs. shared. We could use memory attributes, but that further complicates things when in= trahost migration (and potentially other multi-user scenarios) comes along, i.e. wh= en KVM supports linking multiple guest_memfd files to a single inode. We'd have t= o ensure that all "struct kvm" instances have identical PRIVATE attributes for a giv= en *offset* in the inode. I'm not even sure how feasible that is for intrahos= t migration, and that's the *easy* case, because IIRC it's already a hard req= uirement that the source and destination have identical gnf=3D>guest_memfd bindings,= i.e. KVM can somewhat easily reason about gfn attributes. But even then, that only helps with the actual migration of the VM, e.g. we= 'd still have to figure out how to deal with .mmap() and other shared vs. private ac= tions when linking a new guest_memfd file against an existing inode. I haven't seen the pKVM patches for supporting .mmap(), so maybe this is al= ready a solved problem, but I'd honestly be quite surprised if it all works corre= ctly if/when KVM supports multiple files per inode. And I don't see what value non-in-place conversions would add. The value a= dded by in-place conversions, aside from the obvious preservation of data, which= isn't relevant to TDX/SNP, is that it doesn't require freeing and reallocating me= mory to avoid double-allocating for private vs. shared. That's especialy quite = nice when hugepages are being used because reconstituing a hugepage "only" requi= res zapping SPTEs. But if KVM is freeing the private page, it's the same as punching a hole, p= robably quite literally, when mapping the gfn as shared. In every way I can think = of, it's worse. E.g. it's more complex for KVM, and the PUNCH_HOLE =3D> allocation = operations must be serialized. Regarding double-allocating, I really, really think we should solve that in= the guest. I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granula= rity and avoid 4KiB conversions. 4KiB conversions aren't just a memory utilizat= ion problem, they're also a performance problem, e.g. shatters hugepages (which= KVM doesn't yet support recovering) and increases TLB pressure for both stage-1= and stage-2 mappings.