Date: Thu, 2 Nov 2023 10:37:29 -0700
References: <20231027182217.3615211-1-seanjc@google.com>
 <20231027182217.3615211-17-seanjc@google.com>
 <6642c379-1023-4716-904f-4bbf076744c2@redhat.com>
Subject: Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for
 guest-specific backing memory
From: Sean Christopherson
To: David Matlack
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman,
 Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexander Viro,
 Christian Brauner, "Matthew Wilcox (Oracle)", Andrew Morton,
 kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 kvmarm@lists.linux.dev, linux-mips@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org,
 linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Xiaoyao Li, Xu Yilun,
 Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang,
 Isaku Yamahata, Mickaël Salaün, Vlastimil Babka, Vishal Annapurve,
 Ackerley Tng, Maciej Szmigiero, David Hildenbrand, Quentin Perret,
 Michael Roth, Wang, Liam Merwick, Isaku Yamahata, "Kirill A . Shutemov"

On Thu, Nov 02, 2023, David Matlack wrote:
> On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson wrote:
> >
> > On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > > On 10/31/23 23:39, David Matlack wrote:
> > > > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > > > using guest_memfd for shared mappings?
> > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.  What's
> > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to
> > > > > ensure there are no outstanding references when converting back to private.
> > > > >
> > > > > For TDX/SNP, assuming we don't find a performant and robust way to do in-place
> > > > > conversions, a second fd+offset pair would be needed.
> > > > Is there a way to support non-in-place conversions within a single guest_memfd?
> > >
> > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > > memory.
> > > The hook would invalidate now-private parts if they have a VMA,
> > > causing a SIGSEGV/EFAULT if the host touches them.
> > >
> > > It would forbid mappings from multiple gfns to a single offset of the
> > > guest_memfd, because then the shared vs. private attribute would be tied to
> > > the offset.  This should not be a problem; for example, in the case of SNP,
> > > the RMP already requires a single mapping from host physical address to
> > > guest physical address.
> >
> > I don't see how this can work.  It's not a M:1 scenario (where M is multiple gfns),
> > it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't change on
> > a conversion, what needs to change to do non-in-place conversion is the pfn, which
> > is effectively the guest_memfd+offset pair.
> >
> > So yes, we *could* support non-in-place conversions within a single guest_memfd,
> > but it would require a second offset,
>
> Why can't KVM free the existing page at guest_memfd+offset and
> allocate a new one when doing non-in-place conversions?

Oh, I see what you're suggesting.  Eww.

It's certainly possible, but it would largely defeat the purpose of adding
guest_memfd in the first place.

For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating
guest private memory so that it's all but impossible for the host to access private
memory.  As things stand, memory for a given guest_memfd is either private or shared
(assuming we support a second guest_memfd per memslot).  I.e. there's no need to track
whether a given page/folio in the guest_memfd is private vs. shared.

We could use memory attributes, but that further complicates things when intrahost
migration (and potentially other multi-user scenarios) comes along, i.e. when KVM
supports linking multiple guest_memfd files to a single inode.  We'd have to ensure
that all "struct kvm" instances have identical PRIVATE attributes for a given
*offset* in the inode.
I'm not even sure how feasible that is for intrahost
migration, and that's the *easy* case, because IIRC it's already a hard requirement
that the source and destination have identical gfn=>guest_memfd bindings, i.e. KVM
can somewhat easily reason about gfn attributes.

But even then, that only helps with the actual migration of the VM, e.g. we'd still
have to figure out how to deal with .mmap() and other shared vs. private actions when
linking a new guest_memfd file against an existing inode.

I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already
a solved problem, but I'd honestly be quite surprised if it all works correctly
if/when KVM supports multiple files per inode.

And I don't see what value non-in-place conversions would add.  The value added
by in-place conversions, aside from the obvious preservation of data, which isn't
relevant to TDX/SNP, is that it doesn't require freeing and reallocating memory
to avoid double-allocating for private vs. shared.  That's especially nice
when hugepages are being used because reconstituting a hugepage "only" requires
zapping SPTEs.

But if KVM is freeing the private page, it's the same as punching a hole, probably
quite literally, when mapping the gfn as shared.  In every way I can think of, it's
worse.  E.g. it's more complex for KVM, and the PUNCH_HOLE => allocation operations
must be serialized.

Regarding double-allocating, I really, really think we should solve that in the
guest.  I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity
and avoid 4KiB conversions.  4KiB conversions aren't just a memory utilization
problem, they're also a performance problem, e.g. they shatter hugepages (which KVM
doesn't yet support recovering) and increase TLB pressure for both stage-1 and
stage-2 mappings.
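
To put rough numbers on the 2MiB vs. 4KiB point, here is a minimal userspace
sketch.  It is not code from this series and not KVM code.  The helper names
(pages_per_hugepage(), conversion_is_2m_aligned(), hugepages_shattered()) are
hypothetical, and it assumes a conversion request is a plain [gpa, gpa + size)
range that does not overflow 64 bits.  It only counts how many 2MiB regions a
request covers partially, i.e. how many hugepages would have to be split.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K (UINT64_C(1) << 12)
#define PAGE_SIZE_2M (UINT64_C(1) << 21)

/* Number of 4KiB pages that make up one 2MiB hugepage. */
static inline uint64_t pages_per_hugepage(void)
{
	return PAGE_SIZE_2M / PAGE_SIZE_4K;
}

/* True if [gpa, gpa + size) is non-empty and covers only whole 2MiB regions. */
static inline bool conversion_is_2m_aligned(uint64_t gpa, uint64_t size)
{
	return size != 0 && (gpa % PAGE_SIZE_2M) == 0 && (size % PAGE_SIZE_2M) == 0;
}

/*
 * Number of 2MiB regions that [gpa, gpa + size) covers only partially.
 * Each such region would no longer be uniformly private or shared after
 * the conversion, so a hugepage backing it would have to be split.
 * Assumes gpa + size does not overflow.
 */
static inline uint64_t hugepages_shattered(uint64_t gpa, uint64_t size)
{
	uint64_t end, first, last;
	bool head_partial, tail_partial;

	if (size == 0)
		return 0;

	end = gpa + size;
	first = gpa / PAGE_SIZE_2M;
	last = (end - 1) / PAGE_SIZE_2M;
	head_partial = (gpa % PAGE_SIZE_2M) != 0;
	tail_partial = (end % PAGE_SIZE_2M) != 0;

	/* The whole range sits inside a single 2MiB region. */
	if (first == last)
		return (head_partial || tail_partial) ? 1 : 0;

	/* Only the first and last regions can be partially covered. */
	return (head_partial ? 1 : 0) + (tail_partial ? 1 : 0);
}
```

With these helpers, a single 4KiB conversion at a 2MiB-aligned address, e.g.
hugepages_shattered(0x200000, 0x1000), counts one shattered hugepage, whereas
a request for which conversion_is_2m_aligned() is true counts none.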