From mboxrd@z Thu Jan 1 00:00:00 1970
From: Sean Christopherson
Date: Thu, 27 Jul 2023 10:13:07 -0700
Subject: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
In-Reply-To:
References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com>
Message-ID:
List-Id:
To: kvm-riscv@lists.infradead.org
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Thu, Jul 27, 2023, Fuad Tabba wrote:
> Hi Sean,
>
> ...
>
> > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp,
> >  	case KVM_GET_STATS_FD:
> >  		r = kvm_vm_ioctl_get_stats_fd(kvm);
> >  		break;
> > +	case KVM_CREATE_GUEST_MEMFD: {
> > +		struct kvm_create_guest_memfd guest_memfd;
> > +
> > +		r = -EFAULT;
> > +		if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd)))
> > +			goto out;
> > +
> > +		r = kvm_gmem_create(kvm, &guest_memfd);
> > +		break;
> > +	}
>
> I'm thinking line of sight here, by having this as a vm ioctl (rather
> than a system ioctl), would it complicate making it possible in the
> future to share/donate memory between VMs?

Maybe, but I hope not?

There would still be a primary owner of the memory, i.e. the memory would still
need to be allocated in the context of a specific VM.  And the primary owner
should be able to restrict privileges, e.g. allow a different VM to read but
not write memory.

My current thinking is to (a) tie the lifetime of the backing pages to the
inode, i.e. allow allocations to outlive the original VM, and (b) create a new
file each time memory is shared/donated with a different VM (or other entity in
the kernel).

That should make it fairly straightforward to provide different permissions,
e.g. track them per-file, and I think should also avoid the need to change the
memslot binding logic since each VM would have its own view/bindings.
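
To make (a) and (b) concrete, here is a rough userspace toy model, not kernel code. Every name in it (gmem_inode, gmem_file, gmem_file_create() and so on) is made up for illustration and is not taken from the patch series. It only assumes what is described above: the backing storage lives as long as the inode has at least one file, and permissions are tracked per file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
enum gmem_perm {
	GMEM_READ  = 1 << 0,
	GMEM_WRITE = 1 << 1,
};

/* The inode stands in for the backing pages and may outlive any one VM. */
struct gmem_inode {
	int nr_files;	/* number of files (views) still referencing the inode */
	bool freed;	/* set once the last file has been released */
};

/* A file is one VM's view of the inode, with its own permissions. */
struct gmem_file {
	struct gmem_inode *inode;
	int vm_id;		/* each file is bound to exactly one VM */
	unsigned int perms;	/* permissions are tracked per file */
};

/* Create a new view of @inode for @vm_id, e.g. when sharing or donating. */
static struct gmem_file *gmem_file_create(struct gmem_inode *inode, int vm_id,
					  unsigned int perms)
{
	struct gmem_file *f = malloc(sizeof(*f));

	if (!f)
		return NULL;

	f->inode = inode;
	f->vm_id = vm_id;
	f->perms = perms;
	inode->nr_files++;
	return f;
}

/* Drop one view; the backing storage goes away only with the last view. */
static void gmem_file_release(struct gmem_file *f)
{
	struct gmem_inode *inode = f->inode;

	assert(inode->nr_files > 0);
	if (--inode->nr_files == 0)
		inode->freed = true;
	free(f);
}

static bool gmem_may_write(const struct gmem_file *f)
{
	return (f->perms & GMEM_WRITE) != 0;
}
```

In this model the original owner would hold a file with GMEM_READ | GMEM_WRITE, a second VM could be handed a GMEM_READ-only file on the same inode, and releasing the owner's file would leave the inode (and thus the memory) alive until the second file is released as well.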
Copy+pasting a relevant snippet from a lengthier response in a different thread[*]:

  Conceptually, I think KVM should bind to the file.  The inode is effectively
  the raw underlying physical storage, while the file is the VM's view of that
  storage.

  Practically, I think that gives us a clean, intuitive way to handle intra-host
  migration.  Rather than transfer ownership of the file, instantiate a new file
  for the target VM, using the gmem inode from the source VM, i.e. create a hard
  link.  That'd probably require new uAPI, but I don't think that will be hugely
  problematic.  KVM would need to ensure the new VM's guest_memfd can't be mapped
  until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
  memslots/bindings are identical), but that should be easy enough to enforce.

  That way, a VM, its memslots, and its SPTEs are tied to the file, while
  allowing the memory and the *contents* of memory to outlive the VM, i.e. be
  effectively transferred to the new target VM.  And we'll maintain the invariant
  that each guest_memfd is bound 1:1 with a single VM.

  As above, that should also help us draw the line between mapping memory into a
  VM (file), and freeing/reclaiming the memory (inode).

  There will be extra complexity/overhead as we'll have to play nice with the
  possibility of multiple files per inode, e.g. to zap mappings across all files
  when punching a hole, but the extra complexity is quite small, e.g. we can use
  address_space.private_list to keep track of the guest_memfd instances
  associated with the inode.

  Setting aside TDX and SNP for the moment, as it's not clear how they'll support
  memory that is "private" but shared between multiple VMs, I think per-VM files
  would work well for sharing gmem between two VMs.  E.g. would allow a given page
  to be bound to a different gfn for each VM, would allow having different
  permissions for each file (e.g. to allow fallocate() only from the original
  owner).
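
As a rough illustration of the "multiple files per inode" bookkeeping described in the snippet, the following is a userspace toy model only. The structures and helpers are hypothetical, and the singly linked list is merely a stand-in for whatever address_space.private_list would track in a real implementation. It shows two things: the same page offset bound to a different gfn in each file, and a hole punch that zaps the mapping in every file before dropping the backing page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GMEM_NR_PAGES	8
#define GFN_NONE	((unsigned long)-1)

struct gmem_file;

/* Hypothetical toy inode: the backing pages plus the list of views. */
struct gmem_inode {
	bool present[GMEM_NR_PAGES];	/* which page offsets are allocated */
	struct gmem_file *files;	/* all files on this inode (list stand-in) */
};

/* Hypothetical toy file: one VM's bindings for the inode's pages. */
struct gmem_file {
	struct gmem_file *next;
	int vm_id;
	unsigned long gfn[GMEM_NR_PAGES];	/* per-file: page offset -> gfn */
	bool mapped[GMEM_NR_PAGES];		/* is the page mapped in this VM? */
};

/* Link a new view for @vm_id onto the inode's list of files. */
static void gmem_file_attach(struct gmem_inode *inode, struct gmem_file *f,
			     int vm_id)
{
	size_t i;

	assert(inode && f);
	f->vm_id = vm_id;
	for (i = 0; i < GMEM_NR_PAGES; i++) {
		f->gfn[i] = GFN_NONE;
		f->mapped[i] = false;
	}
	f->next = inode->files;
	inode->files = f;
}

/* Map page @pgoff at @gfn in this file's VM, allocating it on first use. */
static bool gmem_map(struct gmem_inode *inode, struct gmem_file *f,
		     size_t pgoff, unsigned long gfn)
{
	if (pgoff >= GMEM_NR_PAGES)
		return false;

	inode->present[pgoff] = true;
	f->gfn[pgoff] = gfn;
	f->mapped[pgoff] = true;
	return true;
}

/*
 * Punch a hole at @pgoff: zap the mapping in every file on the inode, then
 * drop the backing page.  Returns the number of mappings that were zapped.
 */
static int gmem_punch_hole(struct gmem_inode *inode, size_t pgoff)
{
	struct gmem_file *f;
	int zapped = 0;

	if (pgoff >= GMEM_NR_PAGES)
		return 0;

	for (f = inode->files; f; f = f->next) {
		if (f->mapped[pgoff]) {
			f->mapped[pgoff] = false;
			zapped++;
		}
	}
	inode->present[pgoff] = false;
	return zapped;
}
```

The point of the model is only that the walk over all files is cheap to express once the inode keeps a list of its files; it says nothing about locking, invalidation ordering, or how TDX/SNP would fit in.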
[*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS at google.com From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pg1-f202.google.com (mail-pg1-f202.google.com [209.85.215.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4A6CA27124 for ; Thu, 27 Jul 2023 17:13:10 +0000 (UTC) Received: by mail-pg1-f202.google.com with SMTP id 41be03b00d2f7-563ab574cb5so784349a12.1 for ; Thu, 27 Jul 2023 10:13:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=uR617wLpg6qNB677nCTpj/mRcHCn7jeMQ6T975dwFWafb05LBOA5GXBHggNBovfIzX O/7M392ccdTOGZqesE1qC7szLmXDUVbpTkdUSAnDKesUZGTHf6YrHw7iRdcLemcslWSD KrRYixDmKkhke5BhBNDcGGoexHbctICnVSugeNNjkFTNcd3g+7vTx3m4xPX98nUwGt3/ mWhf+uFvINzhg7lFN3O3GR8jmx04AReHHh1Ek1sP09ZwG+oxGwlWe7n39MiREf5a36wi QECqs1dJt2n4+euTMFmAQH0Oiz0RKLy4HgS6MVL8RVa78gxf6JRosPxL/mvaakJouVeF ho3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=BBYAgOh5wP3ZP/eAa/ITW3uhk20dfTZvr7wVi8afX2BWV+UAcb+GR2mOt0AVFlpFYe oKeZf4VbBFUu2ZmwFEpK9HAL0OfaEvSUxNFHMIpoqDtH5NjPcdL9lWBonX0HphUaQj/T S1J53jNzQOCyHbL5mK8uwvQ0jqq2PBsn89LUNfP+bSoC6uEz1X4jMbq61INT3mCXUq2U cJJVT6hPOZY4f9g8sfoDTXIo3dei7XNAtDVUgQbsov1dKy92JjRej0Kmp6Zlu5qzrFHA cCD12MKr2JBaz5t2n/tn4MMMq4j1DX7Qcy37JoE7DMeAD995heZcJlk5FP2T6xYIinDt ma9w== X-Gm-Message-State: ABy/qLbtroOVnw5kHbVXz0ZprtPvY1x0YrBoUXV30KLk+PvoC1Qoo8Hz 86c9QbJ7P4DT01lDQkuzq/fYLIPVa4I= X-Google-Smtp-Source: 
APBJJlEiCX9DyxliShqr2xPxTiyej7fUi+PkS+4LO77eNeghg1Fiv4b0fvD5YH2w9mP2/cy4q9w9vNXHRRI= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:903:22c6:b0:1b5:2b14:5f2c with SMTP id y6-20020a17090322c600b001b52b145f2cmr24803plg.4.1690477989357; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) Date: Thu, 27 Jul 2023 10:13:07 -0700 In-Reply-To: Precedence: bulk X-Mailing-List: kvmarm@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com> Message-ID: Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: Fuad Tabba Cc: Paolo Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , "Matthew Wilcox (Oracle)" , Andrew Morton , Paul Moore , James Morris , "Serge E. Hallyn" , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng , Jarkko Sakkinen , Yu Zhang , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , Vlastimil Babka , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" Content-Type: text/plain; charset="us-ascii" On Thu, Jul 27, 2023, Fuad Tabba wrote: > Hi Sean, > > > ... 
> > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp, > > case KVM_GET_STATS_FD: > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > break; > > + case KVM_CREATE_GUEST_MEMFD: { > > + struct kvm_create_guest_memfd guest_memfd; > > + > > + r = -EFAULT; > > + if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd))) > > + goto out; > > + > > + r = kvm_gmem_create(kvm, &guest_memfd); > > + break; > > + } > > I'm thinking line of sight here, by having this as a vm ioctl (rather > than a system iocl), would it complicate making it possible in the > future to share/donate memory between VMs? Maybe, but I hope not? There would still be a primary owner of the memory, i.e. the memory would still need to be allocated in the context of a specific VM. And the primary owner should be able to restrict privileges, e.g. allow a different VM to read but not write memory. My current thinking is to (a) tie the lifetime of the backing pages to the inode, i.e. allow allocations to outlive the original VM, and (b) create a new file each time memory is shared/donated with a different VM (or other entity in the kernel). That should make it fairly straightforward to provide different permissions, e.g. track them per-file, and I think should also avoid the need to change the memslot binding logic since each VM would have it's own view/bindings. Copy+pasting a relevant snippet from a lengthier response in a different thread[*]: Conceptually, I think KVM should to bind to the file. The inode is effectively the raw underlying physical storage, while the file is the VM's view of that storage. Practically, I think that gives us a clean, intuitive way to handle intra-host migration. Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. 
KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce. That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transfered to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM. As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode). There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode. Setting aside TDX and SNP for the moment, as it's not clear how they'll support memory that is "private" but shared between multiple VMs, I think per-VM files would work well for sharing gmem between two VMs. E.g. would allow a give page to be bound to a different gfn for each VM, would allow having different permissions for each file (e.g. to allow fallocate() only from the original owner). 
[*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id DB639C0015E for ; Thu, 27 Jul 2023 17:13:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:Message-ID: References:Mime-Version:In-Reply-To:Date:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=HRVct3AuRXnXkqQTmExkXMvMCpmF1Yu2FDhgwXnm9y4=; b=zltbtfkhf6bxuugbPhhV3uYcQ5 uvIyNjJcUddvplq74GsME5720G9by2bz1S2U6o0JZt39e6u4s4KZ/SMf0Emjo8cItXwQGmdscn05x 92REf/QIHxB+FOLXFk9d9oNY3lK9S1hBXjHgykV+gbytmAUaGCNBHUfGBIxfz48RlTyAorELeQ8Ba Bgz2d4uDM37lrsdWpBWlqeh6oaFvW81PaIzv4SjO+6E4i2m0WS+afI2ORDNw9zYuX8BolifxY0eDz Gy1RCsqWMdgP0om7gXQO9iSqtbIy1nzEG5GyPl2syArAcYeDzf+zUHeB7v8+WsyMuuW3BOq6Ng7bI irmlpJHA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qP4Y9-00HAKj-1j; Thu, 27 Jul 2023 17:13:17 +0000 Received: from mail-pl1-x649.google.com ([2607:f8b0:4864:20::649]) by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux)) id 1qP4Y3-00HAE7-1R for linux-riscv@lists.infradead.org; Thu, 27 Jul 2023 17:13:15 +0000 Received: by mail-pl1-x649.google.com with SMTP id d9443c01a7336-1bbb34b091dso8228795ad.0 for ; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1690477989; x=1691082789; 
h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=uR617wLpg6qNB677nCTpj/mRcHCn7jeMQ6T975dwFWafb05LBOA5GXBHggNBovfIzX O/7M392ccdTOGZqesE1qC7szLmXDUVbpTkdUSAnDKesUZGTHf6YrHw7iRdcLemcslWSD KrRYixDmKkhke5BhBNDcGGoexHbctICnVSugeNNjkFTNcd3g+7vTx3m4xPX98nUwGt3/ mWhf+uFvINzhg7lFN3O3GR8jmx04AReHHh1Ek1sP09ZwG+oxGwlWe7n39MiREf5a36wi QECqs1dJt2n4+euTMFmAQH0Oiz0RKLy4HgS6MVL8RVa78gxf6JRosPxL/mvaakJouVeF ho3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=bA7px1pckdt3LaaFaIEpKiH8/NMDEtEyv++b8jk+GaaLiDtqajMaWZut+2UQ3aJplD eovehCls3P+wino8U2kCz8dxiiw6D+AyrtSdBT6SqSTmGnuRp7HRtlVM3ktb5OGfMbI8 B0KsHR/+oL1hLFGTDtVZQKHFnYEhLQfMp4PWO6C+0sr88LAd7OAklAktvCstOVKyFDtU M945UiNvWQXwXGcqWf4by4tFSVATOo7G2zdT5uDNk2mc7exnm3HvnWX2GkzeYTTkytH9 w//oG/Prp1T585TZQr/fnQsWZl2VJJFv0lp1aIUollR9LRlWDVBChtZPrIy3Sp8zKHOl fV4g== X-Gm-Message-State: ABy/qLbu2FoGlQKQ7Kkgj7DuDVJWdMrIT1V5Dtcx9TMrYVoVQYy4hXnU Lyek/ldqYE64qTccLUzv0KdsgpLLlkg= X-Google-Smtp-Source: APBJJlEiCX9DyxliShqr2xPxTiyej7fUi+PkS+4LO77eNeghg1Fiv4b0fvD5YH2w9mP2/cy4q9w9vNXHRRI= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:903:22c6:b0:1b5:2b14:5f2c with SMTP id y6-20020a17090322c600b001b52b145f2cmr24803plg.4.1690477989357; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) Date: Thu, 27 Jul 2023 10:13:07 -0700 In-Reply-To: Mime-Version: 1.0 References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com> Message-ID: Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: Fuad Tabba Cc: Paolo 
Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , "Matthew Wilcox (Oracle)" , Andrew Morton , Paul Moore , James Morris , "Serge E. Hallyn" , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng , Jarkko Sakkinen , Yu Zhang , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , Vlastimil Babka , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230727_101311_519010_9DCE968B X-CRM114-Status: GOOD ( 25.93 ) X-BeenThere: linux-riscv@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-riscv" Errors-To: linux-riscv-bounces+linux-riscv=archiver.kernel.org@lists.infradead.org On Thu, Jul 27, 2023, Fuad Tabba wrote: > Hi Sean, > > > ... > > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp, > > case KVM_GET_STATS_FD: > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > break; > > + case KVM_CREATE_GUEST_MEMFD: { > > + struct kvm_create_guest_memfd guest_memfd; > > + > > + r = -EFAULT; > > + if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd))) > > + goto out; > > + > > + r = kvm_gmem_create(kvm, &guest_memfd); > > + break; > > + } > > I'm thinking line of sight here, by having this as a vm ioctl (rather > than a system iocl), would it complicate making it possible in the > future to share/donate memory between VMs? Maybe, but I hope not? 
There would still be a primary owner of the memory, i.e. the memory would still need to be allocated in the context of a specific VM. And the primary owner should be able to restrict privileges, e.g. allow a different VM to read but not write memory. My current thinking is to (a) tie the lifetime of the backing pages to the inode, i.e. allow allocations to outlive the original VM, and (b) create a new file each time memory is shared/donated with a different VM (or other entity in the kernel). That should make it fairly straightforward to provide different permissions, e.g. track them per-file, and I think should also avoid the need to change the memslot binding logic since each VM would have it's own view/bindings. Copy+pasting a relevant snippet from a lengthier response in a different thread[*]: Conceptually, I think KVM should to bind to the file. The inode is effectively the raw underlying physical storage, while the file is the VM's view of that storage. Practically, I think that gives us a clean, intuitive way to handle intra-host migration. Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce. That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transfered to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM. As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode). 
There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode. Setting aside TDX and SNP for the moment, as it's not clear how they'll support memory that is "private" but shared between multiple VMs, I think per-VM files would work well for sharing gmem between two VMs. E.g. would allow a give page to be bound to a different gfn for each VM, would allow having different permissions for each file (e.g. to allow fallocate() only from the original owner). [*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com _______________________________________________ linux-riscv mailing list linux-riscv@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-riscv From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.ozlabs.org (lists.ozlabs.org [112.213.38.117]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 905F7C001E0 for ; Thu, 27 Jul 2023 17:14:55 +0000 (UTC) Authentication-Results: lists.ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=google.com header.i=@google.com header.a=rsa-sha256 header.s=20221208 header.b=uR617wLp; dkim-atps=neutral Received: from boromir.ozlabs.org (localhost [IPv6:::1]) by lists.ozlabs.org (Postfix) with ESMTP id 4RBcp60RY5z3cWr for ; Fri, 28 Jul 2023 03:14:54 +1000 (AEST) Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=google.com header.i=@google.com header.a=rsa-sha256 header.s=20221208 header.b=uR617wLp; dkim-atps=neutral Authentication-Results: 
lists.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=flex--seanjc.bounces.google.com (client-ip=2607:f8b0:4864:20::64a; helo=mail-pl1-x64a.google.com; envelope-from=3paxczaykdcgwierngksskpi.gsqpmrybttg-hizpmwxw.sdpefw.svk@flex--seanjc.bounces.google.com; receiver=lists.ozlabs.org) Received: from mail-pl1-x64a.google.com (mail-pl1-x64a.google.com [IPv6:2607:f8b0:4864:20::64a]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by lists.ozlabs.org (Postfix) with ESMTPS id 4RBcmC5b2sz3c3s for ; Fri, 28 Jul 2023 03:13:13 +1000 (AEST) Received: by mail-pl1-x64a.google.com with SMTP id d9443c01a7336-1bbb34b091dso8228825ad.0 for ; Thu, 27 Jul 2023 10:13:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=uR617wLpg6qNB677nCTpj/mRcHCn7jeMQ6T975dwFWafb05LBOA5GXBHggNBovfIzX O/7M392ccdTOGZqesE1qC7szLmXDUVbpTkdUSAnDKesUZGTHf6YrHw7iRdcLemcslWSD KrRYixDmKkhke5BhBNDcGGoexHbctICnVSugeNNjkFTNcd3g+7vTx3m4xPX98nUwGt3/ mWhf+uFvINzhg7lFN3O3GR8jmx04AReHHh1Ek1sP09ZwG+oxGwlWe7n39MiREf5a36wi QECqs1dJt2n4+euTMFmAQH0Oiz0RKLy4HgS6MVL8RVa78gxf6JRosPxL/mvaakJouVeF ho3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=iAq8GxUB1jOHdgFWJvOih3E7CMF7TlK1whiMQqVnb+XZN9omttPmHKhVMZeJKB5L1R FwVaOjSvRbki7XG91G7ry/9OrGRTLrpdx7tUNWEZqZFSnCd6ivFA/M/vhxwyTuL6g8/N vbonYzQqi4+TwFkAYTbOAjDlZ9+16qzjuSm3uKDfMuld5azt/E15wGAWtZV2bpMHN+gL 
ECjNI747csLuU7Y0vugrxKG1S0G4rKwSBnZfEPsjxuNHVCJocpRHdYqzACtE/YkYqas5 uWVQgicXG2NEiAbQX3+S+EydmJy/jy7H9S7UYF8u6J+UlOEXoxI2oKpUi9yBR8A22y8E ntFw== X-Gm-Message-State: ABy/qLahSQf97g0zhpZ/+fdha2EPBntvY8iJPO1vbtYaRRbwMDY3L9g4 bKSdxLGykdjzpo7V4nqNepERgLuZj9k= X-Google-Smtp-Source: APBJJlEiCX9DyxliShqr2xPxTiyej7fUi+PkS+4LO77eNeghg1Fiv4b0fvD5YH2w9mP2/cy4q9w9vNXHRRI= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:903:22c6:b0:1b5:2b14:5f2c with SMTP id y6-20020a17090322c600b001b52b145f2cmr24803plg.4.1690477989357; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) Date: Thu, 27 Jul 2023 10:13:07 -0700 In-Reply-To: Mime-Version: 1.0 References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com> Message-ID: Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: Fuad Tabba Content-Type: text/plain; charset="us-ascii" X-BeenThere: linuxppc-dev@lists.ozlabs.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Linux on PowerPC Developers Mail List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: kvm@vger.kernel.org, David Hildenbrand , Yu Zhang , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chao Peng , linux-riscv@lists.infradead.org, Isaku Yamahata , Paul Moore , Marc Zyngier , Huacai Chen , James Morris , "Matthew Wilcox \(Oracle\)" , Wang , Vlastimil Babka , Jarkko Sakkinen , "Serge E. Hallyn" , Maciej Szmigiero , Albert Ou , Michael Roth , Ackerley Tng , Paul Walmsley , kvmarm@lists.linux.dev, linux-arm-kernel@lists.infradead.org, Quentin Perret , Liam Merwick , linux-mips@vger.kernel.org, Oliver Upton , linux-security-module@vger.kernel.org, Palmer Dabbelt , kvm-riscv@lists.infradead.org, Anup Patel , linux-fsdevel@vger.kernel.org, Paolo Bonzini , Andrew Morton , Vishal Annapurve , linuxppc-dev@lists.ozlabs.org, "Kirill A . 
Shutemov" Errors-To: linuxppc-dev-bounces+linuxppc-dev=archiver.kernel.org@lists.ozlabs.org Sender: "Linuxppc-dev" On Thu, Jul 27, 2023, Fuad Tabba wrote: > Hi Sean, > > > ... > > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp, > > case KVM_GET_STATS_FD: > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > break; > > + case KVM_CREATE_GUEST_MEMFD: { > > + struct kvm_create_guest_memfd guest_memfd; > > + > > + r = -EFAULT; > > + if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd))) > > + goto out; > > + > > + r = kvm_gmem_create(kvm, &guest_memfd); > > + break; > > + } > > I'm thinking line of sight here, by having this as a vm ioctl (rather > than a system iocl), would it complicate making it possible in the > future to share/donate memory between VMs? Maybe, but I hope not? There would still be a primary owner of the memory, i.e. the memory would still need to be allocated in the context of a specific VM. And the primary owner should be able to restrict privileges, e.g. allow a different VM to read but not write memory. My current thinking is to (a) tie the lifetime of the backing pages to the inode, i.e. allow allocations to outlive the original VM, and (b) create a new file each time memory is shared/donated with a different VM (or other entity in the kernel). That should make it fairly straightforward to provide different permissions, e.g. track them per-file, and I think should also avoid the need to change the memslot binding logic since each VM would have it's own view/bindings. Copy+pasting a relevant snippet from a lengthier response in a different thread[*]: Conceptually, I think KVM should to bind to the file. The inode is effectively the raw underlying physical storage, while the file is the VM's view of that storage. Practically, I think that gives us a clean, intuitive way to handle intra-host migration. 
Rather than transfer ownership of the file, instantiate a new file for the target VM, using the gmem inode from the source VM, i.e. create a hard link. That'd probably require new uAPI, but I don't think that will be hugely problematic. KVM would need to ensure the new VM's guest_memfd can't be mapped until KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the memslots/bindings are identical), but that should be easy enough to enforce. That way, a VM, its memslots, and its SPTEs are tied to the file, while allowing the memory and the *contents* of memory to outlive the VM, i.e. be effectively transfered to the new target VM. And we'll maintain the invariant that each guest_memfd is bound 1:1 with a single VM. As above, that should also help us draw the line between mapping memory into a VM (file), and freeing/reclaiming the memory (inode). There will be extra complexity/overhead as we'll have to play nice with the possibility of multiple files per inode, e.g. to zap mappings across all files when punching a hole, but the extra complexity is quite small, e.g. we can use address_space.private_list to keep track of the guest_memfd instances associated with the inode. Setting aside TDX and SNP for the moment, as it's not clear how they'll support memory that is "private" but shared between multiple VMs, I think per-VM files would work well for sharing gmem between two VMs. E.g. would allow a give page to be bound to a different gfn for each VM, would allow having different permissions for each file (e.g. to allow fallocate() only from the original owner). 
[*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 71225C0015E for ; Thu, 27 Jul 2023 17:13:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:Message-ID: References:Mime-Version:In-Reply-To:Date:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=N1jXdWfKemBNt+nnNJcsTgZfNLlY+HQ55h+CoxHVXbw=; b=Mbj+OP0re/mwTsbZ14ZO8sgoTj P3xkQUnyV/2uyIFlrLKe3Mx9TQnRUDVD0PJKbVTrtLyVNY5liSz3UewBOcnWoQRMONMCici0grqPD /DEYjQd9iEKX/RGv6KtARMP8uZNC+XWq0KyRjf18iDBWPch0vMYnzTeySPog5IYihCz6+GyCbiXkr EPfIq/GMoZ8a7HKvlzDWkadfC63tazi4WfkIMDhcEx9XrrVNBu9QrE9Z1+0EZcNcAEJotFr9t/868 5U0wfLLDA2uYQ7l1huDctqS1sqmWSrMFbR056+0qWefe9UxXQb19ZPOan1SwHwwpsrN3cPLWxAccQ RFkvNNRA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qP4YA-00HALX-1N; Thu, 27 Jul 2023 17:13:18 +0000 Received: from mail-pl1-x64a.google.com ([2607:f8b0:4864:20::64a]) by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux)) id 1qP4Y3-00HAE6-1Z for linux-arm-kernel@lists.infradead.org; Thu, 27 Jul 2023 17:13:15 +0000 Received: by mail-pl1-x64a.google.com with SMTP id d9443c01a7336-1bb98659f3cso8122265ad.3 for ; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1690477989; x=1691082789; 
h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=uR617wLpg6qNB677nCTpj/mRcHCn7jeMQ6T975dwFWafb05LBOA5GXBHggNBovfIzX O/7M392ccdTOGZqesE1qC7szLmXDUVbpTkdUSAnDKesUZGTHf6YrHw7iRdcLemcslWSD KrRYixDmKkhke5BhBNDcGGoexHbctICnVSugeNNjkFTNcd3g+7vTx3m4xPX98nUwGt3/ mWhf+uFvINzhg7lFN3O3GR8jmx04AReHHh1Ek1sP09ZwG+oxGwlWe7n39MiREf5a36wi QECqs1dJt2n4+euTMFmAQH0Oiz0RKLy4HgS6MVL8RVa78gxf6JRosPxL/mvaakJouVeF ho3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690477989; x=1691082789; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=F7oDMpfc0bW6NSyfuHNNNYlN2HFPFEuCSHvjU51+NBs=; b=Nd5FULnbCNXJTlas+uzoGu6QUvcNwsB5EAHNkRUFb9JSJPjjTMh6vi1j8q4xAcOn+y ElR9KuSM5om+VifPwAcrZQTn1yHrf2aM5wthfWyoZFWoSQeySXXZIBqEMYdSaCMXaGSS 94am3IqB2XXnXrv8O3Wng4WBnJSvrkBfozAQ/bmH3O1DoCk7IBk8D3e6hRqJcn+w5nWs A7kovxvNg5UODpXhfs2S7tPyzyTL/IRHv2q7I4jgW2gs5GIIvLkAmhNPaDT5Q6xHE+ja lrRC1/RwuKUOc9OoU+7h62YJPW3oIMJPrZQ5AwPp5YG8CNje51G2YK4uMBAO6a8PrGLc SUOA== X-Gm-Message-State: ABy/qLYBBmVJo7fcE9xaAzyMpApZ+S+GgKuh8phRbVHOxgMCiAWPSYUQ lJC3DeMn5uk+WCogkJvc+ql8Pq4F9jE= X-Google-Smtp-Source: APBJJlEiCX9DyxliShqr2xPxTiyej7fUi+PkS+4LO77eNeghg1Fiv4b0fvD5YH2w9mP2/cy4q9w9vNXHRRI= X-Received: from zagreus.c.googlers.com ([fda3:e722:ac3:cc00:7f:e700:c0a8:5c37]) (user=seanjc job=sendgmr) by 2002:a17:903:22c6:b0:1b5:2b14:5f2c with SMTP id y6-20020a17090322c600b001b52b145f2cmr24803plg.4.1690477989357; Thu, 27 Jul 2023 10:13:09 -0700 (PDT) Date: Thu, 27 Jul 2023 10:13:07 -0700 In-Reply-To: Mime-Version: 1.0 References: <20230718234512.1690985-1-seanjc@google.com> <20230718234512.1690985-13-seanjc@google.com> Message-ID: Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory From: Sean Christopherson To: Fuad Tabba Cc: Paolo 
Bonzini , Marc Zyngier , Oliver Upton , Huacai Chen , Michael Ellerman , Anup Patel , Paul Walmsley , Palmer Dabbelt , Albert Ou , "Matthew Wilcox (Oracle)" , Andrew Morton , Paul Moore , James Morris , "Serge E. Hallyn" , kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng , Jarkko Sakkinen , Yu Zhang , Vishal Annapurve , Ackerley Tng , Maciej Szmigiero , Vlastimil Babka , David Hildenbrand , Quentin Perret , Michael Roth , Wang , Liam Merwick , Isaku Yamahata , "Kirill A . Shutemov" X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230727_101311_536314_430B0068 X-CRM114-Status: GOOD ( 27.20 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org On Thu, Jul 27, 2023, Fuad Tabba wrote: > Hi Sean, > > > ... > > > @@ -5134,6 +5167,16 @@ static long kvm_vm_ioctl(struct file *filp, > > case KVM_GET_STATS_FD: > > r = kvm_vm_ioctl_get_stats_fd(kvm); > > break; > > + case KVM_CREATE_GUEST_MEMFD: { > > + struct kvm_create_guest_memfd guest_memfd; > > + > > + r = -EFAULT; > > + if (copy_from_user(&guest_memfd, argp, sizeof(guest_memfd))) > > + goto out; > > + > > + r = kvm_gmem_create(kvm, &guest_memfd); > > + break; > > + } > > I'm thinking line of sight here, by having this as a vm ioctl (rather > than a system iocl), would it complicate making it possible in the > future to share/donate memory between VMs? Maybe, but I hope not? 
There would still be a primary owner of the memory, i.e. the memory would
still need to be allocated in the context of a specific VM.  And the
primary owner should be able to restrict privileges, e.g. allow a different
VM to read but not write memory.

My current thinking is to (a) tie the lifetime of the backing pages to the
inode, i.e. allow allocations to outlive the original VM, and (b) create a
new file each time memory is shared/donated with a different VM (or other
entity in the kernel).

That should make it fairly straightforward to provide different
permissions, e.g. track them per-file, and I think should also avoid the
need to change the memslot binding logic since each VM would have its own
view/bindings.

Copy+pasting a relevant snippet from a lengthier response in a different
thread[*]:

  Conceptually, I think KVM should bind to the file.  The inode is
  effectively the raw underlying physical storage, while the file is the
  VM's view of that storage.

  Practically, I think that gives us a clean, intuitive way to handle
  intra-host migration.  Rather than transfer ownership of the file,
  instantiate a new file for the target VM, using the gmem inode from the
  source VM, i.e. create a hard link.  That'd probably require new uAPI,
  but I don't think that will be hugely problematic.  KVM would need to
  ensure the new VM's guest_memfd can't be mapped until
  KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM (which would also need to verify the
  memslots/bindings are identical), but that should be easy enough to
  enforce.

  That way, a VM, its memslots, and its SPTEs are tied to the file, while
  allowing the memory and the *contents* of memory to outlive the VM, i.e.
  be effectively transferred to the new target VM.  And we'll maintain the
  invariant that each guest_memfd is bound 1:1 with a single VM.

  As above, that should also help us draw the line between mapping memory
  into a VM (file), and freeing/reclaiming the memory (inode).

  There will be extra complexity/overhead as we'll have to play nice with
  the possibility of multiple files per inode, e.g. to zap mappings across
  all files when punching a hole, but the extra complexity is quite small,
  e.g. we can use address_space.private_list to keep track of the
  guest_memfd instances associated with the inode.

  Setting aside TDX and SNP for the moment, as it's not clear how they'll
  support memory that is "private" but shared between multiple VMs, I think
  per-VM files would work well for sharing gmem between two VMs.  E.g. it
  would allow a given page to be bound to a different gfn for each VM, and
  would allow having different permissions for each file (e.g. to allow
  fallocate() only from the original owner).

[*] https://lore.kernel.org/all/ZLGiEfJZTyl7M8mS@google.com
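
To make the file vs. inode split above concrete, here is a minimal userspace
sketch of the idea, not kernel code and not what the series implements.
Every name in it (gmem_inode, gmem_file, gmem_file_attach(), gmem_bind(),
gmem_punch_hole()) is hypothetical, the fixed-size files[] array is only a
stand-in for tracking instances via address_space.private_list, and the
mapped[] array is a stand-in for a VM's SPTEs.  The inode owns the pages,
each file carries one VM's bindings and permissions, and punching a hole
zaps the range in every file before freeing it in the inode.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GMEM_NR_PAGES	8
#define GMEM_MAX_FILES	4

struct gmem_file;

/* Hypothetical: the inode owns the backing pages and outlives any one VM. */
struct gmem_inode {
	bool allocated[GMEM_NR_PAGES];
	/* Stand-in for tracking files via address_space.private_list. */
	struct gmem_file *files[GMEM_MAX_FILES];
	size_t nr_files;
};

/* Hypothetical: one file per VM, i.e. that VM's view of the inode. */
struct gmem_file {
	struct gmem_inode *inode;
	int vm_id;
	bool may_fallocate;		/* per-file permission */
	bool mapped[GMEM_NR_PAGES];	/* stand-in for the VM's SPTEs */
	uint64_t gfn[GMEM_NR_PAGES];	/* per-VM gfn of each page */
};

/* Create a new view of an existing inode for @vm_id. */
static int gmem_file_attach(struct gmem_file *f, struct gmem_inode *inode,
			    int vm_id, bool may_fallocate)
{
	if (inode->nr_files == GMEM_MAX_FILES)
		return -ENOSPC;

	*f = (struct gmem_file){
		.inode = inode,
		.vm_id = vm_id,
		.may_fallocate = may_fallocate,
	};
	inode->files[inode->nr_files++] = f;
	return 0;
}

/* Map @page at @gfn in this file's VM only; allocates the page on demand. */
static int gmem_bind(struct gmem_file *f, size_t page, uint64_t gfn)
{
	if (page >= GMEM_NR_PAGES)
		return -EINVAL;

	f->inode->allocated[page] = true;
	f->gfn[page] = gfn;
	f->mapped[page] = true;
	return 0;
}

/* Free [start, start + nr) in the inode, zapping mappings in every file. */
static int gmem_punch_hole(struct gmem_file *f, size_t start, size_t nr)
{
	struct gmem_inode *inode = f->inode;
	size_t i, page;

	if (!f->may_fallocate)
		return -EPERM;
	if (start > GMEM_NR_PAGES || nr > GMEM_NR_PAGES - start)
		return -EINVAL;

	for (page = start; page < start + nr; page++) {
		for (i = 0; i < inode->nr_files; i++)
			inode->files[i]->mapped[page] = false;
		inode->allocated[page] = false;
	}
	return 0;
}
```

In this model, attaching an owner file with may_fallocate set and a second
file without it lets both VMs bind the same page at different gfns, while
only the owner can punch a hole, and doing so drops the mapping from both
views.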