Date: Wed, 20 Sep 2023 16:44:47 -0700
In-Reply-To: <20230918163647.m6bjgwusc7ww5tyu@amd.com>
References: <20230914015531.1419405-1-seanjc@google.com> <20230914015531.1419405-15-seanjc@google.com> <20230918163647.m6bjgwusc7ww5tyu@amd.com>
Subject: Re: [RFC PATCH v12 14/33] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
From: Sean Christopherson
To: Michael Roth
Cc: Paolo Bonzini, Marc Zyngier, Oliver Upton, Huacai Chen, Michael Ellerman, Anup Patel, Paul Walmsley, Palmer Dabbelt, Albert Ou, "Matthew Wilcox (Oracle)", Andrew Morton, Paul Moore, James Morris, "Serge E. Hallyn", kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev, linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Chao Peng, Fuad Tabba, Jarkko Sakkinen, Anish Moorthy, Yu Zhang, Isaku Yamahata, Xu Yilun, Vlastimil Babka, Vishal Annapurve, Ackerley Tng, Maciej Szmigiero, David Hildenbrand, Quentin Perret, Wang, Liam Merwick, Isaku Yamahata, "Kirill A. Shutemov"
Shutemov" Content-Type: text/plain; charset="us-ascii" X-Rspamd-Queue-Id: 3A514100006 X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: cgt648w895skgnb6jbj4h9xmac3obcw8 X-HE-Tag: 1695253490-449504 X-HE-Meta: U2FsdGVkX18Wy6bqsJSptcGPyUe7HaUqYYFoy+SSLSt7NYWuuK27e1L3PuPUyrNKLMRdBjSCc+pkDpDH9tDaFwsjKJcd4t9g77eyXTWcijEGkhpPMt4SOcgXv/jkN8CQej7LDcxmLb+V+i1Ri9/SSO5ujItH0nR0WthiT+FLXXtPUDcZVdb9hs46noHjQFbbu62D/NBQ7owu+2siu4FKSbxsJR8z6Bo1l5YA8kYdBXQOVKLDD3KUbhDqFqelQ419qUn1JfRBhJ03nqR98Cr7XxTOh79H9QMDBvLD8OnMm6segVtWSilQw47a246eGcdtsl2WbC/UVkaX+oyjS0syvaQY3pH1KsSQ8Whjh0PUc49uSXs5XIWghlZ2Em/MwCaS/z8Sl+3OngAncifWvbfS/3siJA6OHg8Nc67quqtC8d0bNwApoSDphiQfNxKg/7dC5R1+vavE5nn+Qu1AgIVxpwbeliP09/xOmG9uzUPiEMoqXDmObNnrsz+o/AM9j3VC/6Ydm4UUWiGNyXNRqzeLlnklD/k2bVYNg71fVzhZ1hVUx/7WHWqKHtCwL/rDbK5OmQyY8VAo9mdCbniCN7M1tXTRTaJOZ5RnVsMmNmHQ7YFkDgO2f7xw9fqEi/DRctXB2Bg2tAqoHzYUbqNyPgkcZaPs0CpGGGhRnz+FWV7xR+0byMV7gc8h5EUJsJPCmDW0RE6jGX2AuOXt3va2URH+l8R8fZZq8PBxGdgMHJPYy5uPLhpCFY0bTsqwZ6ii+RL6Yj9giwMfiLwUvfDCRoc8728dVeTCTDVqxjzIdG0g4FcO8MiBfRzrsZ5tEWazlIynKum0vM4MAfPb/qYydEJr+7fI55LyyzJEmgXCObOplbiP9t+CqojPlKgY0qbq/DKDKlh+7tAbgHWwjU4hMNDbil+JFe858x4qKImSpBi2+whLtN9TF9g3DVutb2FqfFeI94l8wgxlw+Ka63t6XE8 6BoX3baR PhbmMqnqCM9R549co74JJhgBtXOzKmfg5QXrGnTFnG2ZrZPMXhwUffkNrKHQzsCg88kUTzau0+Nhjy8gaxdZYFTce6yxR0d6zwoMZsurRIGURY24maafxSlwO5hR/ydOGIsDinr3ONuUxty00hxtpxvSU+K9jZMo8+J6+LaKm2P4h61u1hJUQxALPBFsKcLhQTEpY7hphIfYQ456iYW3JVITmx+R1CUNe9HKCDc8tgCLkbqykhLjHzsk6hjy+cQHYq4FN+STEHbwBMu/iXXh+fYsdxo+GrDTF8BHYxKNEGB5//Gf5YrczAI9y5epr4RdR16dY6UVSadpzHWsqTBPnlG1LmaLkxtimWtVnsCIz/YIu0HrFUVMNCXLQePpUOxMLq8UQJhVOpkfrORuO5X/f18ABTrH3b2dlDCqziPYVNbQhJ2kngfsFcztvH8nK6sBUInsC4Y4sEU8QI8Uw1sytAef96fZEK+f4W6UJZiE6Npd0hoYk+AUd16zSCGRRQELxSsKHqSHnp1q70IIhIXTUIfyreYYkOJe34EPA5NezEXWFruDu2n6J9bsJQdiVLW4wjiBXJnJo2pAc2E7MeoDRf2j21jFzfxHlnVeVE6xXGwK2nznK8biQ4piOtjDQLGPXob9BRilIKItjgF64UX/GB0LGXJ5qFknEM/2iiY8Rh/7M1IliBh0CrG8Vw75QuOeVEuZeTVn4UKtn/Lo= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Sep 18, 2023, Michael Roth wrote: > > +static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len) > > +{ > > + struct list_head *gmem_list = &inode->i_mapping->private_list; > > + pgoff_t start = offset >> PAGE_SHIFT; > > + pgoff_t end = (offset + len) >> PAGE_SHIFT; > > + struct kvm_gmem *gmem; > > + > > + /* > > + * Bindings must stable across invalidation to ensure the start+end > > + * are balanced. > > + */ > > + filemap_invalidate_lock(inode->i_mapping); > > + > > + list_for_each_entry(gmem, gmem_list, entry) { > > + kvm_gmem_invalidate_begin(gmem, start, end); > > In v11 we used to call truncate_inode_pages_range() here to drop filemap's > reference on the folio. AFAICT the folios are only getting free'd upon > guest shutdown without this. Was this on purpose? Nope, I just spotted this too. And then after scratching my head for a few minutes, wondering if I was having an -ENOCOFFEE moment, I finally read your mail. *sigh* Looking at my reflog history, I'm pretty sure I deleted the wrong line when removing the truncation from kvm_gmem_error_page(). 
> > +                kvm_gmem_invalidate_end(gmem, start, end);
> > +        }
> > +
> > +        filemap_invalidate_unlock(inode->i_mapping);
> > +
> > +        return 0;
> > +}
> > +
> > +static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
> > +{
> > +        struct address_space *mapping = inode->i_mapping;
> > +        pgoff_t start, index, end;
> > +        int r;
> > +
> > +        /* Dedicated guest is immutable by default. */
> > +        if (offset + len > i_size_read(inode))
> > +                return -EINVAL;
> > +
> > +        filemap_invalidate_lock_shared(mapping);
> 
> We take the filemap lock here, but not for
> kvm_gmem_get_pfn()->kvm_gmem_get_folio(). Is it needed there as well?

No, we specifically do not want to take a rwsem when faulting in guest memory.
Callers of kvm_gmem_get_pfn() *must* guard against concurrent invalidations
via mmu_invalidate_seq and friends.

> > +        /*
> > +         * For simplicity, require the offset into the file and the size of the
> > +         * memslot to be aligned to the largest possible page size used to back
> > +         * the file (same as the size of the file itself).
> > +         */
> > +        if (!kvm_gmem_is_valid_size(offset, flags) ||
> > +            !kvm_gmem_is_valid_size(size, flags))
> > +                goto err;
> 
> I needed to relax this check for SNP. KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
> applies to the entire gmem inode, so it makes sense for userspace to enable
> hugepages if start/end are hugepage-aligned, but QEMU will do things
> like map overlapping regions for ROMs and other things on top of the
> GPA range that the gmem inode was originally allocated for. For instance:
> 
> 692500@1689108688.696338:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
> 692500@1689108688.699802:kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0x100000000 size=0x380000000 ua=0x7fbfdbe00000 ret=0 restricted_fd=19 restricted_offset=0x80000000
> 692500@1689108688.795412:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x0 gpa=0x0 size=0x0 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
> 692500@1689108688.795550:kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0xc0000 ua=0x7fbf5be00000 ret=0 restricted_fd=19 restricted_offset=0x0
> 692500@1689108688.796227:kvm_set_user_memory AddrSpace#0 Slot#6 flags=0x4 gpa=0x100000 size=0x7ff00000 ua=0x7fbf5bf00000 ret=0 restricted_fd=19 restricted_offset=0x100000
> 
> Because of that, the KVM_SET_USER_MEMORY_REGIONs for non-THP-aligned GPAs
> will fail. Maybe instead it should be allowed, and kvm_gmem_get_folio()
> should handle the alignment checks on a case-by-case basis and simply force
> 4k for offsets corresponding to unaligned bindings?

Yeah, I wanted to keep the code simple, but disallowing small bindings/memslots
is probably going to be a deal-breaker. Even though I'm skeptical that QEMU
_needs_ to play these games for SNP guests, not playing nice will make it all
but impossible to use guest_memfd for regular VMs.

And the code isn't really any more complex, so long as we punt on allowing
hugepages on interior sub-ranges.

Compile-tested only, but this?
---
 virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index a819367434e9..dc12e38211df 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -426,20 +426,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
         return err;
 }
 
-static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
-{
-        if (size < 0 || !PAGE_ALIGNED(size))
-                return false;
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-        if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
-            !IS_ALIGNED(size, HPAGE_PMD_SIZE))
-                return false;
-#endif
-
-        return true;
-}
-
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
         loff_t size = args->size;
@@ -452,9 +438,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
         if (flags & ~valid_flags)
                 return -EINVAL;
 
-        if (!kvm_gmem_is_valid_size(size, flags))
+        if (size < 0 || !PAGE_ALIGNED(size))
                 return -EINVAL;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+        if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
+            !IS_ALIGNED(size, HPAGE_PMD_SIZE))
+                return -EINVAL;
+#endif
+
         return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
 }
 
@@ -462,7 +454,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
                   unsigned int fd, loff_t offset)
 {
         loff_t size = slot->npages << PAGE_SHIFT;
-        unsigned long start, end, flags;
+        unsigned long start, end;
         struct kvm_gmem *gmem;
         struct inode *inode;
         struct file *file;
@@ -481,16 +473,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
                 goto err;
 
         inode = file_inode(file);
-        flags = (unsigned long)inode->i_private;
 
-        /*
-         * For simplicity, require the offset into the file and the size of the
-         * memslot to be aligned to the largest possible page size used to back
-         * the file (same as the size of the file itself).
-         */
-        if (!kvm_gmem_is_valid_size(offset, flags) ||
-            !kvm_gmem_is_valid_size(size, flags))
-                goto err;
+        if (offset < 0 || !PAGE_ALIGNED(offset))
+                return -EINVAL;
 
         if (offset + size > i_size_read(inode))
                 goto err;
@@ -591,8 +576,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
         page = folio_file_page(folio, index);
 
         *pfn = page_to_pfn(page);
-        if (max_order)
-                *max_order = compound_order(compound_head(page));
+        if (!max_order)
+                goto success;
+
+        *max_order = compound_order(compound_head(page));
+        if (!*max_order)
+                goto success;
+
+        /*
+         * For simplicity, allow mapping a hugepage if and only if the entire
+         * binding is compatible, i.e. don't bother supporting mapping interior
+         * sub-ranges with hugepages (unless userspace comes up with a *really*
+         * strong use case for needing hugepages within unaligned bindings).
+         */
+        if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
+            !IS_ALIGNED(slot->npages, 1ull << *max_order))
+                *max_order = 0;
+success:
 
         r = 0;
 out_unlock:

base-commit: bc1a54ee393e0574ea422525cf0b2f1e768e38c5
--