From: Ackerley Tng via B4 Relay
Date: Thu, 07 May 2026 13:22:20 -0700
Subject: [PATCH v6 01/43] KVM: guest_memfd: Introduce per-gmem attributes, use to guard user mappings
Message-Id: <20260507-gmem-inplace-conversion-v6-1-91ab5a8b19a4@google.com>
References: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
In-Reply-To: <20260507-gmem-inplace-conversion-v6-0-91ab5a8b19a4@google.com>
Reply-To: ackerleytng@google.com
To: aik@amd.com, andrew.jones@linux.dev, binbin.wu@linux.intel.com, brauner@kernel.org, chao.p.peng@linux.intel.com, david@kernel.org, ira.weiny@intel.com, jmattson@google.com, jthoughton@google.com, michael.roth@amd.com, oupton@kernel.org, pankaj.gupta@amd.com, qperret@google.com, rick.p.edgecombe@intel.com, rientjes@google.com, shivankg@amd.com, steven.price@arm.com, tabba@google.com, willy@infradead.org, wyihan@google.com, yan.y.zhao@intel.com, forkloop@google.com, pratyush@kernel.org, suzuki.poulose@arm.com, aneesh.kumar@kernel.org, liam@infradead.org, Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org, "H. Peter Anvin", Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev, Ackerley Tng
X-Mailer: b4 0.14.3
From: Sean Christopherson

Start plumbing in guest_memfd support for in-place private<=>shared
conversions by tracking attributes via a maple tree.

KVM currently tracks private vs. shared attributes on a per-VM basis,
which made sense when a guest_memfd _only_ supported private memory, but
tracking per-VM simply can't work for in-place conversions, as the
shareability of a given page needs to be per-gmem_inode, not per-VM.
Use the filemap invalidation lock to protect the maple tree, as taking
the lock for read when faulting in memory (for userspace or the guest)
isn't expected to result in meaningful contention, and using a separate
lock would add significant complexity (avoiding deadlock is quite
difficult).

Signed-off-by: Sean Christopherson
Co-developed-by: Ackerley Tng
Signed-off-by: Ackerley Tng
Co-developed-by: Vishal Annapurve
Signed-off-by: Vishal Annapurve
Co-developed-by: Fuad Tabba
Signed-off-by: Fuad Tabba
---
 virt/kvm/guest_memfd.c | 133 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 117 insertions(+), 16 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b28..5011d38820d0d 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -4,6 +4,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -33,6 +34,13 @@ struct gmem_inode {
 	struct list_head gmem_file_list;
 	u64 flags;
 
+	/*
+	 * Every index in this inode, whether memory is populated or
+	 * not, is tracked in attributes. The entire range of indices,
+	 * corresponding to the size of this inode, is represented in
+	 * this maple tree.
+	 */
+	struct maple_tree attributes;
 };
 
 static __always_inline struct gmem_inode *GMEM_I(struct inode *inode)
@@ -60,6 +68,24 @@ static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
 	return gfn - slot->base_gfn + slot->gmem.pgoff;
 }
 
+static u64 kvm_gmem_get_attributes(struct inode *inode, pgoff_t index)
+{
+	struct maple_tree *mt = &GMEM_I(inode)->attributes;
+	void *entry = mtree_load(mt, index);
+
+	return WARN_ON_ONCE(!entry) ? 0 : xa_to_value(entry);
+}
+
+static bool kvm_gmem_is_private_mem(struct inode *inode, pgoff_t index)
+{
+	return kvm_gmem_get_attributes(inode, index) & KVM_MEMORY_ATTRIBUTE_PRIVATE;
+}
+
+static bool kvm_gmem_is_shared_mem(struct inode *inode, pgoff_t index)
+{
+	return !kvm_gmem_is_private_mem(inode, index);
+}
+
 static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				    pgoff_t index, struct folio *folio)
 {
@@ -397,10 +423,13 @@ static vm_fault_t kvm_gmem_fault_user_mapping(struct vm_fault *vmf)
 	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
 		return VM_FAULT_SIGBUS;
 
-	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
-		return VM_FAULT_SIGBUS;
+	filemap_invalidate_lock_shared(inode->i_mapping);
+	if (kvm_gmem_is_shared_mem(inode, vmf->pgoff))
+		folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	else
+		folio = ERR_PTR(-EACCES);
+	filemap_invalidate_unlock_shared(inode->i_mapping);
 
-	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		if (PTR_ERR(folio) == -EAGAIN)
 			return VM_FAULT_RETRY;
@@ -556,6 +585,51 @@ bool __weak kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
 	return true;
 }
 
+static int kvm_gmem_init_inode(struct inode *inode, loff_t size, u64 flags)
+{
+	struct gmem_inode *gi = GMEM_I(inode);
+	MA_STATE(mas, &gi->attributes, 0, (size >> PAGE_SHIFT) - 1);
+	u64 attrs;
+	int r;
+
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+
+	/*
+	 * guest_memfd memory is neither migratable nor swappable: set
+	 * inaccessible to gate off both.
+	 */
+	mapping_set_inaccessible(inode->i_mapping);
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	gi->flags = flags;
+
+	mt_set_external_lock(&gi->attributes,
+			     &inode->i_mapping->invalidate_lock);
+
+	/*
+	 * Store default attributes for the entire gmem instance. Ensuring
+	 * every index is represented in the maple tree at all times
+	 * simplifies the conversion and merging logic.
+	 */
+	attrs = gi->flags & GUEST_MEMFD_FLAG_INIT_SHARED ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE;
+
+	/*
+	 * Acquire the invalidation lock purely to make lockdep happy. The
+	 * maple tree library expects all stores to be protected via the
+	 * lock, and the library can't know when the tree is reachable only
+	 * by the caller, as is the case here.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+	r = mas_store_gfp(&mas, xa_mk_value(attrs), GFP_KERNEL);
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return r;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	static const char *name = "[kvm-gmem]";
@@ -586,16 +660,9 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fops;
 	}
 
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
-	GMEM_I(inode)->flags = flags;
+	err = kvm_gmem_init_inode(inode, size, flags);
+	if (err)
+		goto err_inode;
 
 	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
 	if (IS_ERR(file)) {
@@ -797,9 +864,13 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	if (!file)
 		return -EFAULT;
 
+	filemap_invalidate_lock_shared(file_inode(file)->i_mapping);
+
 	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
-	if (IS_ERR(folio))
-		return PTR_ERR(folio);
+	if (IS_ERR(folio)) {
+		r = PTR_ERR(folio);
+		goto out;
+	}
 
 	if (!folio_test_uptodate(folio)) {
 		clear_highpage(folio_page(folio, 0));
@@ -815,6 +886,8 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
 	else
 		folio_put(folio);
 
+out:
+	filemap_invalidate_unlock_shared(file_inode(file)->i_mapping);
 	return r;
 }
 EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_gmem_get_pfn);
@@ -944,6 +1017,15 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 	mpol_shared_policy_init(&gi->policy, NULL);
 
+	/*
+	 * Memory attributes are protected by the filemap invalidation lock,
+	 * but the lock structure isn't available at this time. Immediately
+	 * mark the maple tree as using external locking so that accessing
+	 * the tree before it's fully initialized results in NULL pointer
+	 * dereferences and not more subtle bugs.
+	 */
+	mt_init_flags(&gi->attributes, MT_FLAGS_LOCK_EXTERN | MT_FLAGS_USE_RCU);
+
 	gi->flags = 0;
 	INIT_LIST_HEAD(&gi->gmem_file_list);
 	return &gi->vfs_inode;
@@ -951,7 +1033,26 @@ static struct inode *kvm_gmem_alloc_inode(struct super_block *sb)
 
 static void kvm_gmem_destroy_inode(struct inode *inode)
 {
-	mpol_free_shared_policy(&GMEM_I(inode)->policy);
+	struct gmem_inode *gi = GMEM_I(inode);
+
+	mpol_free_shared_policy(&gi->policy);
+
+	/*
+	 * Note! Checking for an empty tree is functionally necessary to
+	 * avoid explosions if the tree hasn't been fully initialized,
+	 * i.e. if the inode is being destroyed before guest_memfd can
+	 * set the external lock, lockdep would find that the tree's
+	 * internal ma_lock was not held.
+	 */
+	if (!mtree_empty(&gi->attributes)) {
+		/*
+		 * Acquire the invalidation lock purely to make lockdep
+		 * happy, the inode is unreachable at this point.
+		 */
+		filemap_invalidate_lock(inode->i_mapping);
+		__mt_destroy(&gi->attributes);
+		filemap_invalidate_unlock(inode->i_mapping);
+	}
 }
 
 static void kvm_gmem_free_inode(struct inode *inode)

-- 
2.54.0.563.g4f69b47b94-goog