From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id A2D4E221721 for ; Thu, 11 Jun 2026 13:25:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781184302; cv=none; b=Q7+a/ef4XMWBDAuxboER7Eqrc7OzvG6++WOTxwG0xuR1CMe+H6veHZn1D08rJGbOh7W3SAxztlArEy8HEM62JKdJHNIwWL+O+Xi9DW/PxDWMXj/FUaVOJiH3QQXMT3d7pplu5m/K4zDs+PZH/cK/snpFPC/E4/1pv66NPMcO104= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781184302; c=relaxed/simple; bh=y6EgzJnc2EK/SPlbhjC3+3YBSjLg/oahvUQ+YbI8rIk=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=ivspsWuAi3MGNI4kmVJJk/0WIVWqGgDfJ+rUtOaI0Owl55vudEaBC2x2qEP1wJXY8gEqPxvlzNafv1FPZb0HYTLH1pTx0+PCiZqjEq0YpZj30eK4txjlQg1TjaAD4y5mUk14chjergZHryBtCuSX5Nnv4WjjoH2N9+bj3VGx7kA= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=J9wP2YBG; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="J9wP2YBG" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2A2661F00893; Thu, 11 Jun 2026 13:25:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781184301; bh=1xsrPXD1UcDYw3A6NdySXPzCdLrjCT2iHgiHuEhql7o=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date; b=J9wP2YBG1vIm1uyHQVFnqMqhEu5APbCeJDqgBlwFtVxO7tDLcTLh9Og5dBvKGnwMB cvieZrBMwALJxt6C3iXbzHaRcoVb+iBZpCpjlNIrnUvhZDhnZlr1IsUp0u8Mk2qiB0 EsPELuVOqmbPdyQneu3FstPPcV3uzzbfaiyeXaNACPbkDu50BklwgdTLPXTWQbUkrg JxabYNSH+p15iGlLmW+M56ubUOi4Ovr+uHkWPWTANHnESo8/2BLMovKdZNbG7jGdGm XnyS6Z9H+FtQ5WTxnitaCk5ghi8/7/qivv8h+uojVSLoVtm4aHv7sUi37zIgtBPHIv L7BLjHlmfVK+A== From: sashiko-bot@kernel.org Subject: Re: [PATCH RFC 2/3] KVM: guest_memfd: support folio migration for non-confidential VMs Reply-To: sashiko-reviews@lists.linux.dev To: "Shivank Garg" Cc: kvm@vger.kernel.org In-Reply-To: <20260611-shivank-gmem-migrate-v1-2-2d266bfc6f95@amd.com> References: <20260611-shivank-gmem-migrate-v1-2-2d266bfc6f95@amd.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Thu, 11 Jun 2026 13:25:00 +0000 Message-Id: <20260611132501.2A2661F00893@smtp.kernel.org> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 3 potential issue(= s) to consider: New issues: - [Critical] Conditionally skipping `mapping_set_unmovable()` for non-confi= dential VMs inadvertently removes the `AS_UNEVICTABLE` flag, allowing silen= t eviction of guest memory and leading to host use-after-free and privilege= escalation. - [High] `kvm_gmem_migrate_folio` can use-after-free a `kvm_memory_slot` an= d permanently leak `mmu_invalidate_in_progress` if it races with memslot de= letion after the file descriptor is closed. Pre-existing issues: - [High] `kvm_gmem_error_folio` acquires `filemap_invalidate_lock_shared` w= hile holding `folio_lock`, creating a permanent AB-BA deadlock with truncat= e operations. -- commit 91a9697e1cf39521338cfd839599e2eeea129a1e Author: Shivank Garg Date: Thu Jun 11 13:05:09 2026 +0000 KVM: guest_memfd: support folio migration for non-confidential VMs =20 guest_memfd folios are currently marked unmmovable, so the kernel cannot perform NUMA-balancing, memory compaction, etc. This is unavoidable for confidential VMs (SEV-SNP, TDX), since memory is encrypted and copying it need firmware assistance. However, for non-cofidential VMs (like firecracker), we can migrate the folios. > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 806a42f0e031a..e4470106fc779 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -487,13 +487,45 @@ static struct file_operations kvm_gmem_fops =3D { > .fallocate =3D kvm_gmem_fallocate, > }; > =20 > +#ifdef CONFIG_MIGRATION > static int kvm_gmem_migrate_folio(struct address_space *mapping, > struct folio *dst, struct folio *src, > enum migrate_mode mode) > { > - WARN_ON_ONCE(1); > - return -EINVAL; > + struct inode *inode =3D mapping->host; > + pgoff_t start, end; > + int ret; > + > + if (!filemap_invalidate_trylock_shared(mapping)) > + return -EAGAIN; > + > + start =3D src->index; > + end =3D start + folio_nr_pages(src); > + > + kvm_gmem_invalidate_begin(inode, start, end); [Severity: High] Does this lockless XArray iteration risk a use-after-free and invalidate leakage if it races with kvm_gmem_unbind() during file close? If the file descriptor is closed (f_count =3D=3D 0), kvm_gmem_unbind() bypa= sses filemap_invalidate_lock because it assumes reliance on kvm->slots_lock. Sin= ce migration via compaction can occur before evict_inode() tears down the page cache, kvm_gmem_migrate_folio() can run concurrently with kvm_gmem_unbind() and memslot deletion without holding kvm->slots_lock or srcu_read_lock(&kvm->srcu). The yielded slot could be freed concurrently by KVM memslot deletion, causi= ng a use-after-free. Additionally, if the slot is removed before kvm_gmem_invalidate_end(), xa_find() would return false and kvm_mmu_invalidate_end() is skipped, permanently leaking mmu_invalidate_in_progress and hanging future VM page faults. > + > + /* > + * For non-confidential guest_memfd the folio is host-readable, > + * so filemap_migrate_folio() can copy the contents itself via > + * folio_mc_copy(). > + * > + * This is also the hook point for confidential VMs (SEV-SNP, TDX) once > + * they are made movable: the host cannot copy encrypted/private memory, > + * so a firmware-assisted copy would run here. > + * Idea: https://lore.kernel.org/r/20260428155043.39251-8-shivankg@amd.= com > + * Mark the @dst->migrate_info field with FOLIO_CONTENT_COPIED, so > + * __migrate_folio() skip folio_mc_copy() for confidential VMs. > + */ > + ret =3D filemap_migrate_folio(mapping, dst, src, mode); > + > + kvm_gmem_invalidate_end(inode, start, end); > + > + filemap_invalidate_unlock_shared(mapping); > + return ret; > } > +#else > +#define kvm_gmem_migrate_folio NULL > +#endif > =20 > static int kvm_gmem_error_folio(struct address_space *mapping, struct fo= lio *folio) > { [Severity: High] This is a pre-existing issue, but does kvm_gmem_error_folio() acquire filemap_invalidate_lock_shared() while holding the folio lock, creating a permanent AB-BA deadlock with truncate operations? Truncate operations (like fallocate(PUNCH_HOLE) via kvm_gmem_punch_hole()) establish a lock ordering by acquiring filemap_invalidate_lock in write mode first, and then locking individual folios via truncate_inode_pages_range(). However, the memory failure (HWPOISON) path (me_pagecache_clean() -> truncate_error_folio()) locks the affected folio first and then invokes the mapping's error_remove_folio callback. kvm_gmem_error_folio() unconditional= ly acquires filemap_invalidate_lock_shared(mapping) while the folio lock is already held. If a hardware memory failure races with a hole punch on the same guest_memfd file, this reverse lock ordering could result in an AB-BA deadlock that hangs the kernel. [ ... ] > @@ -592,9 +624,17 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t= size, u64 flags) > inode->i_size =3D size; > mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); > mapping_set_inaccessible(inode->i_mapping); > - mapping_set_unmovable(inode->i_mapping); > - /* Unmovable mappings are supposed to be marked unevictable as well. */ > - WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); > + > + /* > + * Confidential VMs (SEV-SNP, TDX) bind encryption to the physical > + * address and require firmware assisted copy, so their folios cannot > + * be migrated yet. > + */ > + if (kvm_arch_has_private_mem(kvm)) { > + mapping_set_unmovable(inode->i_mapping); > + /* Unmovable mappings are supposed to be marked unevictable as well. */ > + WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping)); > + } [Severity: Critical] By conditionally skipping mapping_set_unmovable() for non-confidential VMs, do we inadvertently leave the mapping without the AS_UNEVICTABLE flag? mapping_set_unmovable() is the only function that sets both AS_UNMOVABLE and AS_UNEVICTABLE. guest_memfd folios lack a backing store and a writepage callback (noop_dirty_folio). To the MM reclaimer (shrink_folio_list()), the= se folios appear clean and unmapped since there are no userspace PTEs. Under memory pressure, the reclaimer will silently evict them and free their physical pages to the host buddy allocator. Since eviction does not invoke = KVM MMU notifiers, KVM SPTEs remain valid and point to the freed pages. When the host reallocates these pages, the guest could read or write arbitrary host memory. Should mapping_set_unevictable() be called explicitly for non-confidential VMs? --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260611-shivank-gm= em-migrate-v1-0-2d266bfc6f95@amd.com?part=3D2