All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio
@ 2026-06-11  8:18 zhanghao
  2026-06-11 10:32 ` sashiko-bot
  2026-06-11 14:18 ` Sean Christopherson
  0 siblings, 2 replies; 3+ messages in thread
From: zhanghao @ 2026-06-11  8:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm

From b164e59d4068226dfb33babe49292c7a685cacd9 Mon Sep 17 00:00:00 2001
From: Hao Zhang <zhanghao1@kylinos.cn>
Date: Thu, 11 Jun 2026 15:27:27 +0800

memory_failure() calls ->error_remove_folio() while holding the global
mf_mutex and the poisoned folio lock. guest_memfd's implementation takes
mapping.invalidate_lock for read before zapping KVM mappings.

That lock ordering can deadlock against guest_memfd punch-hole, which
holds mapping.invalidate_lock for write and can then wait on the same
folio lock in truncate_inode_pages_range().

Use a trylock in kvm_gmem_error_folio(). If mapping.invalidate_lock is
contended, fail recovery instead of blocking in the memory-failure path,
and instead of reporting MF_DELAYED without actually zapping KVM mappings.

Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Signed-off-by: Hao Zhang <zhanghao1@kylinos.cn>
---
 virt/kvm/guest_memfd.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 69c9d6d546b2..9417be3049cf 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -499,7 +499,16 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 {
 	pgoff_t start, end;
 
-	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * memory_failure() holds mf_mutex globally.  We must not block
+	 * on filemap_invalidate_lock here, as it can be held exclusive
+	 * by kvm_gmem_fallocate() (MADV_REMOVE/FALLOC_FL_PUNCH_HOLE
+	 * path), creating an ABBA deadlock with the poisoned folio lock.
+	 * If the invalidation lock is contended, fail the recovery instead
+	 * of reporting MF_DELAYED without zapping KVM mappings.
+	 */
+	if (!filemap_invalidate_trylock_shared(mapping))
+		return -EBUSY;
 
 	start = folio->index;
 	end = start + folio_nr_pages(folio);

base-commit: 9716c086c8e8b141d35aa61f2e96a2e83de212a7
-- 
2.15.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio
  2026-06-11  8:18 [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio zhanghao
@ 2026-06-11 10:32 ` sashiko-bot
  2026-06-11 14:18 ` Sean Christopherson
  1 sibling, 0 replies; 3+ messages in thread
From: sashiko-bot @ 2026-06-11 10:32 UTC (permalink / raw)
  To: zhanghao; +Cc: kvm

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:

Pre-existing issues:
- [High] kvm_gmem_error_folio incorrectly returns the internal enum MF_DELAYED instead of 0 on success, causing memory_failure() to treat successful recovery as a failure.
- [High] Leaving the poisoned folio in the page cache causes memory_failure() to fail due to an unexpected extra reference during me_pagecache_clean().
--

commit f0c468f1d22ef5955d11ed4dde621b8eef0c2808
Author: Hao Zhang <zhanghao1@kylinos.cn>

KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio

This commit modifies kvm_gmem_error_folio() to use a trylock when acquiring
the mapping's invalidate_lock. This prevents an ABBA deadlock with guest_memfd
punch-hole operations, returning -EBUSY instead if the lock is contended.

> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b28..9417be3049cf0 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -499,7 +499,16 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
>  {
>  	pgoff_t start, end;
>  
> -	filemap_invalidate_lock_shared(mapping);
> +	/*
> +	 * memory_failure() holds mf_mutex globally.  We must not block
> +	 * on filemap_invalidate_lock here, as it can be held exclusive
> +	 * by kvm_gmem_fallocate() (MADV_REMOVE/FALLOC_FL_PUNCH_HOLE
> +	 * path), creating an ABBA deadlock with the poisoned folio lock.
> +	 * If the invalidation lock is contended, fail the recovery instead
> +	 * of reporting MF_DELAYED without zapping KVM mappings.
> +	 */
> +	if (!filemap_invalidate_trylock_shared(mapping))
> +		return -EBUSY;
>  
>  	start = folio->index;
>  	end = start + folio_nr_pages(folio);

[Severity: High]
This is a pre-existing issue, but does leaving the poisoned folio in the page
cache cause memory_failure() to fail?

In kvm_gmem_error_folio(), guest_memfd intentionally does not truncate the
poisoned folio from the page cache so that KVM can gracefully handle guest
access to the poisoned page. However, in
mm/memory-failure.c:me_pagecache_clean(), the code checks for extra references
after truncate_error_folio():

	extra_pins = shmem_mapping(mapping);
	...
	if (has_extra_refcount(ps, p, extra_pins))
		ret = MF_FAILED;

Because guest_memfd is not shmem, extra_pins is false. has_extra_refcount()
then sees the lingering page cache reference on the folio as an unexpected
user and returns true. Does this guarantee that me_pagecache_clean() will
always return MF_FAILED for guest_memfd, leading to the VM being hard-killed
by the MCE handler instead of KVM gracefully recovering?

[Severity: High]
This isn't a bug introduced by this patch, but should this callback return 0
on success instead of MF_DELAYED?

The VFS address_space_operations->error_remove_folio callback contract
requires returning 0 on success or a standard negative error code on failure.
However, kvm_gmem_error_folio() returns the internal memory-failure enum
MF_DELAYED (2).

The caller truncate_error_folio() in mm/memory-failure.c checks this:

	if (err != 0)
		pr_info("%#lx: Failed to punch page: %d\n", pfn, err);

Since it mistakenly treats the positive enum value as an error, will this
log "Failed to punch page: 2" and return MF_FAILED, entirely defeating the
graceful recovery feature?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/tencent_4B051B84CDCC0D1964EC00337AA32E40DC07@qq.com?part=1

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio
  2026-06-11  8:18 [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio zhanghao
  2026-06-11 10:32 ` sashiko-bot
@ 2026-06-11 14:18 ` Sean Christopherson
  1 sibling, 0 replies; 3+ messages in thread
From: Sean Christopherson @ 2026-06-11 14:18 UTC (permalink / raw)
  To: zhanghao; +Cc: Paolo Bonzini, kvm, Ackerley Tng, Lisa Wang, David Hildenbrand

+Lisa, David, and Ackerley

On Thu, Jun 11, 2026, zhanghao wrote:
> >From b164e59d4068226dfb33babe49292c7a685cacd9 Mon Sep 17 00:00:00 2001
> From: Hao Zhang <zhanghao1@kylinos.cn>
> Date: Thu, 11 Jun 2026 15:27:27 +0800
> 
> memory_failure() calls ->error_remove_folio() while holding the global
> mf_mutex and the poisoned folio lock. guest_memfd's implementation takes
> mapping.invalidate_lock for read before zapping KVM mappings.
> 
> That lock ordering can deadlock against guest_memfd punch-hole, which
> holds mapping.invalidate_lock for write and can then wait on the same
> folio lock in truncate_inode_pages_range().

I assume this was found by lockdep?  If so, please provide a lockdep splat so that
it's easier to understand exactly what all is problematic.

> Use a trylock in kvm_gmem_error_folio(). If mapping.invalidate_lock is
> contended, fail recovery instead of blocking in the memory-failure path,
> and instead of reporting MF_DELAYED without actually zapping KVM mappings.
> 
> Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
> Signed-off-by: Hao Zhang <zhanghao1@kylinos.cn>
> ---
>  virt/kvm/guest_memfd.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 69c9d6d546b2..9417be3049cf 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -499,7 +499,16 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
>  {
>  	pgoff_t start, end;
>  
> -	filemap_invalidate_lock_shared(mapping);
> +	/*
> +	 * memory_failure() holds mf_mutex globally.  

Why does mf_mutex matter? 

> We must not block
> +	 * on filemap_invalidate_lock here, as it can be held exclusive
> +	 * by kvm_gmem_fallocate() (MADV_REMOVE/FALLOC_FL_PUNCH_HOLE
> +	 * path), creating an ABBA deadlock with the poisoned folio lock.

It's not just kvm_gmem_fallocate() that's problematic.  Due to the fairness of
r/w semaphores, waiting writers block future readers, and so any path that takes
a folio lock inside of mapping->invalidate_lock will effectively create the same
scenario.

I really don't like doing a trylock here, it feels like we're hacking around a
poor locking scheme in guest_memfd.  Can't we also fix this by using a dedicated
lock for bindings (and for link_at() in the future)?

E.g. untested, but something like this?

diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
index 86690683b2fe..6ff005978ae1 100644
--- virt/kvm/guest_memfd.c
+++ virt/kvm/guest_memfd.c
@@ -32,6 +32,8 @@ struct gmem_inode {
        struct inode vfs_inode;
        struct list_head gmem_file_list;
 
+       struct rw_semaphore bindings_lock;
+
        u64 flags;
 };
 
@@ -500,7 +502,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 {
        pgoff_t start, end;
 
-       filemap_invalidate_lock_shared(mapping);
+       down_read(&GMEM_I(mapping->host)->bindings_lock);
 
        start = folio->index;
        end = start + folio_nr_pages(folio);
@@ -518,7 +520,7 @@ static int kvm_gmem_error_folio(struct address_space *mapping, struct folio *fol
 
        kvm_gmem_invalidate_end(mapping->host, start, end);
 
-       filemap_invalidate_unlock_shared(mapping);
+       up_read(&GMEM_I(mapping->host)->bindings_lock);
 
        return MF_DELAYED;
 }
@@ -597,6 +599,7 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
        WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
        GMEM_I(inode)->flags = flags;
+       init_rwsem(&GMEM_I(inode)->bindings_lock);
 
        file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR, &kvm_gmem_fops);
        if (IS_ERR(file)) {
@@ -691,7 +694,10 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
        if (kvm_gmem_supports_mmap(inode))
                slot->flags |= KVM_MEMSLOT_GMEM_ONLY;
 
+       down_write(&GMEM_I(inode)->bindings_lock);
        xa_store_range(&f->bindings, start, end - 1, slot, GFP_KERNEL);
+       up_write(&GMEM_I(inode)->bindings_lock);
+
        filemap_invalidate_unlock(inode->i_mapping);
 
        /*
@@ -746,7 +752,9 @@ void kvm_gmem_unbind(struct kvm_memory_slot *slot)
        }
 
        filemap_invalidate_lock(file->f_mapping);
+       down_write(&GMEM_I(file_inode(file))->bindings_lock);
        __kvm_gmem_unbind(slot, file->private_data);
+       up_write(&GMEM_I(file_inode(file))->bindings_lock);
        filemap_invalidate_unlock(file->f_mapping);
 }

^ permalink raw reply related	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2026-06-11 14:18 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-11  8:18 [PATCH] KVM: guest_memfd: Fix ABBA deadlock in error_remove_folio zhanghao
2026-06-11 10:32 ` sashiko-bot
2026-06-11 14:18 ` Sean Christopherson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.