All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sean Christopherson <seanjc@google.com>
To: Rick P Edgecombe <rick.p.edgecombe@intel.com>
Cc: Yan Y Zhao <yan.y.zhao@intel.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	 "pbonzini@redhat.com" <pbonzini@redhat.com>,
	Reinette Chatre <reinette.chatre@intel.com>,
	 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot
Date: Mon, 19 May 2025 10:06:22 -0700	[thread overview]
Message-ID: <aCtlDhNbgXKg4s5t@google.com> (raw)
In-Reply-To: <52bdeeec0dfbb74f90d656dbd93dc9c7bb30e84f.camel@intel.com>

On Mon, May 19, 2025, Rick P Edgecombe wrote:
> On Mon, 2025-05-19 at 06:33 -0700, Sean Christopherson wrote:
> > Was this hit by a real VMM?  If so, why is a TDX VMM removing a memslot without
> > kicking vCPUs out of KVM?
> > 
> > Regardless, I would prefer not to add a new RET_PF_* flag for this.  At a glance,
> > KVM can simply drop and reacquire SRCU in the relevant paths.
> 
> During the initial debugging and kicking around stage, this is the first
> direction we looked. But kvm_gmem_populate() doesn't have scru locked, so then
> kvm_tdp_map_page() tries to unlock without it being held. (although that version
> didn't check r == RET_PF_RETRY like you had). Yan had the following concerns and
> came up with the version in this series, which we held review on for the list:

Ah, I missed the kvm_gmem_populate() => kvm_tdp_map_page() chain.

> > However, upon further consideration, I am reluctant to implement this fix for

Which fix?

> > the following reasons:
> > - kvm_gmem_populate() already holds the kvm->slots_lock.
> > - While retrying with srcu unlock and lock can workaround the
> >   KVM_MEMSLOT_INVALID deadlock, it results in each kvm_vcpu_pre_fault_memory()
> >   and tdx_handle_ept_violation() faulting with different memslot layouts.

This behavior has existed since pretty much the beginning of KVM time.  TDX is the
oddball that doesn't re-enter the guest.  All other flavors re-enter the guest on
RET_PF_RETRY, which means dropping and reacquiring SRCU.  Which is why I don't like
RET_PF_RETRY_INVALID_SLOT; it's simply handling the case we know about.

Arguably, _TDX_ is buggy by not providing this behavior.

> I'm not sure why the second one is really a problem. For the first one I think
> that path could just take the scru lock in the proper order with kvm-
> >slots_lock?

Acquiring SRCU inside slots_lock should be fine.  The reserve order would be
problematic, as KVM synchronizes SRCU while holding slots_lock.

That said, I don't love the idea of grabbing SRCU, because it's so obviously a
hack.  What about something like this?

---
 arch/x86/kvm/mmu.h     |  2 ++
 arch/x86/kvm/mmu/mmu.c | 49 +++++++++++++++++++++++++++---------------
 arch/x86/kvm/vmx/tdx.c |  7 ++++--
 virt/kvm/kvm_main.c    |  5 ++---
 4 files changed, 41 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b4b6860ab971..0fc68f0fe80e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -259,6 +259,8 @@ extern bool tdp_mmu_enabled;
 
 bool kvm_tdp_mmu_gpa_is_mapped(struct kvm_vcpu *vcpu, u64 gpa);
 int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level);
+int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
+			  u8 *level);
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index cbc84c6abc2e..4f16fe95173c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4851,24 +4851,15 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 {
 	int r;
 
-	/*
-	 * Restrict to TDP page fault, since that's the only case where the MMU
-	 * is indexed by GPA.
-	 */
-	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
-		return -EOPNOTSUPP;
+	if (signal_pending(current))
+		return -EINTR;
 
-	do {
-		if (signal_pending(current))
-			return -EINTR;
+	if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
+		return -EIO;
 
-		if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu))
-			return -EIO;
-
-		cond_resched();
-		r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
-	} while (r == RET_PF_RETRY);
+	cond_resched();
 
+	r = kvm_mmu_do_page_fault(vcpu, gpa, error_code, true, NULL, level);
 	if (r < 0)
 		return r;
 
@@ -4878,10 +4869,12 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 	case RET_PF_WRITE_PROTECTED:
 		return 0;
 
+	case RET_PF_RETRY:
+		return -EAGAIN;
+
 	case RET_PF_EMULATE:
 		return -ENOENT;
 
-	case RET_PF_RETRY:
 	case RET_PF_CONTINUE:
 	case RET_PF_INVALID:
 	default:
@@ -4891,6 +4884,28 @@ int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level
 }
 EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
 
+int kvm_tdp_prefault_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8 *level)
+{
+	int r;
+
+	/*
+	 * Restrict to TDP page fault, since that's the only case where the MMU
+	 * is indexed by GPA.
+	 */
+	if (vcpu->arch.mmu->page_fault != kvm_tdp_page_fault)
+		return -EOPNOTSUPP;
+
+	for (;;) {
+		r = kvm_tdp_map_page(vcpu, gpa, error_code, level);
+		if (r != -EAGAIN)
+			break;
+
+		/* Comment goes here. */
+		kvm_vcpu_srcu_read_unlock(vcpu);
+		kvm_vcpu_srcu_read_lock(vcpu);
+	}
+}
+
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range)
 {
@@ -4918,7 +4933,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 	 * Shadow paging uses GVA for kvm page fault, so restrict to
 	 * two-dimensional paging.
 	 */
-	r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
+	r = kvm_tdp_prefault_page(vcpu, range->gpa, error_code, &level);
 	if (r < 0)
 		return r;
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b952bc673271..1a232562080d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3075,8 +3075,11 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 	if (ret != 1)
 		return -ENOMEM;
 
-	ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
-	if (ret < 0)
+	do {
+		ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
+	while (ret == -EAGAIN);
+
+	if (ret)
 		goto out;
 
 	/*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b24db92e98f3..21a3fa7476dd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4266,7 +4266,6 @@ static int kvm_vcpu_ioctl_get_stats_fd(struct kvm_vcpu *vcpu)
 static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				     struct kvm_pre_fault_memory *range)
 {
-	int idx;
 	long r;
 	u64 full_size;
 
@@ -4279,7 +4278,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	vcpu_load(vcpu);
-	idx = srcu_read_lock(&vcpu->kvm->srcu);
+	kvm_vcpu_srcu_read_lock(vcpu);
 
 	full_size = range->size;
 	do {
@@ -4300,7 +4299,7 @@ static int kvm_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 		cond_resched();
 	} while (range->size);
 
-	srcu_read_unlock(&vcpu->kvm->srcu, idx);
+	kvm_vcpu_srcu_read_unlock(vcpu);
 	vcpu_put(vcpu);
 
 	/* Return success if at least one page was mapped successfully.  */

base-commit: 12ca5c63556bbfcd77fe890fcdd1cd1adfb31fdd
--

  reply	other threads:[~2025-05-19 17:06 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-19  2:36 [PATCH 0/2] Introduce RET_PF_RETRY_INVALID_SLOT Yan Zhao
2025-05-19  2:37 ` [PATCH 1/2] KVM: x86/mmu: Add RET_PF_RETRY_INVALID_SLOT for fault retry on invalid slot Yan Zhao
2025-05-19 13:33   ` Sean Christopherson
2025-05-19 15:05     ` Reinette Chatre
2025-05-19 15:22       ` Edgecombe, Rick P
2025-05-19 15:53         ` Sean Christopherson
2025-05-19 16:17           ` Edgecombe, Rick P
2025-05-19 16:12     ` Edgecombe, Rick P
2025-05-19 17:06       ` Sean Christopherson [this message]
2025-05-19 17:49         ` Reinette Chatre
2025-05-19 20:14         ` Edgecombe, Rick P
2025-05-20  5:33         ` Yan Zhao
2025-05-20 16:13           ` Sean Christopherson
2025-05-21  1:45             ` Yan Zhao
2025-05-21 15:45               ` Sean Christopherson
2025-05-22  0:40                 ` Yan Zhao
2025-05-20  5:27     ` Yan Zhao
2025-05-19  2:38 ` [PATCH 2/2] KVM: selftests: Test prefault memory with concurrent memslot removal Yan Zhao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aCtlDhNbgXKg4s5t@google.com \
    --to=seanjc@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=reinette.chatre@intel.com \
    --cc=rick.p.edgecombe@intel.com \
    --cc=yan.y.zhao@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.