From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Thu, 26 Feb 2026 10:20:27 -0800
In-Reply-To: <txfn2izdpaavep6yrcujlxkqrqf2gwk2ccb6dplwcfnsstdnie@lgx74e27nus7>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
Mime-Version: 1.0
References: <20260209195142.2554532-1-yosry.ahmed@linux.dev>
 <20260209195142.2554532-2-yosry.ahmed@linux.dev> <txfn2izdpaavep6yrcujlxkqrqf2gwk2ccb6dplwcfnsstdnie@lgx74e27nus7>
Message-ID: <aaCO62eQiZX5pvSk@google.com>
Subject: Re: [PATCH v2 1/2] KVM: SVM: Triple fault L1 on unintercepted
 EFER.SVME clear by L2
From: Sean Christopherson <seanjc@google.com>
To: Yosry Ahmed <yosry@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>, Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org, 
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"

On Thu, Feb 26, 2026, Yosry Ahmed wrote:
> On Mon, Feb 09, 2026 at 07:51:41PM +0000, Yosry Ahmed wrote:
> > KVM tracks when EFER.SVME is set and cleared to initialize and tear down
> > nested state. However, it doesn't differentiate if EFER.SVME is getting
> > toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept
> > the EFER write, KVM exits guest mode and tears down nested state while
> > L2 is running, executing L1 without injecting a proper #VMEXIT.
> > 
> > According to the APM:
> > 
> >     The effect of turning off EFER.SVME while a guest is running is
> >     undefined; therefore, the VMM should always prevent guests from
> >     writing EFER.
> > 
> > Since the behavior is architecturally undefined, KVM gets to choose what
> > to do. Inject a triple fault into L1 as a more graceful option than
> > running L1 with corrupted state.
> > 
> > Co-developed-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> > ---
> >  arch/x86/kvm/svm/svm.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 5f0136dbdde6..ccd73a3be3f9 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -216,6 +216,17 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> >  
> >  	if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
> >  		if (!(efer & EFER_SVME)) {
> > +			/*
> > +			 * Architecturally, clearing EFER.SVME while a guest is
> > +			 * running yields undefined behavior, i.e. KVM can do
> > +			 * literally anything.  Force the vCPU back into L1 as
> > +			 * that is the safest option for KVM, but synthesize a
> > +			 * triple fault (for L1!) so that KVM at least doesn't
> > +			 * run random L2 code in the context of L1.
> > +			 */
> > +			if (is_guest_mode(vcpu))
> > +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > +
> 
> Sigh, I think this is not correct in all cases:
> 
> 1. If userspace restores a vCPU with EFER.SVME=0 to a vCPU with
> EFER.SVME=1 (e.g. restoring a vCPU running to a vCPU running L2).
> Typically KVM_SET_SREGS is done before KVM_SET_NESTED_STATE, so we may
> set EFER.SVME = 0 before leaving guest mode.
> 
> 2. On vCPU reset, we clear EFER. Hmm, this one is seemingly okay tho,
> looking at kvm_vcpu_reset(), we leave nested first:
> 
> 	/*
> 	 * SVM doesn't unconditionally VM-Exit on INIT and SHUTDOWN, thus it's
> 	 * possible to INIT the vCPU while L2 is active.  Force the vCPU back
> 	 * into L1 as EFER.SVME is cleared on INIT (along with all other EFER
> 	 * bits), i.e. virtualization is disabled.
> 	 */
> 	if (is_guest_mode(vcpu))
> 		kvm_leave_nested(vcpu);
> 
> 	...
> 
> 	kvm_x86_call(set_efer)(vcpu, 0);
> 
> So I think the only problematic case is (1). We can probably fix this by
> plumbing host_initiated through set_efer? This is getting more
> complicated than I would have liked..
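
For reference, a rough standalone model of the host_initiated idea.  This is
only a sketch under assumptions: struct sk_vcpu, sk_set_efer() and the field
names are made-up stand-ins, not KVM's real types and not the actual
svm_set_efer() signature.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFER.SVME is bit 12; the SK_ prefix marks this as a toy definition. */
#define SK_EFER_SVME	(1ull << 12)

/* Hypothetical, heavily reduced vCPU state for illustration only. */
struct sk_vcpu {
	uint64_t efer;
	bool guest_mode;	/* true while L2 is active */
	bool triple_fault_req;	/* models KVM_REQ_TRIPLE_FAULT being pending */
};

/*
 * Toy EFER update that is told who initiated the write.  Clearing SVME
 * while L2 is active always forces the vCPU back to L1, but only a
 * guest-initiated write requests the triple fault.
 */
static void sk_set_efer(struct sk_vcpu *vcpu, uint64_t efer,
			bool host_initiated)
{
	bool svme_cleared = (vcpu->efer & SK_EFER_SVME) &&
			    !(efer & SK_EFER_SVME);

	if (svme_cleared && vcpu->guest_mode) {
		if (!host_initiated)
			vcpu->triple_fault_req = true;
		vcpu->guest_mode = false;
	}

	vcpu->efer = efer;
}
```

In this model a guest WRMSR would call sk_set_efer(vcpu, val, false), while a
KVM_SET_SREGS-style restore would pass true, so only the guest path requests
the triple fault.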

What if we instead hook WRMSR interception?  A little fugly (well, more than a
little), but I think it would minimize the chances of a false positive.  The
biggest potential flaw I see is that this will incorrectly triple fault if KVM
synthesizes a #VMEXIT while emulating the WRMSR.  But that really shouldn't
happen, because even a #GP=>#VMEXIT needs to be queued but not synthesized until
the emulation sequence completes (any other behavior would risk confusing KVM).

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8f8bc863e214..1d8d9960df20 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3119,10 +3119,28 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 
 static int msr_interception(struct kvm_vcpu *vcpu)
 {
-       if (to_svm(vcpu)->vmcb->control.exit_info_1)
-               return kvm_emulate_wrmsr(vcpu);
-       else
+       bool efer_l2 = is_guest_mode(vcpu) && (u32)kvm_rcx_read(vcpu) == MSR_EFER;
+       int r;
+
+       if (!to_svm(vcpu)->vmcb->control.exit_info_1)
                return kvm_emulate_rdmsr(vcpu);
+
+       r = kvm_emulate_wrmsr(vcpu);
+
+       /*
+        * If EFER.SVME is cleared while the vCPU is in L2, KVM forces
+        * the vCPU back into L1 as that is the safest option for KVM.
+        * Architecturally, clearing EFER.SVME while a guest is running
+        * yields undefined behavior, i.e. KVM can do literally anything.
+        * Synthesize a shutdown (for L1!) if EFER.SVME was cleared on a
+        * guest WRMSR (to avoid false positives on userspace restoring
+        * state), so that KVM at least doesn't run random L2 code in
+        * the context of L1.
+        */
+       if (r && efer_l2 && !is_guest_mode(vcpu))
+               kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+       return r;
 }
 
 static int interrupt_window_interception(struct kvm_vcpu *vcpu)
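
To make the intended behavior concrete, here is a toy userspace model of the
check in the diff above.  It is a sketch, not KVM code: every toy_* name and
the struct layout are invented for illustration, the EFER update is reduced
to "clearing SVME forces the vCPU out of guest mode", and truncating RCX to
32 bits is an assumption of the model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy constants mirroring the architectural values (EFER MSR, SVME bit). */
#define TOY_MSR_EFER	0xc0000080u
#define TOY_EFER_SVME	(1ull << 12)

/* Hypothetical, heavily reduced vCPU state. */
struct toy_vcpu {
	uint64_t efer;
	uint64_t rcx;		/* MSR index, only ECX is meaningful */
	uint64_t wrmsr_value;	/* stands in for EDX:EAX */
	bool guest_mode;	/* true while L2 is active */
	bool triple_fault_req;	/* models KVM_REQ_TRIPLE_FAULT being pending */
};

/* Common EFER path, shared by guest writes and userspace restore. */
static void toy_set_efer(struct toy_vcpu *vcpu, uint64_t efer)
{
	if ((vcpu->efer & TOY_EFER_SVME) && !(efer & TOY_EFER_SVME))
		vcpu->guest_mode = false;	/* forced back into L1 */

	vcpu->efer = efer;
}

/* Returns 1 to mimic "emulation succeeded, keep running the guest". */
static int toy_emulate_wrmsr(struct toy_vcpu *vcpu)
{
	if ((uint32_t)vcpu->rcx == TOY_MSR_EFER)
		toy_set_efer(vcpu, vcpu->wrmsr_value);
	return 1;
}

/* Snapshot "L2 is writing EFER" first, then see if emulation left L2. */
static int toy_wrmsr_interception(struct toy_vcpu *vcpu)
{
	bool efer_l2 = vcpu->guest_mode &&
		       (uint32_t)vcpu->rcx == TOY_MSR_EFER;
	int r = toy_emulate_wrmsr(vcpu);

	if (r && efer_l2 && !vcpu->guest_mode)
		vcpu->triple_fault_req = true;

	return r;
}

/* Userspace restore bypasses the interception path entirely. */
static void toy_userspace_set_efer(struct toy_vcpu *vcpu, uint64_t efer)
{
	toy_set_efer(vcpu, efer);
}
```

In this model only the WRMSR exit path can request the triple fault.  A
restore that clears EFER.SVME while the toy vCPU is still marked as in L2
leaves guest mode without one, which is the false positive the hook is meant
to avoid.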