Date: Thu, 26 Feb 2026 10:20:27 -0800
X-Mailing-List: linux-kernel@vger.kernel.org
References:
 <20260209195142.2554532-1-yosry.ahmed@linux.dev>
 <20260209195142.2554532-2-yosry.ahmed@linux.dev>
Subject: Re: [PATCH v2 1/2] KVM: SVM: Triple fault L1 on unintercepted EFER.SVME clear by L2
From: Sean Christopherson
To: Yosry Ahmed
Cc: Yosry Ahmed, Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"

On Thu, Feb 26, 2026, Yosry Ahmed wrote:
> On Mon, Feb 09, 2026 at 07:51:41PM +0000, Yosry Ahmed wrote:
> > KVM tracks when EFER.SVME is set and cleared to initialize and tear down
> > nested state. However, it doesn't differentiate if EFER.SVME is getting
> > toggled in L1 or L2+. If L2 clears EFER.SVME, and L1 does not intercept
> > the EFER write, KVM exits guest mode and tears down nested state while
> > L2 is running, executing L1 without injecting a proper #VMEXIT.
> >
> > According to the APM:
> >
> >     The effect of turning off EFER.SVME while a guest is running is
> >     undefined; therefore, the VMM should always prevent guests from
> >     writing EFER.
> >
> > Since the behavior is architecturally undefined, KVM gets to choose what
> > to do. Inject a triple fault into L1 as a more graceful option than
> > running L1 with corrupted state.
> >
> > Co-developed-by: Sean Christopherson
> > Signed-off-by: Sean Christopherson
> > Signed-off-by: Yosry Ahmed
> > ---
> >  arch/x86/kvm/svm/svm.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index 5f0136dbdde6..ccd73a3be3f9 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -216,6 +216,17 @@ int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> >
> >  	if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
> >  		if (!(efer & EFER_SVME)) {
> > +			/*
> > +			 * Architecturally, clearing EFER.SVME while a guest is
> > +			 * running yields undefined behavior, i.e. KVM can do
> > +			 * literally anything.  Force the vCPU back into L1 as
> > +			 * that is the safest option for KVM, but synthesize a
> > +			 * triple fault (for L1!) so that KVM at least doesn't
> > +			 * run random L2 code in the context of L1.
> > +			 */
> > +			if (is_guest_mode(vcpu))
> > +				kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > +
>
> Sigh, I think this is not correct in all cases:
>
> 1. If userspace restores a vCPU with EFER.SVME=0 to a vCPU with
> EFER.SVME=1 (e.g. restoring a vCPU running to a vCPU running L2).
> Typically KVM_SET_SREGS is done before KVM_SET_NESTED_STATE, so we may
> set EFER.SVME = 0 before leaving guest mode.
>
> 2. On vCPU reset, we clear EFER. Hmm, this one is seemingly okay tho,
> looking at kvm_vcpu_reset(), we leave nested first:
>
> 	/*
> 	 * SVM doesn't unconditionally VM-Exit on INIT and SHUTDOWN, thus it's
> 	 * possible to INIT the vCPU while L2 is active.  Force the vCPU back
> 	 * into L1 as EFER.SVME is cleared on INIT (along with all other EFER
> 	 * bits), i.e. virtualization is disabled.
> 	 */
> 	if (is_guest_mode(vcpu))
> 		kvm_leave_nested(vcpu);
>
> 	...
>
> 	kvm_x86_call(set_efer)(vcpu, 0);
>
> So I think the only problematic case is (1). We can probably fix this by
> plumbing host_initiated through set_efer? This is getting more
> complicated than I would have liked..

What if we instead hook WRMSR interception?  A little fugly (well, more than a
little), but I think it would minimize the chances of a false-positive.  The
biggest potential flaw I see is that this will incorrectly triple fault if KVM
synthesizes a #VMEXIT while emulating the WRMSR.  But that really shouldn't
happen, because even a #GP=>#VMEXIT needs to be queued but not synthesized until
the emulation sequence completes (any other behavior would risk confusing KVM).

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 8f8bc863e214..1d8d9960df20 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3119,10 +3119,27 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 
 static int msr_interception(struct kvm_vcpu *vcpu)
 {
-	if (to_svm(vcpu)->vmcb->control.exit_info_1)
-		return kvm_emulate_wrmsr(vcpu);
-	else
+	bool efer_l2 = is_guest_mode(vcpu) && kvm_rcx_read(vcpu) == MSR_EFER;
+	int r;
+
+	if (!to_svm(vcpu)->vmcb->control.exit_info_1)
 		return kvm_emulate_rdmsr(vcpu);
+
+	r = kvm_emulate_wrmsr(vcpu);
+
+	/*
+	 * If EFER.SVME is cleared while the vCPU is in L2, KVM forces the vCPU
+	 * back into L1 as that is the safest option for KVM.  Architecturally,
+	 * clearing EFER.SVME while a guest is running yields undefined behavior,
+	 * i.e. KVM can do literally anything.  Synthesize a shutdown (for L1!)
+	 * if EFER.SVME was cleared on a guest WRMSR (to avoid false positives
+	 * on userspace restoring state), so that KVM at least doesn't run
+	 * random L2 code in the context of L1.
+	 */
+	if (r && efer_l2 && !is_guest_mode(vcpu))
+		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
+	return r;
 }
 
 static int interrupt_window_interception(struct kvm_vcpu *vcpu)
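
For anyone following along without the KVM sources handy, here is a rough
userspace toy model of the idea above. It is not KVM code: every toy_* name and
the struct layout are made up for illustration, and the triple_fault flag merely
stands in for KVM_REQ_TRIPLE_FAULT. The only thing it tries to show is that
checking "was in L2 before, is in L1 after" around the guest WRMSR path flags
the L2-initiated clear, while a userspace restore that goes through the same
common EFER update does not.

```c
#include <stdbool.h>
#include <stdint.h>

/* EFER.SVME is bit 12; the TOY_ prefix marks this as illustration only. */
#define TOY_EFER_SVME (1ull << 12)

/* Hypothetical, heavily simplified stand-in for a vCPU. */
struct toy_vcpu {
	uint64_t efer;
	bool guest_mode;	/* true while "L2" is active */
	bool triple_fault;	/* stands in for KVM_REQ_TRIPLE_FAULT */
};

/* Common EFER update, shared by guest WRMSR and userspace restore. */
static void toy_set_efer(struct toy_vcpu *vcpu, uint64_t efer)
{
	/* Clearing SVME tears down nested state, i.e. forces the vCPU to L1. */
	if ((vcpu->efer & TOY_EFER_SVME) && !(efer & TOY_EFER_SVME))
		vcpu->guest_mode = false;

	vcpu->efer = efer;
}

/* Guest-initiated write: the check lives here, not in the common path. */
static void toy_guest_wrmsr_efer(struct toy_vcpu *vcpu, uint64_t efer)
{
	bool was_l2 = vcpu->guest_mode;

	toy_set_efer(vcpu, efer);

	/* The vCPU left L2 as a side effect of the write: flag a fault. */
	if (was_l2 && !vcpu->guest_mode)
		vcpu->triple_fault = true;
}

/* Userspace-initiated write (e.g. state restore): never flags a fault. */
static void toy_userspace_set_efer(struct toy_vcpu *vcpu, uint64_t efer)
{
	toy_set_efer(vcpu, efer);
}
```

Calling toy_guest_wrmsr_efer(&vcpu, 0) on a vCPU with guest_mode set leaves
triple_fault set, whereas toy_userspace_set_efer(&vcpu, 0) on the same starting
state drops to L1 without flagging anything. The model deliberately ignores the
WRMSR emulation return value and L1's MSR intercepts.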