From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 66E4F1DE8AD for ; Tue, 30 Dec 2025 21:03:46 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767128628; cv=none; b=ikpkzzFSNOrR/b9fRZ3E8u0CeVXUGdTlHVjqXRyfvIjqkZYdli58PYSnpVawvwXWwuCcHhIZ96RtDkOcY4lXv1gZVejruCZ9g4fzIwnBRyVJRlEP3tD08XHh4wJC2WoaR8p3R1Pzi0mE7Lw5IEa4uoycgmG/QUWyZSJtnog78jw= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1767128628; c=relaxed/simple; bh=4+7jjQn6JzQNnlhRmzZZE4sdlO+Oh+y4wNWPEles7Qw=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=hnxSbaKIyRCZ4tX3+1SH7f8UvRA/pbeTk8wzUBOVmfe5NBBliivXGsGfzNagaGxFLkpSXfaWVxSB/1dDymHE/3wJdm3xI7zRS0vF+94fjenH890jqYbebZ+4W++DxGaPIRG3uJNPR4HS8g83auSSJ85J5HsxcH4lHE+AlevCj3A= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=StYbdgis; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="StYbdgis" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-34c66cb671fso14302285a91.3 for ; Tue, 30 Dec 2025 13:03:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1767128626; x=1767733426; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ym4S0QyvPMG0XiW+HIJvIT/hvfPyeKcaVKLj0gbxhNQ=; b=StYbdgis5smx/HqLl59qmx8b9yVJu44NjPdSjste1uFJVyV/+QgME4PNkNCgr5ezli IOBdLQ7WCyCkBLNQcvtxkdtzN/MrEcznka2rVCptdi9/YXdIdMifEfNwn7cWgbPynzW3 8GFUB2jQMDHtSfQ3PbunMpcrvgGJw5dVrsXfrg66GNF3XRiK4Cb0Za6ckm+C4sBZC6rw SINGKeguqJCl/kwlsW2ruFCO9MsqMQraLTCvw8yQgm47erE3VJhVZawKICme7BNyGVQF z9BpevMyi6abfxM+7XxYMceG4RNIi7LxaW2XbqDfU+4qASllmbvjD8n22UHwHjgfEywL a9pg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1767128626; x=1767733426; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ym4S0QyvPMG0XiW+HIJvIT/hvfPyeKcaVKLj0gbxhNQ=; b=CPB3rgbvbfOw/snHnZBQvaOF1NJSv+e2bUGeimm/Ay8xeOcsg+PuVErRL7HKJU8jL6 E+qdSTgIb+WyC/649ZO0kzo+ZXtVtXBCmsNwBD8hea6qtclhuERLJTSzVL3vHaautIax 4/5elxAvg5liLr1YY10IQgyAetw+6Xhls12wXSRb0PxTlGYFvBANyetMMMNA+/Nqw15b cAnLoQKqFs2CxAL/wPDFEx3jYEQmPK8s07tjr3+LFO9OJOWfc2J4zJzlm3yKnltqAyCQ QRJ9axEbm7gV7ZgqanpV4qma+dW4vMoAsl7pwQwVZaiEvr+OOEKc2irwyQWlFThBqo72 OCSQ== X-Forwarded-Encrypted: i=1; AJvYcCWHqeE0/82F4z9dNGBa3VDSd+cN7Nh8B8d3JZXxZwG+uQzLmQyML3h1EHaR+WlAuTfp/7s=@vger.kernel.org X-Gm-Message-State: AOJu0Yzj858+1wtEvnBIFyw18IwpTgK8tLdU5KvlPfKT8vv+10ni6RC/ xGphS4xeO17N9NH1wftkJ5NhlYEOva4agSUT4kFPnjk9cHHcEm/zgXSKfA1siD+zN3o9mxR00Ox v6ZomSg== X-Google-Smtp-Source: AGHT+IFLYyERmex9TveHvRNqbQTL0j6gSDW/CkDKRhvyrotz4Q+v2kV0VwCqI9KSie6bNkqYczyCbX9S7UY= X-Received: from pjbbj19.prod.google.com ([2002:a17:90b:893:b0:34c:a40f:705a]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a20:3d86:b0:366:58cc:b74b with SMTP id adf61e73a8af0-376a7aed58cmr35177912637.21.1767128625750; Tue, 30 Dec 2025 13:03:45 -0800 (PST) Date: Tue, 30 Dec 2025 13:03:44 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20251205231913.441872-1-seanjc@google.com> <20251205231913.441872-7-seanjc@google.com> Message-ID: Subject: Re: [PATCH v3 06/10] KVM: nVMX: Switch to vmcs01 to update SVI on-demand if L2 is active From: Sean Christopherson To: Chao Gao Cc: Paolo Bonzini , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Dongli Zhang Content-Type: text/plain; charset="us-ascii" On Thu, Dec 25, 2025, Chao Gao wrote: > On Fri, Dec 05, 2025 at 03:19:09PM -0800, Sean Christopherson wrote: > >@@ -6963,21 +6963,16 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) > > u16 status; > > u8 old; > > > >- /* > >- * If L2 is active, defer the SVI update until vmcs01 is loaded, as SVI > >- * is only relevant for if and only if Virtual Interrupt Delivery is > >- * enabled in vmcs12, and if VID is enabled then L2 EOIs affect L2's > >- * vAPIC, not L1's vAPIC. KVM must update vmcs01 on the next nested > >- * VM-Exit, otherwise L1 with run with a stale SVI. > >- */ > >- if (is_guest_mode(vcpu)) { > >- to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true; > >- return; > >- } > >- > > if (max_isr == -1) > > max_isr = 0; > > > >+ /* > >+ * Always update SVI in vmcs01, as SVI is only relevant for L2 if and > >+ * only if Virtual Interrupt Delivery is enabled in vmcs12, and if VID > >+ * is enabled then L2 EOIs affect L2's vAPIC, not L1's vAPIC. > >+ */ > >+ guard(vmx_vmcs01)(vcpu); > > KVM calls this function when virtualizing EOI for L2, and in a previous > discussion, you mentioned that the overhead of switching to VMCS01 is > "non-trivial and unnecessary" (see [1]). > > My testing shows that guard(vmx_vmcs01) takes about 140-250 cycles. I think > this overhead is acceptable for nested scenarios, since it only affects > EOI-induced VM-exits in specific/suboptimal configurations. > > But I'm wondering whether KVM should update SVI on every VM-entry instead of > doing it on-demand (i.e., when vISR gets changed). We've encountered two > SVI-related bugs [1][2] that were difficult to debug. Preventing these issues > entirely seems worthwhile, and the overhead of always updating SVI during > VM-entry should be minimal since KVM already updates RVI (RVI and SVI are in > the the same VMCS field) in vmx_sync_irr_to_pir() when APICv is enabled. Hmm. At first glance, I _really_ like this idea, but I'm leaning fairly strongly towards keeping .hwapic_isr_update(). While small (~28 cycles on EMR), the runtime cost isn't zero, and it affects the fastpath. And number of useful updates is comically small. E.g. without a nested VM, AFAICT they basically never happen post-boot. Even when running nested VMs, the number of useful update when running L1 hovers around ~0.001%. More importantly, KVM will carry most of the complexity related to vISR updates regardless of how KVM handles SVI because of the ISR caching for non-APICv systems. So while I acknowledge that we've had some nasty bugs and 100% agree that squashing them entirely is _very_ enticing, I think those bugs were due to what were effectively two systemic flaws in KVM: (1) not aligning SVI with KVM's ISR caching code, and (2) the whole "defer updates to nested VM-Exit" mess. At the end of this series, both (1) and (2) are "solved". Huh. And now that I look at (1) again, the last patch is wrong (benignly wrong, but still wrong). The changelog says this: First, it adds a call during kvm_lapic_reset(), but that's a glorified nop as the ISR has already been zeroed. but that's simply not true. There's already a call in kvm_lapic_reset(). So that patch can be amended with: diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 7be4d759884c..55a7a2be3a2e 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -2907,10 +2907,8 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu->arch.pv_eoi.msr_val = 0; apic_update_ppr(apic); - if (apic->apicv_active) { + if (apic->apicv_active) kvm_x86_call(apicv_post_state_restore)(vcpu); - kvm_x86_call(hwapic_isr_update)(vcpu, -1); - } vcpu->arch.apic_arb_prio = 0; vcpu->arch.apic_attention = 0; At which point updates to highest_isr_cache and .hwapic_isr_update() are fully symmetrical (ignoring that KVM simply invalidates highest_isr_cache instead of scanning the vISR on EOI and APICv changes). So yeah, the more I look at all of this, the more I'm in favor of keeping .hwapic_isr_update(), e.g. if only to let it serve as a canary for finding issues related to highest_isr_cache and/or isr_count. > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index ef8d29c677b9..e7883bf7665f 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -6957,45 +6957,20 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu) > read_unlock(&vcpu->kvm->mmu_lock); > } > > -void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr) > +static void vmx_set_rvi_svi(int rvi, int svi) If this ever goes anywhere, my vote would be to call this vmx_sync_guest_intr_status(), and pass in only @rvi, e.g. static void vmx_sync_guest_intr_status(struct kvm_vcpu *vcpu, int rvi) { int svi = kvm_lapic_find_highest_isr(vcpu); u16 status, new; ... } > status = vmcs_read16(GUEST_INTR_STATUS); > + new = (rvi & 0xff) | ((u8)svi << 8); I think this is technically undefined behavior? Due to a shift larger than type (casting to an 8-bit value and then shifting by 8). svi[31:8] should always be '0', but to be paranoid we could do: new = (rvi & 0xff) | ((svi & 0xff) << 8); > + if (new != status) > + vmcs_write16(GUEST_INTR_STATUS, new); > }