Date: Wed, 22 Apr 2026 15:42:09 -0700
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20260326031150.3774017-1-yosry@kernel.org> <20260326031150.3774017-5-yosry@kernel.org>
Message-ID:
Subject: Re: [PATCH v4 4/6] KVM: x86/pmu: Re-evaluate Host-Only/Guest-Only on nested SVM transitions
From: Sean Christopherson
To: Yosry Ahmed
Cc: Jim Mattson, Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

On Tue, Apr 21, 2026, Yosry Ahmed wrote:
> On Thu, Apr 09, 2026 at 02:21:14PM -0700, Sean Christopherson wrote:
> > On Thu, Apr 09, 2026, Sean Christopherson wrote:
> > > On Thu, Apr 09, 2026, Jim Mattson wrote:
> > > > On Thu, Apr 9, 2026 at 10:48 AM Sean Christopherson wrote:
> > > > > On Thu, Apr 09, 2026, Jim Mattson wrote:
> > > > > > > > In general, this deferral is misguided. The G/H bits should be
> > > > > > > > re-evaluated before we call kvm_pmu_instruction_retired() for an
> > > > > > > > emulated instruction. ...
> > > > > > > > This happens too late for VMRUN, since we have already called
> > > > > > > > kvm_pmu_instruction_retired() via kvm_skip_emulated_instruction(), and
> > > > > > > > VMRUN counts as a *guest* instruction.
> > > > > > >
> > > > > > > It's just VMRUN that's problematic though, correct?  I.e. the scheme as a whole
> > > > > > > is fine, we just need to special case VMRUN due to SVM's erratum^Warchitecture.
> > > > > > > Alternatively, maybe we could get AMD to document the silly VMRUN behavior as an
> > > > > > > erratum, then we could claim KVM is architecturally superior. :-D
> > > > > >
> > > > > > Here, it's just VMRUN. Above, it's WRMSR(EFER).
> > > > >
> > > > > But clearing EFER.SVME while in the guest generates architecturally undefined
> > > > > behavior.  I don't see any reason to complicate PMU virtualization for that
> > > > > scenario, especially now that KVM synthesizes triple fault for L1.
> > > >
> > > > L1 can clear the virtual EFER.SVME. That is well-defined.
> > >
> > > Gah, I forgot that the H/G bits are ignored when EFER.SVME=0.  That's really
> > > annoying.
> >
> > What do you think about having two flavors of kvm_pmu_handle_nested_transition()?
> > One that defers via a request, and a "special" (SVM-only?) version that does
> > direct updates.
> >
> > Poking into PMU state in arbitrary contexts makes me nervous.  E.g. when calling
> > svm_leave_nested(), odds are good EFER isn't even correct, and the update *needs*
> > to be deferred.
>
> Hmm is it really that bad?

It's not horrible, but it's a lot of "I think" and "should" and whatnot.  I
generally agree that it's unlikely to be a problem, but I can point at far too
many bugs where KVM unexpectedly invokes a helper and consumes stale state.

I'm not completely opposed to non-deferred updates, but I really don't want to
use them for svm_leave_nested().

> - In the emulated VMRUN and #VMEXIT paths, EFER.SVME should be set in
> both L1 and L2, so it should be fine.
>
> - In the restore path entering guest mode, EFER.SVME should also be set
> in both L1 and L2.
>
> So I think svm_leave_nested() is the only interesting case:
>
> - In the restore path, svm_leave_nested() should only be called if the
> CPU is in guest mode and EFER.SVME is set in both L1 and L2.
>
> - In the EFER update path, L1 will get a shutdown if we forcefully leave
> nested anyway, unless userspace is setting state. Either way, the
> value of EFER should be correct and valid to use to update the PMU
> here.
>
> - In the vCPU free path, it shouldn't really matter, but the value of
> EFER should still be correct.
> So I *think* generally the value of EFER should be fine to use. The
> other inputs are is_guest_mode() and eventsel. In both cases we should
> just make sure to update the PMU *after* updating the state.
>
> So I think we'd end up with something similar to Jim's v2:
> https://lore.kernel.org/kvm/20260129232835.3710773-1-jmattson@google.com/
>
> We will directly re-evaluate eventsel_hw on nested transitions, EFER
> updates, and PMU MSR updates -- without deferring anything.
>
> We'd still need to make other changes:
> - Always disable the PMC if EFER.SVME is clear and either H/G bit (or
> both) is set.
>
> - Handle VMRUN correctly. I was going to suggest just moving the call to
> kvm_skip_emulated_instruction() to the end of the function, but that
> doesn't handle the case where we inject #VMEXIT(INVALID) due to a
> VMRUN failure (e.g. consistency checks, loading CR3, etc).
>
> I am actually not sure if the instruction should count in host or
> guest mode in this case. Logically, we never entered the guest, so
> perhaps counting it in host mode is the right thing to do? I think
> we'll also need to test what HW does.
>
> Honestly, it would be a lot easier if someone from AMD could just tell
> us these things :)
>
> Basically:
> - Does the PMU generally count based on processor state (e.g. guest
> mode, EFER.SVME) before or after instruction retirement?
> - A successful VMRUN will be counted in guest mode, what about a
> failed VMRUN that produces #VMEXIT(INVALID)?
>
> > I definitely don't love having two separate update mechanisms, but it seems like
> > the safest option in this case.
>
> Same here, and I like the deferred handling, but to Jim's point I think
> we can't use it everywhere :/

Why can't we defer the svm_leave_nested() case?  The only flows that invoke
svm_leave_nested() are non-architectural, so being precise there doesn't matter
at all (and I'm not convinced it matters in general given none of us can figure
out what hardware is _supposed_ to do).
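[Editorial aside: the "disable the PMC if EFER.SVME is clear and an H/G bit is
set" rule above can be sketched in a few lines.  The helper and names below are
illustrative only, not KVM's actual code, and the SVME=0 behavior encodes the
*proposed* rule, pending confirmation of what hardware really does.]

```c
#include <stdbool.h>
#include <stdint.h>

/* AMD PerfEvtSeln Guest-Only/Host-Only filter bits and the enable bit. */
#define EVENTSEL_GUESTONLY (1ULL << 40)
#define EVENTSEL_HOSTONLY  (1ULL << 41)
#define EVENTSEL_ENABLE    (1ULL << 22)

/*
 * Hypothetical helper: compute the "hardware" eventsel for a counter
 * programmed by L1, given the vCPU's current mode.  Per the proposed
 * rule, any H/G bit with EFER.SVME clear forces the counter off; with
 * SVME set, the counter runs only when the current mode matches the
 * filter (both bits set counts in both modes).
 */
static uint64_t adjust_eventsel(uint64_t eventsel, bool svme, bool guest_mode)
{
	uint64_t hg = eventsel & (EVENTSEL_HOSTONLY | EVENTSEL_GUESTONLY);

	if (!hg)
		return eventsel;			/* no mode filtering requested */

	if (!svme)
		return eventsel & ~EVENTSEL_ENABLE;	/* proposed: always off */

	if (guest_mode ? (hg & EVENTSEL_GUESTONLY) : (hg & EVENTSEL_HOSTONLY))
		return eventsel;			/* mode matches the filter */

	return eventsel & ~EVENTSEL_ENABLE;		/* wrong mode: counter off */
}
```

(Only the SVME=0 arm would change if AMD clarifies the architectural behavior;
the mode-match logic is straight from the H/G bit definitions.)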
Having a synchronous path for architectural flows, and a deferred mechanism for
everything else seems reasonable, and would all but eliminate my concerns about
consuming stale state and/or doing things like attempting to write MSRs while
freeing a vCPU.
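[Editorial aside: the synchronous/deferred split can be modeled in a few lines.
This is a toy sketch of the pattern only -- the struct, function names, and
request flag are stand-ins, not KVM's actual API.]

```c
#include <stdbool.h>

/* Toy model of the proposed split; none of this is KVM's real API. */
struct toy_vcpu {
	bool pmu_request_pending;	/* stands in for a KVM_REQ_* bit */
	int  pmu_refreshes;		/* times eventsel_hw was re-evaluated */
};

/* Synchronous flavor: used only on architectural transitions (emulated
 * VMRUN / #VMEXIT), where EFER and guest mode are known-consistent. */
static void pmu_nested_transition_sync(struct toy_vcpu *v)
{
	v->pmu_refreshes++;
}

/* Deferred flavor: used everywhere else (e.g. svm_leave_nested()),
 * where vCPU state may be mid-update; just raise a request. */
static void pmu_nested_transition_deferred(struct toy_vcpu *v)
{
	v->pmu_request_pending = true;
}

/* Requests are serviced once state is consistent, i.e. at the
 * equivalent of the next vcpu_enter_guest(). */
static void service_pmu_request(struct toy_vcpu *v)
{
	if (v->pmu_request_pending) {
		v->pmu_request_pending = false;
		v->pmu_refreshes++;
	}
}
```

The deferred flavor is also naturally idempotent: raising the request twice
before it is serviced still yields a single re-evaluation.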