From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2AF10193404 for ; Tue, 7 Jan 2025 21:15:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736284560; cv=none; b=GONpI5CRHYKdFm3xOSewFTiVh6cTHdcORLbsj3Miq+hnEhsUqhOvt3jT0QIpxnVBWbP1wVstEMOPyNBxvbQ4odPlUNaE38XucaGYcSfS1aH8+hDFqa4NGAIYL7aUUUvqAyX0Ht4OpvmdS/9icG3J/D/2sWJUdqjYf/hLCSRi/uI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1736284560; c=relaxed/simple; bh=2FpNaEXQ8xQA2MTVke+4O0H5c2+DdJHfxrMdHw8J22Q=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=bRNO4DKqI0OqoPhaBdgXhc47tR5r+RhcsbGnuBZakc5pXv87rxCc1A5a2qa8RCbxPWuI4R+HWL+voL7zcNM3vKKQfBc7FSpyLeCqJpb2gZnabJi+rZ6DLiMbRNWiSEo2ImBosdP08d8IZ+oD7eLCWQeZxlaaU1DY4I+4n1FgLtI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=C7JibWba; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="C7JibWba" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-216430a88b0so231923625ad.0 for ; Tue, 07 Jan 2025 13:15:57 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1736284557; x=1736889357; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=tpykW6aRa97Fn/TAsIxTWJnIe1BxGqwoS1e1wLY1NxQ=; b=C7JibWbatkfm8Gj56yZ4Ygmomoil+V5U8Ftws75IiGW4t+ZNLCAyyT6MDbcfXsc2tS zg5zMcBc5z7TaN8hZQBoQDs/IWYBk2AXnK1z5EA6x0i7OvJ6HoX8Eoaa3CgXR8aZQmQu ie7bs2ukxiJxgBzkWgZ3ojOPRoVxF5msq8iNU9IABgJB3sR3h9oiMdJ5GZd9hZ5AIxKT DccRc9Q+GEKPaGdu9xgHsjD3wkGaqyQmTvW/WCp+DLM0/m/rDPuuFl936sWMDmRaSfQO dVz8VDryfMjth79RzuFjNFoyLxcUq+vn749suBjyMsrQGQh/IaEX2a9DAe6FlIcvJrrE J6gw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1736284557; x=1736889357; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=tpykW6aRa97Fn/TAsIxTWJnIe1BxGqwoS1e1wLY1NxQ=; b=MVJYxRaU5+N7+vsZgw8Rd0oePtY0fQsj5kNWjzYyoTuS/fkleNlFuWcFkLuQ7xaea/ S1oL+wxYGokL4+5zn3H93wVLHeoLHFejjbI7YvAJ0BnIo09Yusvdh9c0jBePSH4SQNUo 74UYSLSV/jeuSUyKWsfDYADSoGSFLn/ZyKc2CN2KzvxQamUw0w5IWV9/RZC65vd+3bjI NP8EIZV+Qb5KGCk/hbsrGUa3ZrgR55xhlisivj+ZjaQFPXDG9IwGqb3tTxvu+vojNoUX 0fuJ9DPNmRHdIXGs+4Kf7dAEu9+sHYnIbUFV01Aigi8Ta9DCefcwEbXe4rGE1E13wsNh +ldA== X-Forwarded-Encrypted: i=1; AJvYcCWcrUJkvP53jvNGpuystCfa7oXXgJImtgDbWMk0pAb/TIYsYB5WSc/DG72xIyz9pF8DKiU=@vger.kernel.org X-Gm-Message-State: AOJu0YwNyDZ6MPcqS4vgsYhfHTyyA3ZsuoC0tR5zJmAlbEK81KMODkBI Sw20IzBd03Ugw0BfLu1ladpNVjDTc8Ks9sPrjRNTTD7WNAbl0H+HH7OBDLyJ+932X5i6KL/tZ6w eag== X-Google-Smtp-Source: AGHT+IHn9oiNuhAf4IW3SY/bFOYkHHz7vyz4CtxQI658bYKRE7q8bKAyNwRbP7N0cn7Qh1Wdpuli3qvVfKI= X-Received: from pfms15.prod.google.com ([2002:aa7:828f:0:b0:72a:bcc3:4c9a]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a05:6a20:6a05:b0:1e1:bdae:e058 with SMTP id adf61e73a8af0-1e88d0d6b86mr1274231637.37.1736284557351; Tue, 07 Jan 2025 13:15:57 -0800 (PST) Date: Tue, 7 Jan 2025 13:15:56 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20241209010734.3543481-1-binbin.wu@linux.intel.com> <20241209010734.3543481-13-binbin.wu@linux.intel.com> Message-ID: Subject: Re: [PATCH 12/16] KVM: TDX: Inhibit APICv for TDX guest From: Sean Christopherson To: Binbin Wu Cc: Chao Gao , Vishal Annapurve , pbonzini@redhat.com, kvm@vger.kernel.org, rick.p.edgecombe@intel.com, kai.huang@intel.com, adrian.hunter@intel.com, reinette.chatre@intel.com, xiaoyao.li@intel.com, tony.lindgren@linux.intel.com, isaku.yamahata@intel.com, yan.y.zhao@intel.com, linux-kernel@vger.kernel.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Tue, Jan 07, 2025, Binbin Wu wrote: > On 1/7/2025 11:24 AM, Chao Gao wrote: > > > Note, kvm_use_posted_timer_interrupt() uses a heuristic of HLT/MWAIT = interception > > > being disabled to detect that it's worth trying to post a timer inter= rupt, but off > > > the top of my head I'm pretty sure that's unnecessary and pointless. = The > > Commit 1714a4eb6fb0 gives an explanation: > >=20 > > if only some guests isolated and others not, they would not > > have any benefit from posted timer interrupts, and at the same time = lose > > VMX preemption timer fast paths because kvm_can_post_timer_interrupt= () > > returns true and therefore forces kvm_can_use_hv_timer() to false. But that's only relevant for setup. Upon expiration, if the target vCPU is= in guest mode and APICv is active, then from_timer_fn must be true, which in t= urn means that hrtimers are in use. > Userspace uses KVM_CAP_X86_DISABLE_EXITS to enable mwait_in_guest or/and > hlt_in_guest for non TDX guest. The code doesn't work for TDX guests. > - Whether mwait in guest is allowed for TDX depends on the cpuid > =C2=A0 configuration in TD_PARAMS, not the capability of physical cpu che= cked by > =C2=A0 kvm_can_mwait_in_guest(). > - hlt for TDX is via TDVMCALL, i.e. hlt_in_guest should always be false > =C2=A0 for TDX guests. >=20 > For TDX guests, not sure if it worths to fix kvm_{mwait,hlt}_in_guest() > or add special handling (i.e., skip checking mwait/hlt in guest) because > VMX preemption timer can't be used anyway, in order to allow housekeeping > CPU to post timer interrupt for TDX vCPUs when nohz_full is configured > after changing APICv active to true. Yes, but I don't think we need any TDX-specific logic beyond noting that th= e VMX preemption can't be used. As above, consulting kvm_can_post_timer_inte= rrupt() in the expiration path is silly. And after staring at all of this for a few hours, I think we should ditch t= he entire lapic_timer.pending approach, because at this point it's doing more = harm than good. Tracking pending timer IRQs was primarily done to support deliv= ery of *all* timer interrupts when multiple interrupts were coalesced in the ho= st, e.g. because a vCPU was scheduled out. But that got removed by commit f1ed0450a5fa ("KVM: x86: Remove support for reporting coalesced APIC IRQs")= , partly because posted interrupts threw a wrench in things, but also because delivering multiple periodic interrupts in quick succession is problematic = in its own right. With the interrupt coalescing logic gone, I can't think of any reason KVM n= eeds to kick the vCPU out to the main vcpu_run() loop just to deliver the interr= upt, i.e. pending interrupts before delivering them is unnecessary. E.g. condit= ioning direct deliverly on apicv_active in the !from_timer_fn case makes no sense,= because KVM is going to manually move the interrupt from the PIR to the IRR anyways= . I also don't like that the behavior for the posting path is subtly differen= t than !APICv. E.g. if an hrtimer fires while KVM is handling a write to TMICT, K= VM will deliver the interrupt if configured to post timer, but not if APICv is disa= bled, because the latter will increment "pending", and "pending" will be cleared = before handling the new TMICT. Ditto for switch APIC timer modes. Unfortunately, there is a lot to disentangle before KVM can directly delive= r all APIC timer interupts, as KVM has built up a fair bit of ugliness on top of = "pending".=20 E.g. when switching back to the VMX preemption timer (after a blocking vCPU= is scheduled back in), KVM leaves the hrtimer active. start_hv_timer() accoun= ts for that by checking lapic_timer.pending, but leaving the hrtimer running in th= is case is absurd and unnecessarily relies on "pending" being incremented. if (kvm_x86_call(set_hv_timer)(vcpu, ktimer->tscdeadline, &expired)) return false; ktimer->hv_timer_in_use =3D true; hrtimer_cancel(&ktimer->timer); /* * To simplify handling the periodic timer, leave the hv timer running * even if the deadline timer has expired, i.e. rely on the resulting * VM-Exit to recompute the periodic timer's target expiration. */ if (!apic_lvtt_period(apic)) { /* * Cancel the hv timer if the sw timer fired while the hv timer * was being programmed, or if the hv timer itself expired. */ if (atomic_read(&ktimer->pending)) { cancel_hv_timer(apic); } else if (expired) { apic_timer_expired(apic, false); cancel_hv_timer(apic); } } I'm also not convinced that letting the hrtimer run on a different CPU in t= he "normal" case is a good idea. KVM explicitly migrates the timer, *except* = for the posting case, and so not pinning the hrtimer is likely counter-producti= ve, not to mention confusing. void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu) { struct hrtimer *timer; if (!lapic_in_kernel(vcpu) || kvm_can_post_timer_interrupt(vcpu)) return; timer =3D &vcpu->arch.apic->lapic_timer.timer; if (hrtimer_cancel(timer)) hrtimer_start_expires(timer, HRTIMER_MODE_ABS_HARD); } And if the hrtimer is pinned to the pCPU, then in practice it should be imp= ossible for KVM to post a timer IRQ into a vCPU that's in guest mode. Seapking of which, posting a timer IRQ into a vCPU that's in guest mode is = a bit dodgy. I _think_ it's safe, because all of the flows that can be coinciden= t are actually mutually exclusive. E.g. apic_timer_expired()'s call to __kvm_wait_lapic_expire() can't collide with the call from the entry path, = as the entry path's invocation happens with IRQs disabled. But all of this makes me wary, so I'd much prefer to clean up, harden, and = document the existing code before we try to optimize timer IRQ delivery for more cas= es. That's especially true for TDX as we're already adding significant compexli= ty and multiple novel paths for TDX; we don't need more at this time :-)