From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0A22314012 for ; Thu, 6 Feb 2025 01:29:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738805395; cv=none; b=f02BxtAZ6XJqCuZ+CNLHF/aEU5ttTO9Cs5tedPWH3dd3eAND5AtRoEvBl031H+wBrToNw0KIXktoRGpI3Xsc9yDzEKBsr/RxWg1DPbc5AJYSEpjJkv9Yd+9Iaf6TBgBPvrOreQ87n7vpcqYANUVoVIVJLFOEfEdukBcAL/ZCe0w= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738805395; c=relaxed/simple; bh=R6UH1A8cn89rNpOS450WPLBOLVlrOPelUtl5M85AHTA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=FrVJeZJBwPst5wUQuc4bSTo3GQCzKsfXhNoBhskyc9fmItSUyg3KmhL6CTIxgoa1ev85w2/oghwjCohK/+olc/F8YAzkkWg8graomL4xoI77DFrcZXk3yh+yAnEqzkhgzVhG/+HzkvVxeGEvk3MmYlJsmIMsvRijZ/dEMETi4Eo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=2Wg5c8sq; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="2Wg5c8sq" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-2f9c774c1c1so1060461a91.0 for ; Wed, 05 Feb 2025 17:29:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1738805392; x=1739410192; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=wIxvtwzmLpA97snLzsRJfxwNm03aAP0R1NKy0YT1wtQ=; b=2Wg5c8sq2VPgfAFdrd/2PkJB90nXb4BF+uQ2j0cqeVm0b8t1cHzcdfBfg/sKs98gUY J+8MOEZuP1rQ4UDYxkuuFHufgCKbI0/+rEVUhrmQCk2a6MJGknU37/Kv1DH0JhcDiRz/ BcUHex8yKcDUeMQ57A/LSnEadXgIdON2I0VCdHWdwUobFBtE1onTHgFMlVckZu2XvaIe ifZlXlWlyRlCwcbteqUED0mntVNznEovugzD09ArgYSleanh68b4Gsbrw4z6LKQyILJF llYirpSqwV+GIYGtOQHtRTmbs9nJkSZ5A6fZAdReadrJbUdETvHg3gLqF5ftEFQ6rwpw dK3Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738805392; x=1739410192; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=wIxvtwzmLpA97snLzsRJfxwNm03aAP0R1NKy0YT1wtQ=; b=FJUt/aiDQPWybK1pobK+MTCelPTLmHNghBW8ZraZtB90AL+bm6ouS2AuGDj7IEBs7w WRCko2FTXaCltUqsn2tgHULNIqqb0WjgaAgEpqoK8Qj+mG+5T3VoTDh6dABgJShvLDjX yTNsNqaRCYAxbmrqWNhRzcQpUI/jdRPdPrDvC9Va0h3yVXC5kbtvB9Jdud6sX88fIO8b watf1TJMqOjajKYMbDIHnt+TtHSfGXyHAC6WSrFrLJsgTk0D/L38GLmh6urc7WFzgVQZ Ug6mDX+AYz8LPKfPKb5Rlr0AcoUfe2di8fiTO32/X9sQoaHxVshYvRH8pJuxJ5MNj/aW DBSA== X-Forwarded-Encrypted: i=1; AJvYcCXgvSPBuzUxQ8UjROWFD7dLMzdKxV2jkc0jLzh8wgZ1EJs+AR4GBZXA0/Wt2/QVRMJajMFSOWF45QcdwN4=@vger.kernel.org X-Gm-Message-State: AOJu0Yw+QSi/uv6E+W80BwCeDh5Z52wGLQNAFOOsmr8vwvF8VVpj6K/s 4IQQPPuIOVRTrBwoEM+F2NZgx2xxypn6I7yodAfFY76bVAe/09nrtwJT2RYdJzJFN3gnvX6OPf2 4BQ== X-Google-Smtp-Source: AGHT+IEc+WF9sBFeqf5d4e4mEckdMGn6b3CMOgM9mvlyLWRnENoi46S8TptBfJj5+idIW8MscGmtn3KBoRM= X-Received: from pjbee7.prod.google.com ([2002:a17:90a:fc47:b0:2ea:9d23:79a0]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:2c86:b0:2ef:31a9:95af with SMTP id 98e67ed59e1d1-2f9e081077cmr7520457a91.27.1738805392310; Wed, 05 Feb 2025 17:29:52 -0800 (PST) Date: Wed, 5 Feb 2025 17:29:50 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250107042202.2554063-1-suleiman@google.com> <20250107042202.2554063-2-suleiman@google.com> Message-ID: Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns(). From: Sean Christopherson To: Suleiman Souhlal Cc: Paolo Bonzini , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Chao Gao , David Woodhouse , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, ssouhlal@freebsd.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Wed, Feb 05, 2025, Suleiman Souhlal wrote: > On Tue, Feb 4, 2025 at 4:58=E2=80=AFPM Suleiman Souhlal wrote: > > > Given that s2idle and standby don't reset host TSC, I think the right= way to > > > handle this conundrum is to address the flaw that's noted in the "bac= kwards TSC" > > > logic, and adjust guest TSC to be fully up-to-date in the S3 (or lowe= r) case. > > > > > > * ...................................... Unfortunately, we = can't > > > * bring the TSCs fully up to date with real time, as we aren= 't yet far > > > * enough into CPU bringup that we know how much real time ha= s actually > > > * elapsed; our helper function, ktime_get_boottime_ns() will= be using boot > > > * variables that haven't been updated yet. > > > > > > I have no idea why commit 0dd6a6edb012 ("KVM: Dont mark TSC unstable = due to S4 > > > suspend") hooked kvm_arch_enable_virtualization_cpu() instead of impl= ementing a > > > PM notifier, but I don't see anything that suggests it was deliberate= , i.e. that > > > KVm *needs* to effectively snapshot guest TSC when onlining CPUs. > > > > > > If that wart is fixed, then both kvmclock and TSC will account host s= uspend time, > > > and KVM can safely account the suspend time into steal time regardles= s of which > > > clock(s) the guest is using. > > > > I tried your suggestion of moving this to a PM notifier and I found > > that it's possible for VCPUs to run after resume but before the PM > > notifier has been called, because the resume notifiers get called > > after tasks are unfrozen. Unfortunately that means that if we were to > > do that, guest TSCs could go backwards. Ugh. That explains why KVM hooks the CPU online path. > > However, I think it should be possible to keep the existing backwards > > guest TSC prevention code but also use a notifier that further adjusts > > the guest TSCs to advance time on suspends where the TSC did go > > backwards. This would make both s2idle and deep suspends behave the > > same way. >=20 > An alternative might be to block VCPUs from newly entering the guest > between the pre and post suspend notifiers. > Otherwise, some of the steal time accounting would have to be done in > kvm_arch_enable_virtualization_cpu(), to make sure it gets applied on > the first VCPU run, in case that happens before the resume notifier > would have fired. But the comment there says we can't call > ktime_get_boottime_ns() there, so maybe that's not possible. I don't think the PM notifier approach is viable. It's simply too late. W= ithout a hook in CPU online, KVM can't even tell which VMs/vCPUs were running befo= re the suspend, i.e. which VMs need to be updated. One idea would be to simply fast forward guest TSC to current time when the= vCPU is loaded after suspend+resume. E.g. this hack appears to work. diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index dcd0c12c308e..27b25f8ca413 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4824,7 +4824,16 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int c= pu) =20 /* Apply any externally detected TSC adjustments (due to suspend) *= / if (unlikely(vcpu->arch.tsc_offset_adjustment)) { - adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustme= nt); + u64 kernel_ns, tsc_now; + + if (kvm_get_time_and_clockread(&kernel_ns, &tsc_now)) { + u64 l1_tsc =3D nsec_to_cycles(vcpu, vcpu->kvm->arch= .kvmclock_offset + kernel_ns); + u64 offset =3D kvm_compute_l1_tsc_offset(vcpu, l1_t= sc); + + kvm_vcpu_write_tsc_offset(vcpu, offset); + } else { + adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_= adjustment); + } vcpu->arch.tsc_offset_adjustment =3D 0; kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); }