From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f74.google.com (mail-pj1-f74.google.com [209.85.216.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9837919F41C for ; Fri, 17 Jan 2025 16:52:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.74 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737132772; cv=none; b=jDCKykQinptbBWIq6EKMn0Dt32x3a8GYa5rCLN+hNN7Yilbkdlb4D1R2FVut8wcYEKf5FEMGtR+95+DDS8/UPizD/HqHkSwOIYub7WiDA4h6FpviR8anj/HdiguSj+sT+WH+HYq4/QqYsts4d/8m8Ey2qdM4EDHgaYKtEwq4CXo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1737132772; c=relaxed/simple; bh=JshAQuXrdf/vFS7DdPV21az7k1q9yAkl72UZmaWqTPo=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=B7DkVhuZDzZYEYigdWV17YsKT9J7urnHCxEKwG5dXFk38s+HcnWvnJXECN9szEOa+M9EedUVPwdsnxxRcFuMhNAV9i2SM3cndSCr9HolbktOC6lFhSVqQ8eJxEDKjjTvBn+cLpieQdhDgsGQBxt/clwBQvjJ7zMgBQDfgufH2yI= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=LYquXNpJ; arc=none smtp.client-ip=209.85.216.74 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--seanjc.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="LYquXNpJ" Received: by mail-pj1-f74.google.com with SMTP id 98e67ed59e1d1-2ef9dbeb848so4476585a91.0 for ; Fri, 17 Jan 2025 08:52:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1737132770; x=1737737570; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:from:to:cc:subject:date:message-id :reply-to; bh=2eWhQ4gbK6D4YVawcJMZcGVfdNNcEyxTDArO+Kcn+bE=; b=LYquXNpJwtDjoJK/oXjqt72xqLBndZnCG4DHVdvmEP+qM7Qn2cuG1UV/Hf5Q1w9N+u R+m6S82X5aeybXQZPSrVO7C9J3i5kAS3XdtRjFxZS8IAyBeec2ok4zrScaFC8hzYl9JL 1xUBZ0s69zABfgfEJ3rRPmY9PUnRQI20Pz6VSsiIQu6kqs6SkS84CdKwOhe+r3lCOqGc WFq63GWzptg8Xi4kvjRNPKlcsYyiBVxVXyHcnVECqf+SVxRQecSzTgppcHMIvJ5Qgy12 vxDE5NyCbVivhplbkI2SKxwMBw1xW3pNNXZ0C5CrTwF/4FMPFnw9Zye96B4Tb2zScxtm i1Ow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737132770; x=1737737570; h=content-transfer-encoding:cc:to:from:subject:message-id:references :mime-version:in-reply-to:date:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=2eWhQ4gbK6D4YVawcJMZcGVfdNNcEyxTDArO+Kcn+bE=; b=C+L6cWwiMEx5yGRsaYTkQjuJnuGxkHPlRm6khlpjAC697LK95/W3jTLL3DBmN1Rsld qRJOX8EQ6OCtwz7O3Vun1kscRqNtvBGE7TDpvl6SgjElxpXnHNTbZcV+lqE1j8JfCo8k HMCmw1+bm8lD05t0i+sQ/C4OD2cOvIaCnlQf/86l3ZnsNYnpWT6Ylaaa1WEbDHd4k20y pBjpMfmVn1FnsqTeE6r1iiF0khWSIsA+wIfTsdO77i+P/2cjs2o/4Q33ba72QCc7jsh5 6lYQspOsvM3oCYvH6SGXssD5oNABC5UVIEo//bFvcgrB1xYJd3DsXqNZPINW60eLaLG3 fY5g== X-Forwarded-Encrypted: i=1; AJvYcCXv10aVa5fCZIXv84x9eT66wQ8k5OQyGQS5AItvXTf0XciSroYtnL+MM+jk3B/maxfYBvGAowuWbRg8S/A=@vger.kernel.org X-Gm-Message-State: AOJu0YzlvVYE1FNXWNIjW3B0sTXzwIwhTlXJ6qu16nAJg4suC7F/qdpp Lj/AMqXYgOeY9M8dRcDcyjbJxPEfRXwWJsHePioMS7w3lfWHU2bGsX0lgyFTnYKKET76rOZSC1c 5hg== X-Google-Smtp-Source: AGHT+IHyq6VJPX9FfnZ6MlDC8z80Nq1NL0zTef+h6tAa7MLyoEn6y308Ywn8tsXMMPCedlKPNa3CH2iVEzU= X-Received: from pjbqi17.prod.google.com ([2002:a17:90b:2751:b0:2ef:7352:9e97]) (user=seanjc job=prod-delivery.src-stubby-dispatcher) by 2002:a17:90b:1f8b:b0:2f6:539:3cd8 with SMTP id 98e67ed59e1d1-2f782ca2291mr5301914a91.18.1737132769931; Fri, 17 Jan 2025 08:52:49 -0800 (PST) Date: Fri, 17 Jan 2025 08:52:48 -0800 In-Reply-To: Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20250107042202.2554063-1-suleiman@google.com> <20250107042202.2554063-2-suleiman@google.com> Message-ID: Subject: Re: [PATCH v3 1/3] kvm: Introduce kvm_total_suspend_ns(). From: Sean Christopherson To: Suleiman Souhlal Cc: Paolo Bonzini , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , Chao Gao , David Woodhouse , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, ssouhlal@freebsd.org Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable On Fri, Jan 17, 2025, Suleiman Souhlal wrote: > On Thu, Jan 16, 2025 at 6:49=E2=80=AFAM Sean Christopherson wrote: > > > > On Tue, Jan 07, 2025, Suleiman Souhlal wrote: > > > It returns the cumulative nanoseconds that the host has been suspende= d. > > > It is intended to be used for reporting host suspend time to the gues= t. > > > > ... > > > > > #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER > > > +static int kvm_pm_notifier(struct kvm *kvm, unsigned long state) > > > +{ > > > + switch (state) { > > > + case PM_HIBERNATION_PREPARE: > > > + case PM_SUSPEND_PREPARE: > > > + last_suspend =3D ktime_get_boottime_ns(); > > > + case PM_POST_HIBERNATION: > > > + case PM_POST_SUSPEND: > > > + total_suspend_ns +=3D ktime_get_boottime_ns() - last_su= spend; > > > > After spending too much time poking around kvmlock and sched_clock code= , I'm pretty > > sure that accounting *all* suspend time to steal_time is wildly inaccur= ate for > > most clocksources that will be used by KVM x86 guests. > > > > KVM already adjusts TSC, and by extension kvmclock, to account for the = TSC going > > backwards due to suspend+resume. I haven't dug super deep, buy I assum= e/hope the > > majority of suspend time is handled by massaging guest TSC. > > > > There's still a notable gap, as KVM's TSC adjustments likely won't acco= unt for > > the lag between CPUs coming online and vCPU's being restarted, but I do= n't know > > that having KVM account the suspend duration is the right way to solve = that issue. >=20 > (It is my understanding that steal time has no impact on clock sources.) > On our machines, the problem isn't that the TSC is going backwards. As > you say, kvmclock takes care of that. >=20 > The problem these patches are trying to solve is that the time keeps > advancing for the VM while the host is suspended. Right, the issue is that because KVM adjusts guest TSC if the host TSC does= go backwards, then the accounting will be all kinds of messed up. 1. Initiate suspend at host TSC X, guest TSC X+Y. 2. Save X into last_host_tsc via kvm_arch_vcpu_put(): vcpu->arch.last_host_tsc =3D rdtsc(); 3. Resume after N hours, host TSC reset to 0 and starts counting. 4. kvm_arch_enable_virtualization_cpu() runs at new host time Z. 5. KVM detects backwards TSC (Z < X) and adjusts guest TSC offset so that= guest TSC stays at/near X+Y, i.e. guest TSC becomes "Z + Y + (X - Z)". u64 delta_cyc =3D max_tsc - local_tsc; list_for_each_entry(kvm, &vm_list, vm_list) { kvm->arch.backwards_tsc_observed =3D true; kvm_for_each_vcpu(i, vcpu, kvm) { vcpu->arch.tsc_offset_adjustment +=3D delta_cyc; vcpu->arch.last_host_tsc =3D local_tsc; kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); } Thus, if the guest is using the TSC for sched_clock, guest time does NOT ke= ep advancing. kvmclock on the other hand counts from *host* boot, and so guest time keeps advancing if the guest is using kvmclock. #ifdef CONFIG_X86_64 static s64 get_kvmclock_base_ns(void) { /* Count up from boot time, but with the frequency of the raw clock. */ return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot)= ); } #else static s64 get_kvmclock_base_ns(void) { /* Master clock not used, so we can just use CLOCK_BOOTTIME. */ return ktime_get_boottime_ns(); } #endif In short, AFAICT the issues you are observing are mostly a problem with kvm= clock. Or maybe it's the other way around and effectively freezing guest TSC is su= per problematic and fundamentally flawed. Regardless of which one is "broken", unconditionally accounting suspend tim= e to steal_time will do the wrong thing when sched_clock=3Dtsc. To further mudd= y the waters, current Linux-as-a-guest on modern hardware will likely use clockso= urce=3Dtsc, but sched_clock=3Dkvmclock. In that scenario, guest time doesn't advanced,= but guest scheduler time does. Ugh. That particular wart can be avoided by having the guest use TSC for sched_c= lock[*], e.g. so that at least the behavior of time is consistent. Hmm, if freezing guest time across suspend is indeed problematic, one thoug= ht would be to put the onus on the VMM/user to not advertise a "nonstop TSC" i= f the host may be suspending. The Linux-as-a-guest would prefer kvmclock over TS= C for both clocksource and sched_clock. [*] https://lore.kernel.org/all/Z4gqlbumOFPF_rxd@google.com