From: Sean Christopherson
Date: Wed, 4 Oct 2023 13:43:50 -0700
Subject: Re: [Patch v4 07/13] perf/x86: Add constraint for guest perf metrics event
References: <20230927113312.GD21810@noisy.programming.kicks-ass.net>
 <20230929115344.GE6282@noisy.programming.kicks-ass.net>
 <20231002115718.GB13957@noisy.programming.kicks-ass.net>
 <20231002204017.GB27267@noisy.programming.kicks-ass.net>
To: Mingwei Zhang
Cc: Peter Zijlstra, Ingo Molnar, Dapeng Mi, Paolo Bonzini,
    Arnaldo Carvalho de Melo, Kan Liang, Like Xu, Mark Rutland,
    Alexander Shishkin, Jiri Olsa, Namhyung Kim, Ian Rogers, Adrian Hunter,
    kvm@vger.kernel.org, linux-perf-users@vger.kernel.org,
    linux-kernel@vger.kernel.org, Zhenyu Wang, Zhang Xiong, Lv Zhiyuan,
    Yang Weijiang, Dapeng Mi,
    Jim Mattson, David Dunn, Thomas Gleixner

On Tue, Oct 03, 2023, Mingwei Zhang wrote:
> On Mon, Oct 2, 2023 at 5:56 PM Sean Christopherson wrote:
> > The "when" is what's important.  If KVM took a literal interpretation of
> > "exclude guest" for pass-through MSRs, then KVM would context switch all
> > those MSRs twice for every VM-Exit=>VM-Enter roundtrip, even when the
> > VM-Exit isn't a reschedule IRQ to schedule in a different task (or vCPU).
> > The overhead to save all the host/guest MSRs and load all of the
> > guest/host MSRs *twice* for every VM-Exit would be a non-starter.  E.g.
> > simple VM-Exits are completely handled in <1500 cycles, and "fastpath"
> > exits are something like half that.  Switching all the MSRs is likely
> > 1000+ cycles, if not double that.
>
> Hi Sean,
>
> Sorry, I have no intention to interrupt the conversation, but this is
> slightly confusing to me.
>
> I remember when doing AMX, we added gigantic 8KB memory in the FPU
> context switch. That works well in Linux today. Why can't we do the
> same for PMU? Assuming we context switch all counters, selectors and
> global stuff there?

That's what we (Google folks) are proposing.  However, there are significant
side effects if KVM context switches the PMU outside of vcpu_run(), whereas
the FPU doesn't suffer the same problems.

Keeping the guest FPU resident for the duration of vcpu_run() is, in terms of
functionality, completely transparent to the rest of the kernel.  From the
kernel's perspective, the guest FPU is just a variation of a userspace FPU,
and the kernel is already designed to save/restore userspace/guest FPU state
when the kernel wants to use the FPU for whatever reason.  And crucially,
kernel FPU usage is explicit and contained, e.g. see kernel_fpu_{begin,end}(),
and comes with mechanisms for KVM to detect when the guest FPU needs to be
reloaded (see TIF_NEED_FPU_LOAD).

The PMU is a completely different story.  PMU usage, a.k.a. perf, is by design
"always running".  KVM can't transparently stop host usage of the PMU, as
disabling host PMU usage stops perf events from counting/profiling whatever
it is they're supposed to profile.

Today, KVM minimizes the "downtime" of host PMU usage by context switching
PMU state at VM-Enter and VM-Exit, or at least as close to VM-Enter/VM-Exit
as possible, e.g. for LBRs and Intel PT.

What we are proposing would *significantly* increase that downtime, to the
point where it would be almost unbounded in some paths.  E.g. if KVM faults
in a page, gup() could swap memory in from disk, install PTEs, and so on and
so forth.  If the host is trying to profile something related to swap or
memory management, they're out of luck.
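
To make the FPU comparison concrete, here is a minimal sketch of the
"explicit and contained" usage pattern.  kernel_fpu_begin()/kernel_fpu_end()
are the real x86 APIs; the wrapper function and the work inside it are
hypothetical, added only for illustration.

#include <asm/fpu/api.h>

/*
 * Hypothetical example of kernel-mode FPU use.  kernel_fpu_begin() saves
 * whatever state currently owns the FPU registers (userspace or guest) and
 * disables preemption; kernel_fpu_end() re-enables preemption.  The saved
 * state is then reloaded lazily (TIF_NEED_FPU_LOAD-style tracking) before
 * returning to userspace or re-entering the guest.
 */
static void example_simd_work(void)
{
	kernel_fpu_begin();
	/* SSE/AVX instructions may be used here; sleeping is not allowed. */
	kernel_fpu_end();
}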
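
And a rough pseudocode sketch of the two PMU placements being discussed.
None of these helpers or types exist in KVM under these names (the stubs are
there only so the sketch is self-contained); it only shows where PMU state
would change hands in each scheme.

#include <stdbool.h>

struct vcpu;	/* stand-in for the real vCPU structure */

/* Empty stubs so the sketch compiles; the real operations are elided. */
static void load_guest_pmu_state(struct vcpu *vcpu) { (void)vcpu; }
static void load_host_pmu_state(struct vcpu *vcpu)  { (void)vcpu; }
static void enter_guest(struct vcpu *vcpu)          { (void)vcpu; }
static void handle_exit(struct vcpu *vcpu)          { (void)vcpu; }
static bool need_userspace_exit(struct vcpu *vcpu)  { (void)vcpu; return true; }

/*
 * Today's scheme: PMU state is swapped as close to VM-Enter/VM-Exit as
 * possible, so host perf is only blind for the narrow window in which the
 * guest is actually running.
 */
static void run_guest_today(struct vcpu *vcpu)
{
	load_guest_pmu_state(vcpu);	/* swap in guest counters/LBRs */
	enter_guest(vcpu);		/* guest runs until the next VM-Exit */
	load_host_pmu_state(vcpu);	/* host perf resumes immediately */

	handle_exit(vcpu);		/* may fault in pages, call gup(), sleep, ... */
}

/*
 * Proposed scheme: PMU state is swapped at vcpu_run() boundaries, so host
 * perf is blind for the entire run loop, including exit handling that can
 * block for an effectively unbounded amount of time.
 */
static void vcpu_run_proposed(struct vcpu *vcpu)
{
	load_guest_pmu_state(vcpu);
	do {
		enter_guest(vcpu);
		handle_exit(vcpu);	/* host perf sees none of this */
	} while (!need_userspace_exit(vcpu));
	load_host_pmu_state(vcpu);
}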