Date: Mon, 10 Jun 2024 22:54:05 +0200
From: Peter Zijlstra
To: Stephane Eranian
Cc: LKML, Ian Rogers, "Liang, Kan", Andi Kleen, Ingo Molnar,
    "Narayan, Ananth", "Bangoria, Ravikumar", Namhyung Kim,
    Mingwei Zhang, Dapeng Mi, Zhang Xiong
Subject: Re: [RFC] perf_events: exclude_guest impact on time_enabled/time_running
Message-ID: <20240610205405.GA8774@noisy.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jun 06, 2024 at 12:57:35AM -0700, Stephane Eranian wrote:
> Hi Peter,
>
> In the context of the new vPMU passthru patch series, we have to look
> closer at the definition and implementation of the exclude_guest
> filter in the perf_event_attr structure. This filter has been in the
> kernel for many years.
> See patch:
> https://lore.kernel.org/all/20240506053020.3911940-8-mizhang@google.com/
>
> The presumed definition of the filter is that the user does not want
> the event to count while the processor is running in guest mode (i.e.,
> inside the virtual machine guest OS or guest user code).
>
> The perf tool sets it by default on all core PMU events:
> $ perf stat -vv -e cycles sleep 0
> ------------------------------------------------------------
> perf_event_attr:
>   size                             112
>   sample_type                      IDENTIFIER
>   read_format                      TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING
>   disabled                         1
>   inherit                          1
>   enable_on_exec                   1
>   exclude_guest                    1
> ------------------------------------------------------------
>
> In the kernel, the way this is treated differs between AMD and Intel
> because AMD does provide a hardware filter for guest vs. host in the
> PMU counters whereas Intel does not. For the latter, the kernel
> simply disables the event in the hardware counters, i.e., the event is
> not descheduled. Both approaches produce pretty much the same desired
> effect: the event is not counted while in guest mode.
>
> The issue I would like to raise has to do with the effects on
> time_enabled and time_running for exclude_guest=1 events.
>
> Given the event is not scheduled out while in guest mode, even though
> it is stopped, both time_enabled and time_running continue ticking
> while in guest mode. If a measurement is 10s long but only 5s are in
> non-guest mode, then time_enabled=10s, time_running=10s. The count
> appears to represent 10s worth of non-guest mode, of which only 5s
> were really actively monitored, but the user has no way of
> determining this.
>
> If we look at vPMU passthru, the host event must have exclude_guest=1
> to avoid going into an error state on context switch to the vCPU
> thread (with vPMU enabled). But this time, the event is scheduled out,
> which means that time_enabled keeps counting, but time_running stops.
> On context switch back in, the host event is scheduled again and
> time_running restarts ticking. For a 10s measurement, where 5s were in
> the guest, the event will come out with time_enabled=10s,
> time_running=5s, and the tool will scale it up because it thinks the
> event was multiplexed, when in fact it was not. This is not the
> intended outcome here. The tool should not scale the count: it was not
> multiplexed, it was descheduled because the filter forced it out.
> Note that if the event had been multiplexed while running on the host,
> then the scaling would be appropriate.
>
> In that case, I argue, time_running should be updated to cover the
> time the event was not running. That would bring us back to the case I
> was describing earlier.
>
> It boils down to the exact definition of exclude_guest and expected
> impact on time_enabled and time_running. Then, with or without vPMU
> passthru, we can fix the kernel to ensure a uniform behavior.
>
> What are your thoughts on this problem?

So with those patches having explicit scheduling points, we can actually
do this time accounting accurately, so I don't see a reason to not do
the right thing here.

Hysterically this was left vague in order to be able to avoid the
scheduling for these scenarios -- performance raisins etc.

The thing is, if you push this to its limits, we should start time
accounting for the ring selectors too, and that's going to be painful.
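
To make the arithmetic in the two scenarios quoted above concrete, here
is a minimal sketch of the usual multiplexing estimate,
count * time_enabled / time_running. The function name and the raw count
are made up for illustration; only the 10s/5s figures come from the
example in the quoted mail, and this is not the perf tool's actual code.

```python
SECOND_NS = 1_000_000_000


def scaled_count(raw_count, time_enabled_ns, time_running_ns):
    """Extrapolate a raw count to the full enabled window.

    Hypothetical helper: mirrors the common count * enabled / running
    estimate that a tool applies when it assumes the event was multiplexed.
    """
    if time_running_ns == 0:
        # The event never ran, so there is nothing to extrapolate from.
        return 0.0
    return raw_count * time_enabled_ns / time_running_ns


# Assumed raw count, accumulated during the 5s spent outside the guest.
raw = 1_000_000

# Event stays scheduled while in guest mode (counter merely stopped):
# both times keep ticking, so no scaling happens.
stays_scheduled = scaled_count(raw, 10 * SECOND_NS, 10 * SECOND_NS)

# vPMU passthru schedules the event out: time_running stops, so the
# tool doubles the count as if the event had been multiplexed.
scheduled_out = scaled_count(raw, 10 * SECOND_NS, 5 * SECOND_NS)

print(stays_scheduled, scheduled_out)
```

The same raw count is reported as-is in the first case and doubled in
the second, which is the non-uniform behavior the quoted mail objects to.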