From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4F19EC36002 for ; Tue, 25 Mar 2025 01:10:54 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 13F6C10E0A5; Tue, 25 Mar 2025 01:10:54 +0000 (UTC) Authentication-Results: gabe.freedesktop.org; dkim=pass (2048-bit key; unprotected) header.d=intel.com header.i=@intel.com header.b="n2O/Q+nZ"; dkim-atps=neutral Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.11]) by gabe.freedesktop.org (Postfix) with ESMTPS id 96FD110E0A5 for ; Tue, 25 Mar 2025 01:10:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1742865052; x=1774401052; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version:content-transfer-encoding; bh=lA5GEoPCvfgbLUgMN4C+LFZkLkfQxuZY0ZUG+xnP1JY=; b=n2O/Q+nZlLRBXje75EQHe3rnLxL3Vt0WAQYFhSnSQcMMRCSR6Ne8yQey GCmZ4/HXPQ7s5YR/pytD3xK3W6pTp1AHGzFZ8/Pbmk+jg8BOSUYjJYU0x 41sSIbx4IcHKMgf7CDZ9Vwds8vohkNG9SLBn7U5HfzpAEcMrEbWcaz7Pf oHolVYD2CHXxiF7nNrMDLNkAuA77Im7gcjWNCnOseFM+q/HU7rYkusXq3 f3pPH0dN+BX6bzmb7cWa+kg5g0BUfvdKN8oSdNy1t1N4jE0fxZwiqo/9a hQAYWVWW3BJJpAP7NeaQ8SKMy3HF236sNdQGGtA0LBPnOpaRdIvN7yinm Q==; X-CSE-ConnectionGUID: zRlUdewFTL2CNFiDUBfqGg== X-CSE-MsgGUID: +D80xzIRSqSrQX97DTKkeg== X-IronPort-AV: E=McAfee;i="6700,10204,11383"; a="54301470" X-IronPort-AV: E=Sophos;i="6.14,274,1736841600"; d="scan'208";a="54301470" Received: from orviesa009.jf.intel.com ([10.64.159.149]) by orvoesa103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Mar 2025 18:10:51 -0700 X-CSE-ConnectionGUID: tcW++j0xTvyfPTzF/pqtVw== X-CSE-MsgGUID: egnBrHcZTY6/dBC9YKRZiw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.14,274,1736841600"; d="scan'208";a="123931785" Received: from bbeebe-mobl.amr.corp.intel.com (HELO adixit-MOBL3.intel.com) ([10.125.186.95]) by orviesa009-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Mar 2025 18:10:51 -0700 Date: Mon, 24 Mar 2025 18:10:49 -0700 Message-ID: <87iknxap9i.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: "Belgaumkar, Vinay" Cc: "Vivi, Rodrigo" , "De Marchi, Lucas" , "intel-xe@lists.freedesktop.org" , "Tauro,\ Riana" Subject: Re: [PATCH v3] drm/xe/pmu: Add GT frequency events In-Reply-To: References: <20250312001408.804125-1-vinay.belgaumkar@intel.com> <4283fcec-5ddd-4d06-9eae-8e622e9695ee@intel.com> <87o6xsai7y.wl-ashutosh.dixit@intel.com> <87ldsu9z1s.wl-ashutosh.dixit@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/29.4 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-BeenThere: intel-xe@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel Xe graphics driver List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-xe-bounces@lists.freedesktop.org Sender: "Intel-xe" On Mon, 24 Mar 2025 15:14:55 -0700, Belgaumkar, Vinay wrote: > On Mon, Mar 24, 2025 at 02:07:32PM -0500, Lucas De Marchi wrote: > > On Mon, Mar 24, 2025 at 09:24:47AM -0700, Ashutosh Dixit wrote: > > > On Sat, 22 Mar 2025 19:47:04 -0700, Lucas De Marchi wrote: > > > > > > > > On Sat, Mar 22, 2025 at 02:06:09PM -0700, Ashutosh Dixit wrote: > > > > > On Sat, 22 Mar 2025 05:57:12 -0700, Lucas De Marchi wrote: > > > > >> > > > > >> On Fri, Mar 21, 2025 at 03:45:21PM -0700, Belgaumkar, Vinay wrot= e: > > > > >> > > > > > >> > On 3/13/2025 2:45 PM, Lucas De Marchi wrote: > > > > >> >> On Tue, Mar 11, 2025 at 05:14:08PM -0700, Vinay Belgaumkar wr= ote: > > > > >> >>> Define PMU events for GT frequency (actual and requested). > > > > >> >>> This is a port from the i915 driver implementation, where > > > > >> >>> an internal timer is used to aggregate GT frequencies over c= ertain fixed interval. > > > > >> >>> Following PMU events are being added- > > > > >> >> =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 ^ > > > > >> >> why do you use "-"=A0 instead of ":"? > > > > >> >> > > > > >> >>> > > > > >> >>> =A0xe_0000_00_02.0/gt-actual-frequency/=A0=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0=A0=A0 [Kernel > > > > >> >>> PMU event] > > > > >> >>> =A0xe_0000_00_02.0/gt-requested-frequency/=A0=A0=A0=A0=A0=A0= =A0=A0=A0=A0 [Kernel > > > > >> >>> PMU event] > > > > >> >>> > > > > >> >>> Standard perf commands can be used to monitor GT frequency- > > > > >> >>> =A0$ perf stat -e > > > > >> >>> xe_0000_00_02.0/gt-requested-frequency,gt=3D0/ -I1000 > > > > >> >>> > > > > >> >>> =A0=A0=A0 1.001175175=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 700 M > > > > >> >>> xe/gt-requested-frequency,gt=3D0/ > > > > >> >>> =A0=A0=A0 2.005891881=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 703 M > > > > >> >>> xe/gt-requested-frequency,gt=3D0/ > > > > >> >>> =A0=A0=A0 3.007318169=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0= =A0=A0 700 M > > > > >> >>> xe/gt-requested-frequency,gt=3D0/ > > > > >> >>> > > > > >> >>> Actual frequencies will be reported as 0 when GT is in C6. > > > > >> >> > > > > >> >> > > > > >> >> I think we need to document somewhere, but at the very least > > > > >> >> in the commit message what the event count actually is. Let > > > > >> >> me see if I get this right:=A0 if userspace is sampling every > > > > >> >> 1sec, and assuming the gpu is at 700MHz for the first 0.5sec > > > > >> >> and at 1.4 GHz, userspace should expect to see ~1050 as the > > > > >> >> value. Correct?=A0 I find this frequency handling very > > > > >> >> different from anything else reported via perf. Other than i9= 15, are there any other cases you know of? > > > > >> > > > > > >> > Yes,=A0 user will see a gradual ramp. One thing that could work > > > > >> > is something along these lines: > > > > >> > > > > > >> > @@ -324,7 +327,10 @@ static void xe_pmu_event_update(struct > > > > >> > perf_event > > > > >> > *event) > > > > >> > =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 new =3D __xe_pmu= _event_read(event); > > > > >> > =A0=A0=A0=A0=A0=A0=A0 } while (!local64_try_cmpxchg(&hwc->prev= _count, > > > > >> > &prev, new)); > > > > >> > > > > > >> > -=A0=A0=A0=A0=A0=A0 local64_add(new - prev, &event->count); > > > > >> > +=A0=A0=A0=A0=A0=A0 if (is_gt_frequency_event(id)) > > > > >> > +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 local64_add(new, &= event->count); > > > > >> > +=A0=A0=A0=A0=A0=A0 else > > > > >> > +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 local64_add(new - = prev, &event->count); > > > > >> > > > > > >> > This will give us instantaneous values and will not need the > > > > >> > use of an internal timer.=A0 Should be ok to do it this way? > > > > >> > > > > >> yes, I think it'd be preferred. It's much simpler and don't > > > > >> prevent us from eventually adding an avg_* event if the needs > > > > >> arise. Which would also be clearer on what that is. > > > > > > > > > > Instantaneous values would show 0 if the sampling instant landed > > > > > when gt is > > > > > > > > instantaneous values make much more sense from the perf point of > > > > view IMO. perf is **sampling** them - other than the same thing in > > > > i915, I've never seen an event that returns an average (not even > > > > documenting it as such). > > > > > > > > What would also make sense would be to support perf-record so the > > > > samples are recorded in the buffer that is passed to userspace. In > > > > this case setting up a timer to save them (since we don't have > > > > interrupt for our pmu) would be nice. > > > > > > Afaik, typically userspace calls into the kernel for PMU counters > > > relatively infrequently, say once a second (1 Hz). However, if this > > > low rate does not yield good freq values (say freq is showing lots > > > of 0's), now > > > > why do you say it's not getting good value? is 0 not a good value? > > Should the kernel be returning something else instead of doing an > > average with "lots of zeros" and returning it? > > well, zero is not a good value because it is not real. The register is > zero because the GT is sleeping, but that is not the accurate thing. Not necessarily a not-good value. For example, if the freq is 1 GHz 50% of the time and GT is sleeping 50% of the time, you could say the average freq is 500 MHz. But userspace would have to sample at a high enough rate to figure out GT is sleeping 50% of the time. Maybe 10 or 20 Hz is enough, see below. i915 disregards C6 so would report the average as 1 GHz. The 200 Hz sampling rate for averaging in i915 was designed based on host turbo I believe. > > but tbh we are already taking this decision to our sysfs interface: > > $ cat act_freq > 0 > > But well, for this instantaneous it may be good... we just need to > educate the consumers to see the zero as gpu sleeping in a chart. but > zero is an outlier to the average of the frequencies that it will bring > the mean to a very lower levels and the charts collected with PMU will be > so out of shape to the point it might be useless... > > True. Better to stick with instantaneous value in this case. > > > > > > userspace will need to call into the kernel more frequently, say 10 > > > Hz to > > > 100 Hz, just to get good freq values (all other counters are fine at > > > 1 > > > > It should be perfectly fine for an application to setup a 10Hz timer > > and getting it multiple times if it wants to average.... I don't think > > it needs to be at the hundreds frequency. This could depend on how fast SLPC changes the freq and maybe @Belgaumkar, Vinay might know what is the typical rate at which userspace would need to sample to get a good average. > > > > > Hz). This increased rate of calling into the kernel may be ok for > > > 'perf record' but not sure what happens to say 'gputop'. > > > > oh no, perf-record works diferently... it's not calling into the > > kernel with that frequency. It's perf-stat that works that way. > > > > For perf-record it more like kernel pushing samples in a ringbuffer, > > shared with userspace, and then waking it. We (currently) don't > > support that mode in our implementation: > > > > static int xe_pmu_event_init(struct perf_event *event) { > > ... > > /* unsupported modes and filters */ > > if (event->attr.sample_period) /* no sampling */ > > return -EINVAL; > > ... > > } > > > > > > > > > > I think considerations like this went into i915 deciding to average > > > in the kernel. > > > > as you noticed, maybe there was actually a lack of thought on the > > matter and then later it couldn't be changed anymore... ? I saw > > several odd things in the i915 side. Exposing something with a Hz unit > > is one of them. > > > > > > > > Also, if we are going to expose instantaneous freq, if gt is in C6 > > > (when userspace calls into the kernel to sample), could we save and > > > return the freq /before/ gt went into C6 (to avoid returning 0 freq, > > > and returning the last freq as it was when workload was running)? > > > This might be sufficient and remove the need to sample more > > > frequently, though not sure how complex to implement. > > > > if that data would make more sense for userspace, then I think it > > should be ok. So, let's say gt in in C6, what does make more sense for > > gputop-like and turbostat? > > > > 1) read 0 from the kernel and report 0 to the user > > 2) read 0 from the kernel and report min_freq to the user (min_freq they > > can read once from from sysfs) > > 3) read min_freq from the kernel > > 4) read the last frequency before going into C6 This depends on what we want userland to do: * If we are asking userspace to read the freq's from PMU at a high rate and then average those freq's, we should report 0 * What I was saying was, is it possible to always return a "representative" freq, so that userspace does not have to sample at a high rate and not average the freq. If we can return this "representative" freq, without averaging in the kernel. Can this freq be the last requested/actual freq before gt went into C6? If SLPC is changing the freq rapidly, this would not be accurate either, but maybe it's good enough? And we'd have to tell userspace that it should *not* average the retrieved freq, that would be inaccurate. If they really want an accurate average they should use sysfs. > > The GT requested frequency will already show us the last requested > value. Assuming no throttling, there is a good chance this is what the GT > was at. Besides, KMD does not control when GT goes to C6, GuC does the > idle indication (since we are using GuC RC). As mentioned on v4, if we are not waking the card up, the requested freq would itself be 0? > > > > > > > > > > > > in C6. Also instantaneous values are already available via sysfs. > > > > > > > > with the caveat of a slower interface, but yeah... that's why I > > > > was asking earlier: do we really need this if it's already > > > > available via sysfs? I was told turbostat wants to consolidate all > > > > the data to be read via perf instead of reading a bunch of files fr= om sysfs. > > > > > > OK. > > > > > > > > > > > > Also, the way i915 calculates the average freq is itself > > > > > controversial. i915 computes average freq when gt is awake and > > > > > disregards those intervals when gt is in C6. There was this > > > > > patch to suggest that average freq also take into account those i= ntervals when gt is in C6: > > > > > > > > > > https://patchwork.freedesktop.org/series/109372/ > > > > > > > > > > But it was not accepted since it would change uapi. > > > > > > > > one more reason not to fall in the same trap. Let userspace decide > > > > if it wants the average or not. If it'd need to sample too > > > > frequently, then I think we need to look at supporting perf-record > > > > rather than doing the average (which is actually an integration, > > > > using the rectangle method, and then dividing by the perf_read > > > > delta time). At the very least, rename the event to make the avg ex= plicit. > > > > > > OK, we can return instantaneous freq for now and if that doesn't > > > work out, add an average freq later?