From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 479A0C76195 for ; Mon, 27 Mar 2023 17:05:42 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C93BE10E3D1; Mon, 27 Mar 2023 17:05:41 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 22D3D10E3C9; Mon, 27 Mar 2023 17:05:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1679936740; x=1711472740; h=date:message-id:from:to:cc:subject:in-reply-to: references:mime-version:content-transfer-encoding; bh=92okBTMoaNA1TKXdXOgZKKqYGYd7hBrDWZR4MLvGXAY=; b=H60+2lHQPcEiFW4I3gW50FxG2gi8roYFd8kTLx7+6PxExmIKtazE1Z6J RXaF1yR2+tCm/C7TNWk06qsY0MEaTAaVjAfKDlcQNXsHs/q2Omdgm7Fzh fJf3/5PATFSczrV6Ty+h3jVNSsGH30y8Q4GOY1CaRChxdC43XlEysMtFJ nW3QzyCrUSNDSV1cwLZa9Ilf7+r66r/AXwygkXV8QJv0jnAp3I4TFBkE1 gsvWe8cCy2Y3H2ZUAlFdqk7kiH2lgxfx6JTXmMg5X9j7EjzFJU5CjeUgW L2u/6o3l6i6NJKmszuzm1Pms4/dHhsJct5IYfsuakFwUoAj5PA12cT0Nh Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10662"; a="402933170" X-IronPort-AV: E=Sophos;i="5.98,295,1673942400"; d="scan'208";a="402933170" Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Mar 2023 10:05:39 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10662"; a="748074989" X-IronPort-AV: E=Sophos;i="5.98,295,1673942400"; d="scan'208";a="748074989" Received: from adixit-mobl.amr.corp.intel.com (HELO adixit-arch.intel.com) ([10.212.162.223]) by fmsmga008-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 27 Mar 2023 10:05:37 -0700 Date: Mon, 27 Mar 2023 09:58:52 -0700 Message-ID: <87mt3yku5v.wl-ashutosh.dixit@intel.com> From: "Dixit, Ashutosh" To: Rodrigo Vivi In-Reply-To: References: <20230316035954.2593843-1-ashutosh.dixit@intel.com> <4760d41f-c237-9f97-eb32-5d2ab05eea20@intel.com> <87sfdtload.wl-ashutosh.dixit@intel.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?ISO-8859-4?Q?Goj=F2?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/28.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: Disable PL1 power limit when loading GuC firmware X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Sun, 26 Mar 2023 04:52:59 -0700, Rodrigo Vivi wrote: > Hi Rodrigo, > On Fri, Mar 24, 2023 at 04:31:22PM -0700, Dixit, Ashutosh wrote: > > On Fri, 24 Mar 2023 11:15:02 -0700, Belgaumkar, Vinay wrote: > > > > > > > Hi Vinay, > > > > Thanks for the review. Comments inline below. > > > > > On 3/15/2023 8:59 PM, Ashutosh Dixit wrote: > > > > On dGfx, the PL1 power limit being enabled and set to a low value r= esults > > > > in a low GPU operating freq. It also negates the freq raise operati= on which > > > > is done before GuC firmware load. As a result GuC firmware load can= time > > > > out. Such timeouts were seen in the GL #8062 bug below (where the P= L1 power > > > > limit was enabled and set to a low value). Therefore disable the PL= 1 power > > > > limit when allowed by HW when loading GuC firmware. > > > v3 label missing in subject. > > > > > > > > v2: > > > > - Take mutex (to disallow writes to power1_max) across GuC reset/= fw load > > > > - Add hwm_power_max_restore to error return code path > > > > > > > > v3 (Jani N): > > > > - Add/remove explanatory comments > > > > - Function renames > > > > - Type corrections > > > > - Locking annotation > > > > > > > > Link: https://gitlab.freedesktop.org/drm/intel/-/issues/8062 > > > > Signed-off-by: Ashutosh Dixit > > > > --- > > > > drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++++++ > > > > drivers/gpu/drm/i915/i915_hwmon.c | 39 ++++++++++++++++++++++= +++++ > > > > drivers/gpu/drm/i915/i915_hwmon.h | 7 +++++ > > > > 3 files changed, 55 insertions(+) > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/dr= m/i915/gt/uc/intel_uc.c > > > > index 4ccb4be4c9cba..aa8e35a5636a0 100644 > > > > --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c > > > > +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c > > > > @@ -18,6 +18,7 @@ > > > > #include "intel_uc.h" > > > > #include "i915_drv.h" > > > > +#include "i915_hwmon.h" > > > > static const struct intel_uc_ops uc_ops_off; > > > > static const struct intel_uc_ops uc_ops_on; > > > > @@ -461,6 +462,7 @@ static int __uc_init_hw(struct intel_uc *uc) > > > > struct intel_guc *guc =3D &uc->guc; > > > > struct intel_huc *huc =3D &uc->huc; > > > > int ret, attempts; > > > > + bool pl1en; > > > > > > Init to 'false' here > > > > See next comment. > > > > > > > > > > > > GEM_BUG_ON(!intel_uc_supports_guc(uc)); > > > > GEM_BUG_ON(!intel_uc_wants_guc(uc)); > > > > @@ -491,6 +493,9 @@ static int __uc_init_hw(struct intel_uc *uc) > > > > else > > > > attempts =3D 1; > > > > + /* Disable a potentially low PL1 power limit to allow freq to be > > > > raised */ > > > > + i915_hwmon_power_max_disable(gt->i915, &pl1en); > > > > + > > > > intel_rps_raise_unslice(&uc_to_gt(uc)->rps); > > > > while (attempts--) { > > > > @@ -547,6 +552,8 @@ static int __uc_init_hw(struct intel_uc *uc) > > > > intel_rps_lower_unslice(&uc_to_gt(uc)->rps); > > > > } > > > > + i915_hwmon_power_max_restore(gt->i915, pl1en); > > > > + > > > > guc_info(guc, "submission %s\n", str_enabled_disabled(intel_uc_uses= _guc_submission(uc))); > > > > guc_info(guc, "SLPC %s\n", str_enabled_disabled(intel_uc_uses_guc_s= lpc(uc))); > > > > @@ -563,6 +570,8 @@ static int __uc_init_hw(struct intel_uc *uc) > > > > /* Return GT back to RPn */ > > > > intel_rps_lower_unslice(&uc_to_gt(uc)->rps); > > > > + i915_hwmon_power_max_restore(gt->i915, pl1en); > > > > > > if (pl1en) > > > > > > =A0=A0=A0 i915_hwmon_power_max_enable(). > > > > IMO it's better not to have checks in the main __uc_init_hw() function = (if > > we do this we'll need to add 2 checks in __uc_init_hw()). If you really > > want we could do something like this inside > > i915_hwmon_power_max_disable/i915_hwmon_power_max_restore. But for now I > > am not making any changes. > > > > (I can send a patch with the changes if you want to take a look but IMO= it > > will add more logic/code but without real benefits (it will save a rmw = if > > the limit was already disabled, but IMO this code is called so infreque= ntly > > (only during GuC resets) as to not have any significant impact)). > > > > > > > > > + > > > > __uc_sanitize(uc); > > > > if (!ret) { > > > > diff --git a/drivers/gpu/drm/i915/i915_hwmon.c b/drivers/gpu/drm/i9= 15/i915_hwmon.c > > > > index ee63a8fd88fc1..769b5bda4d53f 100644 > > > > --- a/drivers/gpu/drm/i915/i915_hwmon.c > > > > +++ b/drivers/gpu/drm/i915/i915_hwmon.c > > > > @@ -444,6 +444,45 @@ hwm_power_write(struct hwm_drvdata *ddat, u32 = attr, int chan, long val) > > > > } > > > > } > > > > +void i915_hwmon_power_max_disable(struct drm_i915_private *i915,= bool > > > > *old) > > > Shouldn't we call this i915_hwmon_package_pl1_disable()? > > > > I did think of using "pl1" in the function name but then decided to ret= ain > > "power_max" because other hwmon functions for PL1 limit also use > > "power_max" (hwm_power_max_read/hwm_power_max_write) and currently > > "hwmon_power_max" is mapped to the PL1 limit. So "power_max" is used to > > show that all these functions deal with the PL1 power limit. > > > > There is a comment in __uc_init_hw() explaining "power_max" means the P= L1 > > power limit. > > > > > > + __acquires(i915->hwmon->hwmon_lock) > > > > +{ > > > > + struct i915_hwmon *hwmon =3D i915->hwmon; > > > > + intel_wakeref_t wakeref; > > > > + u32 r; > > > > + > > > > + if (!hwmon || !i915_mmio_reg_valid(hwmon->rg.pkg_rapl_limit)) > > > > + return; > > > > + > > > > + /* Take mutex to prevent concurrent hwm_power_max_write */ > > > > + mutex_lock(&hwmon->hwmon_lock); > > > > + > > > > + with_intel_runtime_pm(hwmon->ddat.uncore->rpm, wakeref) > > > > + r =3D intel_uncore_rmw(hwmon->ddat.uncore, > > > > + hwmon->rg.pkg_rapl_limit, > > > > + PKG_PWR_LIM_1_EN, 0); > > > Most of this code (lock and rmw parts) is already inside static void > > > hwm_locked_with_pm_intel_uncore_rmw() , can we reuse that here? > > > > This was the case in v1 of the patch: > > > > https://patchwork.freedesktop.org/patch/526393/?series=3D115003&rev=3D1 > > > > But now this cannot be done because if you notice we acquire the mutex = in > > i915_hwmon_power_max_disable() and release the mutex in > > i915_hwmon_power_max_restore(). > > > > I explained the reason why this the mutex is handled this way in my rep= ly > > to Jani Nikula here: > > > > https://patchwork.freedesktop.org/patch/526598/?series=3D115003&rev=3D2 > > > > Quoting below: > > > > ``` > > > > + /* hwmon_lock mutex is unlocked in hwm_power_max_restore */ > > > > > > Not too happy about that... any better ideas? > > > > Afais, taking the mutex is the only fully correct solution (when we dis= able > > the power limit, userspace can go re-enable it). Examples of partly > > incorrect solutions (which don't take the mutex) include: > > > > a. Don't take the mutex, don't do anything, ignore any changes to the v= alue > > if it has changed during GuC reset/fw load (just overwrite the chang= ed > > value). Con: changed value is lost. > > > > b. Detect if the value has changed (the limit has been re-enabled) afte= r we > > have disabled the limit and in that case skip restoring the value. B= ut > > then someone can say why do we allow enabling the PL1 limit since we > > want to disable it. > > > > Both these are very unlikely scenarios so they might work. But I would > > first like to explore if holding a mutex across GuC reset is prolebmatic > > since that is /the/ correct solution. But if anyone comes up with a rea= son > > why that cannot be done we can look at these other not completely corre= ct > > options. > > I see what you are doing and it looks indeed a very safe approach to ensu= re > the pl1 won't be toggled by other paths while we need some guaranteed sta= te > here, or hw init fails badly. > > But in the end you are making your lock to protect the code from another = path > and not protecting the data itself. The data was already protected in the > first version with the lock in the rmw. Sorry I am not really following. Daniel had mentioned this "protecting code vs protecting data" but I am wondering how it is applicable in this case. IMO here the data we are protecting is the register which we don't want written to by userland while GuC load is in progress. To do that we need to block the code path writing to register. So what we have here seems to me to be the simplest and cleanest approach for solving this issue. > maybe we need to have some kind of a state check with other state-lock and > then if we are in this forced state for init path, the request for the no= rmal path > ignores and move one, I don't see how this will *not* be racy... > or maybe we queue some request... Queuing a request will not be enough (even if this is possible), the request will need to wait to complete till GuC load completes. So we'll have to complete the request when GuC load completes, similar to releasing the mutex in the current patch. Looks like a much more complicated way of doing what the mutex does very simply. So: a. What is the real problem with the current implementation? b. What would be the correct solution for it? That is how, specifically, should we implement it? Some more guidance will be helpful if you think this patch has issues. Thanks. -- Ashutosh > > ``` > > > > > > + > > > > + *old =3D !!(r & PKG_PWR_LIM_1_EN); > > > > +} > > > > + > > > > +void i915_hwmon_power_max_restore(struct drm_i915_private *i915, b= ool old) > > > > + __releases(i915->hwmon->hwmon_lock) > > > We can just call this i915_hwmon_power_max_enable() and call whenever= the > > > old value was actually enabled. That way, we have proper mirror funct= ions. > > > > As I explained that would mean adding two checks in the main __uc_init_= hw() > > function which I am trying to avoid. So we have disable/restore pair. > > > > > > +{ > > > > + struct i915_hwmon *hwmon =3D i915->hwmon; > > > > + intel_wakeref_t wakeref; > > > > + > > > > + if (!hwmon || !i915_mmio_reg_valid(hwmon->rg.pkg_rapl_limit)) > > > > + return; > > > > + > > > > + with_intel_runtime_pm(hwmon->ddat.uncore->rpm, wakeref) > > > > + intel_uncore_rmw(hwmon->ddat.uncore, > > > > + hwmon->rg.pkg_rapl_limit, > > > > + PKG_PWR_LIM_1_EN, > > > > + old ? PKG_PWR_LIM_1_EN : 0); > > > > > > 3rd param should be 0 here, else we will end up clearing other bits. > > > > No see intel_uncore_rmw(), it will only clear the PKG_PWR_LIM_1_EN bit,= so > > the code here is correct. intel_uncore_rmw() does: > > > > val =3D (old & ~clear) | set; > > > > So for now I am not making any changes, if you feel strongly about > > something one way or another let me know. Anyway these comments should = help > > you understand the patch better so take a look and we can go from there. > > > > Thanks. > > -- > > Ashutosh > > > > > > + > > > > + mutex_unlock(&hwmon->hwmon_lock); > > > > +} > > > > + > > > > static umode_t > > > > hwm_energy_is_visible(const struct hwm_drvdata *ddat, u32 attr) > > > > { > > > > diff --git a/drivers/gpu/drm/i915/i915_hwmon.h b/drivers/gpu/drm/i9= 15/i915_hwmon.h > > > > index 7ca9cf2c34c96..0fcb7de844061 100644 > > > > --- a/drivers/gpu/drm/i915/i915_hwmon.h > > > > +++ b/drivers/gpu/drm/i915/i915_hwmon.h > > > > @@ -7,14 +7,21 @@ > > > > #ifndef __I915_HWMON_H__ > > > > #define __I915_HWMON_H__ > > > > +#include > > > > + > > > > struct drm_i915_private; > > > > +struct intel_gt; > > > > #if IS_REACHABLE(CONFIG_HWMON) > > > > void i915_hwmon_register(struct drm_i915_private *i915); > > > > void i915_hwmon_unregister(struct drm_i915_private *i915); > > > > +void i915_hwmon_power_max_disable(struct drm_i915_private *i915, b= ool *old); > > > > +void i915_hwmon_power_max_restore(struct drm_i915_private *i915, b= ool old); > > > > #else > > > > static inline void i915_hwmon_register(struct drm_i915_private *i= 915) { }; > > > > static inline void i915_hwmon_unregister(struct drm_i915_private = *i915) { }; > > > > +static inline void i915_hwmon_power_max_disable(struct drm_i915_pr= ivate *i915, bool *old) { }; > > > > +static inline void i915_hwmon_power_max_restore(struct drm_i915_pr= ivate *i915, bool old) { }; > > > > #endif > > > > #endif /* __I915_HWMON_H__ */