From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D3A89C433F5 for ; Wed, 24 Nov 2021 15:09:13 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 43A3A6E9D7; Wed, 24 Nov 2021 15:09:13 +0000 (UTC) Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id EAD1A6E9D7; Wed, 24 Nov 2021 15:09:11 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10178"; a="298698319" X-IronPort-AV: E=Sophos;i="5.87,260,1631602800"; d="scan'208";a="298698319" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Nov 2021 07:09:11 -0800 X-IronPort-AV: E=Sophos;i="5.87,260,1631602800"; d="scan'208";a="509904302" Received: from kkarvas-mobl1.amr.corp.intel.com (HELO intel.com) ([10.255.34.182]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 24 Nov 2021 07:09:10 -0800 Date: Wed, 24 Nov 2021 10:09:08 -0500 From: Rodrigo Vivi To: Tvrtko Ursulin Message-ID: References: <20211117224955.28999-1-vinay.belgaumkar@intel.com> <20211117224955.28999-2-vinay.belgaumkar@intel.com> <6e19db61-dee6-f012-9dd4-b8ef455509e7@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [Intel-gfx] [PATCH 1/3] drm/i915/gt: Spread virtual engines over idle engines X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, Chris Wilson Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" On Wed, Nov 24, 2021 at 08:55:39AM -0500, Rodrigo Vivi wrote: > On Wed, Nov 24, 2021 at 08:56:52AM +0000, Tvrtko Ursulin wrote: > > > > On 23/11/2021 19:52, Rodrigo Vivi wrote: > > > On Tue, Nov 23, 2021 at 09:39:25AM +0000, Tvrtko Ursulin wrote: > > > > > > > > On 17/11/2021 22:49, Vinay Belgaumkar wrote: > > > > > From: Chris Wilson > > > > > > > > > > Everytime we come to the end of a virtual engine's context, re-randomise > > > > > it's siblings[]. As we schedule the siblings' tasklets in the order they > > > > > are in the array, earlier entries are executed first (when idle) and so > > > > > will be preferred when scheduling the next virtual request. Currently, > > > > > we only update the array when switching onto a new idle engine, so we > > > > > prefer to stick on the last execute engine, keeping the work compact. > > > > > However, it can be beneficial to spread the work out across idle > > > > > engines, so choose another sibling as our preferred target at the end of > > > > > the context's execution. > > > > > > > > This partially brings back, from a different angle, the more dynamic > > > > scheduling behavior which has been lost since bugfix 90a987205c6c > > > > ("drm/i915/gt: Only swap to a random sibling once upon creation"). > > > > > > Shouldn't we use the Fixes tag here since this is targeting to fix one > > > of the performance regressions of this patch? > > > > Probably not but hard to say. Note that it wasn't a performance regression > > that was reported but power. > > > > And to go back to what we said elsewhere in the thread, I am actually with > > you in thinking that in the ideal world we need PnP testing across a variety > > of workloads and platforms. And "in the ideal world" should really be in the > > normal world. It is not professional to be reactive to isolated bug reports > > from users, without being able to see the overall picture. > > We surely need to address the bug report from users. I'm just asking to address > that with the smallest fix that we can backport and fit to the products milestones. > > Instead, we are creating another optimization feature on a rush. Without a proper > validation. > > I believe it is too risk to add an algorithm like that without a broader test. > I see a big risk of introducing corner cases that will results in more bug report > from other users in a near future. > > So, let's all be professionals and provide a smaller fix for a regression on > the load balancing scenario and provide a better validation with more data > to justify this new feature. Okay, after more IRC discussions I see that patch 2 is also part of the solution and probably safe. Let me be clear that my biggest complain and the risk is with race-to-idle in patch 3 on trying to predict the rc6 behavior and increasing the freq to try to complete job faster and then get to rc6 faster... That one would need a lot more validation. > > Thanks, > Rodrigo. > > > > > > > One day we could experiment with using engine busyness as criteria (instead > > > > of random). Back in the day busyness was kind of the best strategy, although > > > > sampled at submit, not at the trailing edge like here, but it still may be > > > > able to settle down to engine configuration better in some scenarios. Only > > > > testing could say. > > > > > > > > Still, from memory random also wasn't that bad so this should be okay for > > > > now. > > > > > > > > Reviewed-by: Tvrtko Ursulin > > > > > > Since you reviewed and it looks to be a middle ground point in terms > > > of when to balancing (always like in the initial implementation vs > > > only once like the in 90a987205c6c). > > > > > > If this one is really fixing the regression by itself: > > > Acked-by: Rodrigo Vivi > > > on this patch here. > > > > > > But I still don't want to take the risk with touching the freq with > > > race to idle, until not convinced that it is absolutely needed and > > > that we are not breaking the world out there. > > > > Yes agreed in principle, we have users with different priorities. > > > > However the RPS patches in the series, definitely the 1st one which looks at > > classes versus individual engines, sound plausible to me. Given the absence > > of automated PnP testing mentioned above, in the past it was usually Chris > > who was making the above and beyond effort to evaluate changes like these on > > as many platforms as he could, and with different workloads. Not sure who > > has the mandate and drive to fill that space but something will need to > > happen. > > > > Regards, > > > > Tvrtko > > > > > > > > > > Regards, > > > > > > > > Tvrtko > > > > > > > > > Signed-off-by: Chris Wilson > > > > > Cc: Vinay Belgaumkar > > > > > Cc: Tvrtko Ursulin > > > > > --- > > > > > .../drm/i915/gt/intel_execlists_submission.c | 80 ++++++++++++------- > > > > > 1 file changed, 52 insertions(+), 28 deletions(-) > > > > > > > > > > diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > index ca03880fa7e4..b95bbc8fb91a 100644 > > > > > --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c > > > > > @@ -539,6 +539,41 @@ static void execlists_schedule_in(struct i915_request *rq, int idx) > > > > > GEM_BUG_ON(intel_context_inflight(ce) != rq->engine); > > > > > } > > > > > +static void virtual_xfer_context(struct virtual_engine *ve, > > > > > + struct intel_engine_cs *engine) > > > > > +{ > > > > > + unsigned int n; > > > > > + > > > > > + if (likely(engine == ve->siblings[0])) > > > > > + return; > > > > > + > > > > > + if (!intel_engine_has_relative_mmio(engine)) > > > > > + lrc_update_offsets(&ve->context, engine); > > > > > + > > > > > + /* > > > > > + * Move the bound engine to the top of the list for > > > > > + * future execution. We then kick this tasklet first > > > > > + * before checking others, so that we preferentially > > > > > + * reuse this set of bound registers. > > > > > + */ > > > > > + for (n = 1; n < ve->num_siblings; n++) { > > > > > + if (ve->siblings[n] == engine) { > > > > > + swap(ve->siblings[n], ve->siblings[0]); > > > > > + break; > > > > > + } > > > > > + } > > > > > +} > > > > > + > > > > > +static int ve_random_sibling(struct virtual_engine *ve) > > > > > +{ > > > > > + return prandom_u32_max(ve->num_siblings); > > > > > +} > > > > > + > > > > > +static int ve_random_other_sibling(struct virtual_engine *ve) > > > > > +{ > > > > > + return 1 + prandom_u32_max(ve->num_siblings - 1); > > > > > +} > > > > > + > > > > > static void > > > > > resubmit_virtual_request(struct i915_request *rq, struct virtual_engine *ve) > > > > > { > > > > > @@ -578,8 +613,23 @@ static void kick_siblings(struct i915_request *rq, struct intel_context *ce) > > > > > rq->execution_mask != engine->mask) > > > > > resubmit_virtual_request(rq, ve); > > > > > - if (READ_ONCE(ve->request)) > > > > > + /* > > > > > + * Reschedule with a new "preferred" sibling. > > > > > + * > > > > > + * The tasklets are executed in the order of ve->siblings[], so > > > > > + * siblings[0] receives preferrential treatment of greedily checking > > > > > + * for execution of the virtual engine. At this point, the virtual > > > > > + * engine is no longer in the current GPU cache due to idleness or > > > > > + * contention, so it can be executed on any without penalty. We > > > > > + * re-randomise at this point in order to spread light loads across > > > > > + * the system, heavy overlapping loads will continue to be greedily > > > > > + * executed by the first available engine. > > > > > + */ > > > > > + if (READ_ONCE(ve->request)) { > > > > > + virtual_xfer_context(ve, > > > > > + ve->siblings[ve_random_other_sibling(ve)]); > > > > > tasklet_hi_schedule(&ve->base.sched_engine->tasklet); > > > > > + } > > > > > } > > > > > static void __execlists_schedule_out(struct i915_request * const rq, > > > > > @@ -1030,32 +1080,6 @@ first_virtual_engine(struct intel_engine_cs *engine) > > > > > return NULL; > > > > > } > > > > > -static void virtual_xfer_context(struct virtual_engine *ve, > > > > > - struct intel_engine_cs *engine) > > > > > -{ > > > > > - unsigned int n; > > > > > - > > > > > - if (likely(engine == ve->siblings[0])) > > > > > - return; > > > > > - > > > > > - GEM_BUG_ON(READ_ONCE(ve->context.inflight)); > > > > > - if (!intel_engine_has_relative_mmio(engine)) > > > > > - lrc_update_offsets(&ve->context, engine); > > > > > - > > > > > - /* > > > > > - * Move the bound engine to the top of the list for > > > > > - * future execution. We then kick this tasklet first > > > > > - * before checking others, so that we preferentially > > > > > - * reuse this set of bound registers. > > > > > - */ > > > > > - for (n = 1; n < ve->num_siblings; n++) { > > > > > - if (ve->siblings[n] == engine) { > > > > > - swap(ve->siblings[n], ve->siblings[0]); > > > > > - break; > > > > > - } > > > > > - } > > > > > -} > > > > > - > > > > > static void defer_request(struct i915_request *rq, struct list_head * const pl) > > > > > { > > > > > LIST_HEAD(list); > > > > > @@ -3590,7 +3614,7 @@ static void virtual_engine_initial_hint(struct virtual_engine *ve) > > > > > * NB This does not force us to execute on this engine, it will just > > > > > * typically be the first we inspect for submission. > > > > > */ > > > > > - swp = prandom_u32_max(ve->num_siblings); > > > > > + swp = ve_random_sibling(ve); > > > > > if (swp) > > > > > swap(ve->siblings[swp], ve->siblings[0]); > > > > > } > > > > >