Subject: Re: [Intel-gfx] [PATCH] drm/i915: Keep all engine locks across scheduling
From: Tvrtko Ursulin
To: Chris Wilson, intel-gfx@lists.freedesktop.org, "# v4 . 10+"
Date: Mon, 27 Mar 2017 12:39:38 +0100
Message-ID: <26f4763c-d603-d0c4-f12f-f6337aecb2f9@linux.intel.com>
In-Reply-To: <20170327103123.GC10606@nuc-i3427.alporthouse.com>
References: <20170326084420.3231-1-chris@chris-wilson.co.uk>
 <20170326084637.13394-1-chris@chris-wilson.co.uk>
 <05a2f5b7-f004-573d-6f22-50ce59d2e62c@linux.intel.com>
 <20170327103123.GC10606@nuc-i3427.alporthouse.com>

On 27/03/2017 11:31, Chris Wilson wrote:
> On Mon, Mar 27, 2017 at 11:11:47AM +0100, Tvrtko Ursulin wrote:
>>
>> On 26/03/2017 09:46, Chris Wilson wrote:
>>> Unlocking is dangerous. In this case we combine an early update to the
>>> out-of-queue request, because we know that it will be inserted into the
>>> correct FIFO priority-ordered slot when it becomes ready in the future.
>>> However, given sufficient enthusiasm, it may become ready as we are
>>> continuing to reschedule, and so may gazump the FIFO if we have since
>>> dropped its spinlock. The result is that it may be executed too early,
>>> before its dependencies.
>>>
>>> Fixes: 20311bd35060 ("drm/i915/scheduler: Execute requests in order of priorities")
>>> Testcase: igt/gem_exec_whisper
>>> Signed-off-by: Chris Wilson
>>> Cc: Tvrtko Ursulin
>>> Cc: # v4.10+
>>> ---
>>>  drivers/gpu/drm/i915/intel_lrc.c | 54 +++++++++++++++++++++++++++-------------
>>>  1 file changed, 37 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>>> index dd0e9d587852..3fdabba0a32d 100644
>>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>>> @@ -658,30 +658,47 @@ static void execlists_submit_request(struct drm_i915_gem_request *request)
>>>  	spin_unlock_irqrestore(&engine->timeline->lock, flags);
>>>  }
>>>
>>> -static struct intel_engine_cs *
>>> -pt_lock_engine(struct i915_priotree *pt, struct intel_engine_cs *locked)
>>> +static inline struct intel_engine_cs *
>>> +pt_lock_engine(struct i915_priotree *pt, unsigned long *locked)
>>>  {
>>> -	struct intel_engine_cs *engine;
>>> -
>>> -	engine = container_of(pt,
>>> -			      struct drm_i915_gem_request,
>>> -			      priotree)->engine;
>>> -	if (engine != locked) {
>>> -		if (locked)
>>> -			spin_unlock_irq(&locked->timeline->lock);
>>> -		spin_lock_irq(&engine->timeline->lock);
>>> -	}
>>> +	struct intel_engine_cs *engine =
>>> +		container_of(pt, struct drm_i915_gem_request, priotree)->engine;
>>> +
>>> +	/* Locking the engines in a random order will rightfully trigger a
>>> +	 * spasm in lockdep. However, we can ignore lockdep (by marking each
>>> +	 * as a separate nesting) so long as we never nest the
>>> +	 * engine->timeline->lock elsewhere. Also the number of nesting
>>> +	 * subclasses is severely limited (7) which is going to cause an
>>> +	 * issue at some point.
>>> +	 * BUILD_BUG_ON(I915_NUM_ENGINES >= MAX_LOCKDEP_SUBCLASSES);
>>
>> Let's bite the bullet and not hide this BUILD_BUG_ON in a comment. :I
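>> Something like this, perhaps (untested sketch, simply hoisting the
>> assert out of the comment, otherwise the function as in your patch):
>>
>> static inline struct intel_engine_cs *
>> pt_lock_engine(struct i915_priotree *pt, unsigned long *locked)
>> {
>> 	struct intel_engine_cs *engine =
>> 		container_of(pt, struct drm_i915_gem_request, priotree)->engine;
>>
>> 	/* spin_lock_nested() can only express a handful of subclasses,
>> 	 * so fail the build once the engine count outgrows them.
>> 	 */
>> 	BUILD_BUG_ON(I915_NUM_ENGINES >= MAX_LOCKDEP_SUBCLASSES);
>>
>> 	if (!__test_and_set_bit(engine->id, locked))
>> 		spin_lock_nested(&engine->timeline->lock,
>> 				 hweight_long(*locked));
>>
>> 	return engine;
>> }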
>
> The code would continue to work nevertheless, just lockdep would
> eventually give up. I like it slightly better than taking either a
> global spinlock for engine->execlists_queue insertion, or taking the
> spinlock on every engine for scheduling. How often will we reschedule
> across engines?

Not sure. I think counting on "doesn't happen often" and "it still
works" falls short of your high standards! ;) So a global
execlist_queue lock then, if it must be (rough sketch below).

>>> + */
>>> +	if (!__test_and_set_bit(engine->id, locked))
>>> +		spin_lock_nested(&engine->timeline->lock,
>>> +				 hweight_long(*locked));
>>>
>>>  	return engine;
>>>  }
>>>
>>> +static void
>>> +unlock_engines(struct drm_i915_private *i915, unsigned long locked)
>>> +{
>>> +	struct intel_engine_cs *engine;
>>> +	unsigned long tmp;
>>> +
>>> +	for_each_engine_masked(engine, i915, locked, tmp)
>>> +		spin_unlock(&engine->timeline->lock);
>>> +}
>>> +
>>>  static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>>>  {
>>> -	struct intel_engine_cs *engine = NULL;
>>> +	struct intel_engine_cs *engine;
>>>  	struct i915_dependency *dep, *p;
>>>  	struct i915_dependency stack;
>>> +	unsigned long locked = 0;
>>>  	LIST_HEAD(dfs);
>>>
>>> +	BUILD_BUG_ON(I915_NUM_ENGINES > BITS_PER_LONG);
>>> +
>>>  	if (prio <= READ_ONCE(request->priotree.priority))
>>>  		return;
>>>
>>> @@ -691,6 +708,9 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>>>  	stack.signaler = &request->priotree;
>>>  	list_add(&stack.dfs_link, &dfs);
>>>
>>> +	GEM_BUG_ON(irqs_disabled());
>>> +	local_irq_disable();
>>> +
>>
>> Why not just irqsave/restore? This sounds too low level for this
>> position in the flow. If it is just an optimisation it would need a
>> comment, I think.
>
> It was because we are not taking the spin lock/unlock inside the same
> block, so it felt dangerous. Who holds the irqflags?

Hm yes, it cannot be made elegant.

>>>  	/* Recursively bump all dependent priorities to match the new request.
>>>  	 *
>>>  	 * A naive approach would be to use recursion:
>>> @@ -719,7 +739,7 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>>>  		if (!RB_EMPTY_NODE(&pt->node))
>>>  			continue;
>>>
>>> -		engine = pt_lock_engine(pt, engine);
>>> +		engine = pt_lock_engine(pt, &locked);
>>>
>>>  		/* If it is not already in the rbtree, we can update the
>>>  		 * priority inplace and skip over it (and its dependencies)
>>> @@ -737,7 +757,7 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>>>
>>>  		INIT_LIST_HEAD(&dep->dfs_link);
>>>
>>> -		engine = pt_lock_engine(pt, engine);
>>> +		engine = pt_lock_engine(pt, &locked);
>>>
>>>  		if (prio <= pt->priority)
>>>  			continue;
>>> @@ -750,8 +770,8 @@ static void execlists_schedule(struct drm_i915_gem_request *request, int prio)
>>>  			engine->execlist_first = &pt->node;
>>>  	}
>>>
>>> -	if (engine)
>>> -		spin_unlock_irq(&engine->timeline->lock);
>>> +	unlock_engines(request->i915, locked);
>>> +	local_irq_enable();
>>>
>>>  	/* XXX Do we need to preempt to make room for us and our deps? */
>>>  }
>>>
>>
>> I am trying to think whether removing the skip on requests not in
>> the execution tree would work and help any.
>
> It's dangerous due to the duplicate branches in the dependency graph that
> we are resolving to generate the topological ordering. We need a way to
> do a mark-and-sweep whilst also ensuring that we end up with the correct
> order. I'm open to (better :) suggestions.
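For reference, the global lock I keep mentioning would be something
along these lines (hand-wavy sketch only; the lock and its home in
dev_priv are invented for illustration, nothing here is tested):

static void execlists_schedule(struct drm_i915_gem_request *request,
			       int prio)
{
	struct drm_i915_private *i915 = request->i915;

	if (prio <= READ_ONCE(request->priotree.priority))
		return;

	/* Hypothetical single lock serialising every engine's
	 * execlist_queue rbtree, replacing the per-engine
	 * timeline locks for queue manipulation.
	 */
	spin_lock_irq(&i915->execlist_queue_lock);

	/* ... the existing DFS over the priotree goes here, minus
	 * the pt_lock_engine() dance, since one lock now covers
	 * all engines ...
	 */

	spin_unlock_irq(&i915->execlist_queue_lock);
}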
>
>> Or whether the above scheme is completely safe, or we would need to
>> atomically lock all the engines whose requests will be touched.
>> Especially since the code is only dealing with adjusting the
>> priorities, so I don't immediately see how it can cause out-of-order
>> execution.
>
> An interrupt leading to submit_request(), which wants to then insert a
> request into the execlist_queue rbtree, vs ->schedule() also trying to
> manipulate the rbtree (and in this case elements currently outside of
> the rbtree). Our insertion into the rbtree ensures FIFO order so that
> we don't reorder dependencies of equivalent priority during
> ->schedule(). Hence, if we mark an out-of-rbtree request as higher
> priority before inserting all of its dependencies into the tree, and
> the submit_notify occurs in the meantime, it will insert the request
> into the tree before we get to insert its dependencies, hence
> reordering.

OK, I get the general idea. I don't have any better suggestions at the
moment than trying the global lock. Luckily you have just removed one
atomic from the irq handler, so one step forward, two steps back. :)

Regards,

Tvrtko
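P.S. Just to double-check I have the failure mode right, the
interleaving would be something like this (my own paraphrase of the
above, not taken from the patch):

  ->schedule() on CPU0                  irq -> submit_request() on CPU1

  bump priority of request A
  (A not yet in the rbtree,
   engine lock not held)
                                        A becomes ready and is inserted
                                        into the rbtree at its new,
                                        higher priority
  insert A's dependencies
                                        -> A can now execute ahead of
                                           the requests it depends on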