Re: [PATCH 08/15] drm/i915: Slaughter the thundering i915_wait_request herd

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org, "Gong,
	Zhipeng" <zhipeng.gong@intel.com>,
	"Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Subject: Re: [PATCH 08/15] drm/i915: Slaughter the thundering i915_wait_request herd
Date: Mon, 30 Nov 2015 13:32:16 +0000	[thread overview]
Message-ID: <565C4FE0.8000403@linux.intel.com> (raw)
In-Reply-To: <20151130123029.GA14582@nuc-i3427.alporthouse.com>

On 30/11/15 12:30, Chris Wilson wrote:
> On Mon, Nov 30, 2015 at 12:05:50PM +0000, Tvrtko Ursulin wrote:
>>> +	struct intel_breadcrumbs {
>>> +		spinlock_t lock; /* protects the per-engine request list */
>>
>> s/list/tree/
>
> list. We use the tree as a list!

Maybe but it confuses the reader a bit who goes and looks for some list 
then.

Related, when I was playing with this I was piggy backing on the 
existing per engine request list.

If they are seqno ordered there, and I think they are, then the waker 
thread just traverses it in order until hitting the not completed one, 
waking up every req->wait_queue as it goes.

In this case it doesn't matter that the waiters come in in 
indeterministic order so we don't need a (new) tree.

But the downside is the thread needs struct mutex. And that the 
retirement from that list is so lazy..

>>> +		struct task_struct *task; /* bh for all user interrupts */
>>> +		struct intel_breadcrumbs_engine {
>>> +			struct rb_root requests; /* sorted by retirement */
>>> +			struct drm_i915_gem_request *first;
>>
>> /* First request in the tree to be completed next by the hw. */
>>
>> ?
>
> /* first */ ?

Funny? :) First what? Why make the reader reverse engineer it?

>>> +		} engine[I915_NUM_RINGS];
>>> +	} breadcrumbs;
>>> +
>>>   	struct i915_virtual_gpu vgpu;
>>>
>>>   	struct intel_guc guc;
>>> @@ -2228,6 +2250,11 @@ struct drm_i915_gem_request {
>>>   	/** global list entry for this request */
>>>   	struct list_head list;
>>>
>>> +	/** List of clients waiting for completion of this request */
>>> +	wait_queue_head_t wait;
>>> +	struct rb_node irq_node;
>>> +	unsigned irq_count;
>>
>> To me irq prefix does not fit here that well.
>>
>> req_tree_node?
>>
>> waiter_count, waiters?
>
> waiters, breadcrumb_node, waiters_count ?

Deal.

>>>   	for (;;) {
>>> -		struct timer_list timer;
>>> -
>>> -		prepare_to_wait(&ring->irq_queue, &wait, state);
>>> +		prepare_to_wait(&req->wait, &wait, state);
>>>
>>> -		/* We need to check whether any gpu reset happened in between
>>> -		 * the caller grabbing the seqno and now ... */
>>> -		if (req->reset_counter != i915_reset_counter(&dev_priv->gpu_error)) {
>>> -			/* As we do not requeue the request over a GPU reset,
>>> -			 * if one does occur we know that the request is
>>> -			 * effectively complete.
>>> -			 */
>>> -			ret = 0;
>>> -			break;
>>> -		}
>>> -
>>> -		if (i915_gem_request_completed(req, false)) {
>>> +		if (i915_gem_request_completed(req, true) ||
>>
>> Why it is OK to change the lazy coherency mode here?
>
> We bang on the coherency after the IRQ. Here, we are woken by the IRQ
> waiter so we know that the IRQ/seqno coherency is fine and we can just
> check the CPU cache.

Makes sense, put into comment? There is space now since function is much 
shorter. :)

>>> +		    req->reset_counter != i915_reset_counter(&req->i915->gpu_error)) {
>>
>> It looks like it would be valuable to preserve the comment about no
>> re-queuing etc on GPU reset.
>
> I can drop it from the previous patch :)

No it is useful!

>>> +		if (timeout) {
>>> +			timeout_remain = io_schedule_timeout(timeout_remain);
>>> +			if (timeout_remain == 0) {
>>> +				ret = -ETIME;
>>> +				break;
>>> +			}
>>> +		} else
>>> +			io_schedule();
>>
>> Could set timeout_remain to that MAX value when timeout == NULL and
>> have a single call site to io_schedule_timeout and less indentation.
>> It doesn't matter hugely, maybe it would be a bit easier to read.
>
> Done. io_schedule() calls io_schedule_timeout(), whereas
> schedule_timeout() calls schedule()!

Who disagreed with that? :) Ok!

>>> @@ -900,7 +901,7 @@ static void i915_record_ring_state(struct drm_device *dev,
>>>   		ering->instdone = I915_READ(GEN2_INSTDONE);
>>>   	}
>>>
>>> -	ering->waiting = waitqueue_active(&ring->irq_queue);
>>> +	ering->waiting = READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first);
>>
>> What does the READ_ONCE do here?
>
> This element is protected by the breadcrumbs spinlock, the READ_ONCE is
> documentation that we don't care about the lock here.

Hm I find it confusing but OK.

>>>   static void notify_ring(struct intel_engine_cs *ring)
>>>   {
>>> -	if (!intel_ring_initialized(ring))
>>> +	if (ring->i915 == NULL)
>>>   		return;
>>>
>>>   	trace_i915_gem_request_notify(ring);
>>> -
>>> -	wake_up_all(&ring->irq_queue);
>>> +	wake_up_process(ring->i915->breadcrumbs.task);
>>
>> For a later extension:
>>
>> How would you view, some time in the future, adding a bool return to
>> ring->put_irq() which would say to the thread that it failed to
>> disable interrupts?
>>
>> In this case thread would exit and would need to be restarted when
>> the next waiter queues up. Possibly extending the
>> idle->put_irq->exit durations as well then.
>
> Honestly, not keen on the idea that the lowlevel put_irq can fail (makes
> cleanup much more fraught). I don't see what improvement you intend
> here, maybe clarifying that would help explain what you mean.

Don't know, maybe reflecting the fact it wasn't the last put_irq call so 
let the caller handle it if they want. We can leave this discussion for 
the future.

>>> @@ -2986,16 +2981,16 @@ static void i915_hangcheck_elapsed(struct work_struct *work)
>>>   			if (ring_idle(ring, seqno)) {
>>>   				ring->hangcheck.action = HANGCHECK_IDLE;
>>>
>>> -				if (waitqueue_active(&ring->irq_queue)) {
>>> +				if (READ_ONCE(dev_priv->breadcrumbs.engine[ring->id].first)) {
>>
>> READ_ONCE is again strange, but other than that I don't know the
>> hangcheck code to really evaulate it.
>>
>> Perhaps this also means this line needs a comment describing what
>> condition/state is actually approximating with the check.
>
> All we are doing is asking if anyone is waiting on the GPU, because the
> GPU has finished all of its work. If so, the IRQ handling is suspect
> (modulo the pre-existing race condition clearly earmarked by READ_ONCE).

Cool, /* Check if someone is waiting on the GPU */ then would make me happy.

>>> +	while (!kthread_should_stop()) {
>>> +		unsigned reset_counter = i915_reset_counter(&i915->gpu_error);
>>> +		unsigned long timeout_jiffies;
>>> +		bool idling = false;
>>> +
>>> +		/* On every tick, walk the seqno-ordered list of requests
>>> +		 * and for each retired request wakeup its clients. If we
>>> +		 * find an unfinished request, go back to sleep.
>>> +		 */
>>
>> s/tick/loop/ ?
> s/tick/iteration/ I'm sticking with tick

Tick makes me tick of a scheduler tick so to me it is the worst of the 
three. Iteration sounds really good. Whether that will be a 
tick/jiffie/orsomething is definite a bit lower in the code.

>> And if we don't find and unfinished request still go back to sleep. :)
>
> Ok.
>
>>> +		set_current_state(TASK_INTERRUPTIBLE);
>>> +
>>> +		/* Note carefully that we do not hold a reference for the
>>> +		 * requests on this list. Therefore we only inspect them
>>> +		 * whilst holding the spinlock to ensure that they are not
>>> +		 * freed in the meantime, and the client must remove the
>>> +		 * request from the list if it is interrupted (before it
>>> +		 * itself releases its reference).
>>> +		 */
>>> +		spin_lock(&b->lock);
>>
>> This lock is more per engine that global in its nature unless you
>> think it was more efficient to do fewer lock atomics vs potential
>> overlap of waiter activity per engines?
>
> Indeed. I decided not to care about making it per-engine.
>
>> Would need a new lock for req->irq_count, per request. So
>> req->lock/req->irq_count and be->lock for the tree operations.
>
> Why? A request is tied to an individual engine, therefore we can use
> that breadcrumb_engine.lock for the request.

Just so 2nd+ waiters would not touch the per engine lock.

>> Thread would only need to deal with the later. Feels like it would
>> work, just not sure if it is worth it.
>
> That was my feeling as well. Not sure if we will have huge contention
> that would be solved with per-engine spinlocks, whilst we still have
> struct_mutex.
>
>>> +		for (i = 0; i < I915_NUM_RINGS; i++) {
>>> +			struct intel_engine_cs *engine = &i915->ring[i];
>>> +			struct intel_breadcrumbs_engine *be = &b->engine[i];
>>> +			struct drm_i915_gem_request *request = be->first;
>>> +
>>> +			if (request == NULL) {
>>> +				if ((irq_get & (1 << i))) {
>>> +					if (irq_enabled & (1 << i)) {
>>> +						engine->irq_put(engine);
>>> +						irq_enabled &= ~ (1 << i);
>>> +					}
>>> +					intel_runtime_pm_put(i915);
>>> +					irq_get &= ~(1 << i);
>>> +				}
>>> +				continue;
>>> +			}
>>> +
>>> +			if ((irq_get & (1 << i)) == 0) {
>>> +				intel_runtime_pm_get(i915);
>>> +				irq_enabled |= __irq_enable(engine) << i;
>>> +				irq_get |= 1 << i;
>>> +			}
>>
>> irq_get and irq_enabled are creating a little bit of a headache for
>> me. And that would probably mean a comment is be in order to explain
>> what they are for and how they work.
>>
>> Like this I don't see the need for irq_get to persist across loops?
>
> Because getting the irq is quite slow, we add a jiffie shadow.

But it is not a shadow in the sense of the same bitmask from the 
previous iteration. The bits are different depending on special cases in 
__irq_enable.

So it needs some comments I think.

>> irq_get looks like "try to get these", irq_enabled is "these ones I
>> got". Apart for the weird usage of it to balance the runtime pm gets
>> and puts.
>>
>> A bit confusing as I said.
>
> I don't like the names, but haven't thought of anything better.
>
>>> +bool intel_breadcrumbs_add_waiter(struct drm_i915_gem_request *request)
>>> +{
>>> +	struct intel_breadcrumbs *b = &request->i915->breadcrumbs;
>>> +	bool first = false;
>>> +
>>> +	spin_lock(&b->lock);
>>> +	if (request->irq_count++ == 0) {
>>> +		struct intel_breadcrumbs_engine *be =
>>> +			&b->engine[request->ring->id];
>>> +		struct rb_node **p, *parent;
>>> +
>>> +		if (be->first == NULL)
>>> +			wake_up_process(b->task);
>>
>> With a global b->lock
>
> or even per engine
>
>> it makes no difference but in the logical
>> sense this would usually go last, after the request has been
>> inserted into the tree and waiter initialized.
>
> Except that the code is simpler with kicking the task first.
>
> Since we both thought it might make sense with a per-engine lock, let's
> go with it. Maybe one day it will pay off.

Ok.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx