Re: [PATCH v9 4/6] drm/i915: Interrupt driven fences

From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: John Harrison <John.C.Harrison@Intel.com>,
	Maarten Lankhorst <maarten.lankhorst@linux.intel.com>,
	Intel-GFX@Lists.FreeDesktop.Org
Subject: Re: [PATCH v9 4/6] drm/i915: Interrupt driven fences
Date: Tue, 14 Jun 2016 12:35:30 +0100	[thread overview]
Message-ID: <575FEC02.9020306@linux.intel.com> (raw)
In-Reply-To: <b667a9b7-f3a4-f4ff-9da3-a9d6a09e66ac@Intel.com>

On 13/06/16 16:51, John Harrison wrote:

[snip]

>>>> diff --git a/drivers/gpu/drm/i915/i915_dma.c
>>>> b/drivers/gpu/drm/i915/i915_dma.c
>>>> index 07edaed..f8f60bb 100644
>>>> --- a/drivers/gpu/drm/i915/i915_dma.c
>>>> +++ b/drivers/gpu/drm/i915/i915_dma.c
>>>> @@ -1019,9 +1019,13 @@ static int i915_workqueues_init(struct
>>>> drm_i915_private *dev_priv)
>>>>        if (dev_priv->wq == NULL)
>>>>            goto out_err;
>>>>
>>>> +    dev_priv->req_wq = alloc_ordered_workqueue("i915-rq", 0);
>>>> +    if (dev_priv->req_wq == NULL)
>>>> +        goto out_free_wq;
>>>> +
>>> Single (per-device) ordered workqueue will serialize interrupt
>>> processing across all engines to one thread. Together with the fact
>>> request worker does not seem to need the sleeping context, I am
>>> thinking that a tasklet per engine would be much better (see
>>> engine->irq_tasklet for an example).
> Other conversations have stated that tasklets are not the best option. I
> did think about having a work queue per engine but that seemed
> excessive. Plus any subsequent work triggered by the fence completion
> will almost certainly require grabbing the driver mutex lock (because
> everything requires the mutex lock) so serialisation of engines doesn't
> sound like much of an issue.

In this patch AFAICS i915_gem_request_worker calls 
i915_gem_request_notify on the engine which only holds the 
engine->fence_lock. So if you had a per engine wq all engines could do 
that processing in parallel.

It is only a matter of when fence_signal_locked wakes up the waiters 
that they might go and hammer on struct_mutex, but for any work they 
want to do between waking up and submitting new work, serializing via a 
single wq will be bad for latency.

>>>
>>>> +        ret = __i915_spin_request(req, state);
>>>> +        if (ret == 0)
>>>> +            goto out;
>>>>        }
>>>>
>>>> +    /*
>>>> +     * Enable interrupt completion of the request.
>>>> +     */
>>>> +    fence_enable_sw_signaling(&req->fence);
>>>> +
>>>>        for (;;) {
>>>>            struct timer_list timer;
>>>>
>>>> @@ -1306,6 +1306,21 @@ int __i915_wait_request(struct
>>>> drm_i915_gem_request *req,
>>>>                break;
>>>>            }
>>>>
>>>> +        if (req->seqno) {
>>>> +            /*
>>>> +             * There is quite a lot of latency in the user interrupt
>>>> +             * path. So do an explicit seqno check and potentially
>>>> +             * remove all that delay.
>>>> +             */
>>>> +            if (req->engine->irq_seqno_barrier)
>>>> +                req->engine->irq_seqno_barrier(req->engine);
>>>> +            seqno = engine->get_seqno(engine);
>>>> +            if (i915_seqno_passed(seqno, req->seqno)) {
>>>> +                ret = 0;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>>            if (signal_pending_state(state, current)) {
>>>>                ret = -ERESTARTSYS;
>>>>                break;
>>>> @@ -1332,14 +1347,32 @@ int __i915_wait_request(struct
>>>> drm_i915_gem_request *req,
>>>>                destroy_timer_on_stack(&timer);
>>>>            }
>>>>        }
>>>> -    if (!irq_test_in_progress)
>>>> -        engine->irq_put(engine);
>>>>
>>>>        finish_wait(&engine->irq_queue, &wait);
>>> Hm I don't understand why our custom waiting remains? Shouldn't
>>> fence_wait just be called after the optimistic spin, more or less?
> That would solve the 'thundering problem' if we could. In theory, the
> entire wait function should just be a call to 'fence_wait(&req->fence)'.
> Unfortunately, the wait function goes to sleep holding the mutex lock
> and requires having a bail out option on the wait which is not currently
> part of the fence API. I have a work-in-progress patch that almost
> solves the issues but it isn't quite there yet (and I haven't had much
> chance to work on it for a while).

[Btw does it also mean i915 can not handle the incoming 3rd party fences?]

So the userspace waiters can either wait on a dma-buf or i915 wq, 
depending on what ioctl they have called. Then on interrupt i915 waiters 
are woken directly, while the fence api ones go via a worker.

i915 waiters also do a second wake up the fence API waiters by...

>>>
>>>>    out:
>>>>        trace_i915_gem_request_wait_end(req);
>>>>
>>>> +    if ((ret == 0) && (req->seqno)) {
>>>> +        if (req->engine->irq_seqno_barrier)
>>>> +            req->engine->irq_seqno_barrier(req->engine);
>>>> +        seqno = engine->get_seqno(engine);
>>>> +        if (i915_seqno_passed(seqno, req->seqno) &&
>>>> +            !i915_gem_request_completed(req)) {
>>>> +            /*
>>>> +             * Make sure the request is marked as completed before
>>>> +             * returning. NB: Need to acquire the spinlock around
>>>> +             * the whole call to avoid a race condition with the
>>>> +             * interrupt handler is running concurrently and could
>>>> +             * cause this invocation to early exit even though the
>>>> +             * request has not actually been fully processed yet.
>>>> +             */
>>>> +            spin_lock_irq(&req->engine->fence_lock);
>>>> +            i915_gem_request_notify(req->engine, true);

... this call here.

If there are both classes of waiters one of those calls will be a waste 
of cycles (spin lock, irq toggle, list iter, coherent seqno read) correct?

>>>> +            spin_unlock_irq(&req->engine->fence_lock);
>>>> +        }
>>>> +    }
>>>> +
>>>>        if (timeout) {
>>>>            s64 tres = *timeout - (ktime_get_raw_ns() - before);
>>>>
>>>> @@ -1405,6 +1438,11 @@ static void i915_gem_request_retire(struct
>>>> drm_i915_gem_request *request)
>>>>    {
>>>>        trace_i915_gem_request_retire(request);
>>>>
>>>> +    if (request->irq_enabled) {
>>>> +        request->engine->irq_put(request->engine);
>>>> +        request->irq_enabled = false;
>>> What protects request->irq_enabled? Here versus enable_signalling
>>> bit? It can be called from the external fence users which would take
>>> the fence_lock, but here it does not.
> The flag can only be set when enabling interrupt driven completion
> (which can only happen once and only if the fence is not already
> signalled). The flag can only be cleared when the fence is signalled or
> when the request is retired. And retire without signal can only happen
> if the request is being cancelled in some way (e.g. GPU reset) and thus
> will not ever be signalled. So if we get here then none of the other
> paths are possible anymore.

Couldn't retire without signal happen via execbuf paths calling 
i915_gem_retire_requests_ring?

>>>> +        i915_gem_request_enable_interrupt(req, true);
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +/**
>>>> + * i915_gem_request_worker - request work handler callback.
>>>> + * @work: Work structure
>>>> + * Called in response to a seqno interrupt to process the completed
>>>> requests.
>>>> + */
>>>> +void i915_gem_request_worker(struct work_struct *work)
>>>> +{
>>>> +    struct intel_engine_cs *engine;
>>>> +
>>>> +    engine = container_of(work, struct intel_engine_cs, request_work);
>>>> +    i915_gem_request_notify(engine, false);
>>>> +}
>>>> +
>>>> +void i915_gem_request_notify(struct intel_engine_cs *engine, bool
>>>> fence_locked)
>>>> +{
>>>> +    struct drm_i915_gem_request *req, *req_next;
>>>> +    unsigned long flags;
>>>>        u32 seqno;
>>>>
>>>> -    seqno = req->engine->get_seqno(req->engine);
>>>> +    if (list_empty(&engine->fence_signal_list))
>>> Okay this without the lock still makes me nervous. I'd rather not
>>> having to think about why it is safe and can't miss a wakeup.
> I don't see how list_empty() can return a false negative. Even if the
> implementation was such that it could see a partially updated state
> across multiple memory accesses, that will just lead to it thinking
> not-empty which is fine. Any update which takes it from empty to
> not-empty is guaranteed to occur before the act of enabling interrupts
> and thus before notify() can be called. So while it could potentially do
> the full processing when an early exit was fine, it can never early exit
> when it needs to do something.

Something like that sounds like a good comment to put above then! :)

Regards,

Tvrtko

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx