From: Francisco Jerez <currojerez@riseup.net>
To: chris.p.wilson@intel.com, intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH] drm/i915/execlists: Pull tasklet interrupt-bh local to direct submission
Date: Mon, 23 Mar 2020 15:30:13 -0700
Message-ID: <87imiu4scq.fsf@riseup.net>
In-Reply-To: <158495673843.17851.11761890199116661145@build.alporthouse.com>


Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2020-03-20 22:14:51)
>> Francisco Jerez <currojerez@riseup.net> writes:
>> 
>> > Chris Wilson <chris@chris-wilson.co.uk> writes:
>> >
>> >> We dropped calling process_csb prior to handling direct submission in
>> >> order to avoid the nesting of spinlocks and lift process_csb() and the
>> >> majority of the tasklet out of irq-off. However, we do want to avoid
>> >> ksoftirqd latency in the fast path, so try and pull the interrupt-bh
>> >> local to direct submission if we can acquire the tasklet's lock.
>> >>
>> >> v2: Tweak the balance to avoid over submitting lite-restores
>> >>
>> >> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> >> Cc: Francisco Jerez <currojerez@riseup.net>
>> >> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> >> ---
>> >>  drivers/gpu/drm/i915/gt/intel_lrc.c    | 44 ++++++++++++++++++++------
>> >>  drivers/gpu/drm/i915/gt/selftest_lrc.c |  2 +-
>> >>  2 files changed, 36 insertions(+), 10 deletions(-)
>> >>
>> >> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> >> index f09dd87324b9..dceb65a0088f 100644
>> >> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>> >> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>> >> @@ -2884,17 +2884,17 @@ static void queue_request(struct intel_engine_cs *engine,
>> >>      set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
>> >>  }
>> >>  
>> >> -static void __submit_queue_imm(struct intel_engine_cs *engine)
>> >> +static bool pending_csb(const struct intel_engine_execlists *el)
>> >>  {
>> >> -    struct intel_engine_execlists * const execlists = &engine->execlists;
>> >> +    return READ_ONCE(*el->csb_write) != READ_ONCE(el->csb_head);
>> >> +}
>> >>  
>> >> -    if (reset_in_progress(execlists))
>> >> -            return; /* defer until we restart the engine following reset */
>> >> +static bool skip_lite_restore(struct intel_engine_execlists *el,
>> >> +                          const struct i915_request *rq)
>> >> +{
>> >> +    struct i915_request *inflight = execlists_active(el);
>> >>  
>> >> -    if (execlists->tasklet.func == execlists_submission_tasklet)
>> >> -            __execlists_submission_tasklet(engine);
>> >> -    else
>> >> -            tasklet_hi_schedule(&execlists->tasklet);
>> >> +    return inflight && inflight->context == rq->context;
>> >>  }
>> >>  
>> >>  static void submit_queue(struct intel_engine_cs *engine,
>> >> @@ -2905,8 +2905,34 @@ static void submit_queue(struct intel_engine_cs *engine,
>> >>      if (rq_prio(rq) <= execlists->queue_priority_hint)
>> >>              return;
>> >>  
>> >> +    if (reset_in_progress(execlists))
>> >> +            return; /* defer until we restart the engine following reset */
>> >> +
>> >> +    /*
>> >> +     * Suppress immediate lite-restores, leave that to the tasklet.
>> >> +     *
>> >> +     * However, we leave the queue_priority_hint unset so that if we do
>> >> +     * submit a second context, we push that into ELSP[1] immediately.
>> >> +     */
>> >> +    if (skip_lite_restore(execlists, rq))
>> >> +            return;
>> >> +
>> > Why do you need to treat lite-restore specially here?
>
> Lite-restores have a noticeable impact on no-op loads. A part of that is
> that a lite-restore is about 1us, and the other part is that the driver
> has a lot more work to do. There's a balance point around here for not
> needlessly interrupting ourselves and ensuring that there is no bubble.
>

Oh, I see.  But isn't inhibiting the lite-restore likely to be fairly
costly in some cases as well, if it causes the GPU to go idle after the
current context completes for as long as it takes the CPU to wake up,
process the IRQ and dequeue the next request?  Would it make sense to
inhibit lite-restore under roughly the same conditions in which I set
the overload flag?  That flag indicates we'll get an IRQ at least one
request *before* the GPU actually goes idle, so there shouldn't be any
penalty from inhibiting lite-restore.
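
A minimal sketch of that condition, to make the suggestion concrete.
Everything named sketch_* is a hypothetical stand-in for illustration
(these are not the real i915 structures), and the overload field is
assumed to mirror the flag from my series:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real i915 types. */
struct sketch_request {
	const void *context;	/* identifies the owning context */
};

struct sketch_engine {
	const struct sketch_request *inflight;	/* request on the GPU, or NULL */
	bool overload;	/* assumed: an IRQ will arrive before the GPU idles */
};

/*
 * Skip the immediate lite-restore only when the new request belongs to
 * the inflight context *and* overload is signalled.  Without overload
 * we submit directly, so the GPU is not left waiting on the tasklet.
 */
static bool sketch_skip_lite_restore(const struct sketch_engine *e,
				     const struct sketch_request *rq)
{
	const struct sketch_request *inflight = e->inflight;

	if (!inflight || inflight->context != rq->context)
		return false;	/* not a lite-restore at all */

	return e->overload;
}
```

Compared to the skip_lite_restore() in the patch, the only difference is
the final overload check; whether that is the right trade-off is exactly
what I'm asking about.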

>> >
>> > Anyway, trying this out in combination with my patches now.
>> >
>> 
>> This didn't seem to help (together with your other suggestion to move
>> the overload accounting to __execlists_schedule_in/out).  And it makes
>> the current -5% SynMark OglMultithread regression with my series
>> worsen to -10%.  My previous suggestion of moving the
>> intel_gt_pm_active_begin() call to process_csb() when the submission is
>> ACK'ed by the hardware does seem to help (and it roughly halves the
>> OglMultithread regression), possibly because that way we're able to
>> determine whether the first context was actually overlapping at the
>> point that the second was received by the hardware -- I haven't tested
>> it extensively yet though.
>
> Grumble, it just seems like we are setting and clearing the flag on
> completely unrelated events -- which I still think boils down to working
> around latency in the driver. Or at least I hope there's an explanation
> and bug to fix that improves responsiveness for all.
> -Chris

There *might* be a performance issue somewhere introducing latency that
the instrumentation I added happens to mitigate, but isn't that a sign
that it's fulfilling its purpose of determining when the workload could
be sensitive to CPU latency?

Maybe I didn't explain the idea properly: Given that command submission
is asynchronous with the CS processing the previous context, there is no
way for us to tell whether a request we just submitted was actually
overlapping with the previous one until we read the CSB and see whether
it led to an idle-to-active transition.  Only then can we assume that
the CPU is sending commands to the GPU quickly enough to keep it busy
without interruption.
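
As a rough illustration of the idea (a toy model, not driver code): the
event names, the tracker struct and its gpu_busy field below are all
assumptions made for the sketch.  Overload is only decided when the ACK
is processed, based on whether the engine was still busy at that point:

```c
#include <stdbool.h>

/* Hypothetical, simplified CSB events; the real CSB encodes more. */
enum sketch_csb_event {
	SKETCH_CSB_ACK,		/* hardware accepted a new submission */
	SKETCH_CSB_IDLE,	/* last context completed, engine went idle */
};

struct sketch_tracker {
	bool gpu_busy;	/* engine was executing a context */
	bool overload;	/* CPU is keeping the GPU busy without gaps */
};

static void sketch_process_csb(struct sketch_tracker *t,
			       enum sketch_csb_event ev)
{
	switch (ev) {
	case SKETCH_CSB_ACK:
		/*
		 * An ACK while already busy means the new request
		 * overlapped with the previous one.  An ACK from idle is
		 * an idle-to-active transition: no overlap, no overload.
		 */
		t->overload = t->gpu_busy;
		t->gpu_busy = true;
		break;
	case SKETCH_CSB_IDLE:
		t->gpu_busy = false;
		t->overload = false;
		break;
	}
}
```

Feeding it ACK, ACK, IDLE, ACK leaves overload set only after the second
ACK, which is the only submission that actually overlapped.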

You might argue that this will introduce a delay in the signalling of
overload roughly equal to the time it takes for the CPU to receive the
execlists IRQ with the hardware ACK.  However, that seems beneficial:
the clearing of overload suffers from the same latency, so without the
delay the fraction of time that overload is signalled would be biased
by the latency difference, causing overload to be overreported on
average.  Delaying the signalling of overload to the CSB handler means
that any systematic latency in our interrupt processing is
self-correcting.
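
The arithmetic behind that can be sketched as follows.  This is a toy
calculation with made-up numbers, not a measurement; the function name
and the assumption that signalling at submission time has zero latency
are mine:

```c
/*
 * Duration for which overload is reported, given the true overload
 * interval and the latencies with which the flag is set and cleared.
 * All values are in microseconds.
 */
static long sketch_reported_overload_us(long true_start_us,
					long true_end_us,
					long set_latency_us,
					long clear_latency_us)
{
	return (true_end_us + clear_latency_us) -
	       (true_start_us + set_latency_us);
}
```

With a true overload interval of 100us and an IRQ latency of 20us,
setting the flag at submission (latency 0) but clearing it from the IRQ
reports 120us, while setting and clearing both from the CSB handler
reports 100us: the common latency cancels out.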

Anyway, I'm open to other ideas, as long as they at least don't worsen
the pre-existing regression from my series.


