From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Subject: Re: [Intel-gfx] [PATCH 01/31] drm/i915/gt: Ratelimit heartbeat completion probing
Date: Tue, 09 Feb 2021 19:52:27 +0200 [thread overview]
Message-ID: <87wnvht9gk.fsf@gaia.fi.intel.com> (raw)
In-Reply-To: <20210208105236.28498-1-chris@chris-wilson.co.uk>
Chris Wilson <chris@chris-wilson.co.uk> writes:
> The heartbeat runs through a few phases that we expect to complete
> within a certain number of heartbeat intervals. First we must submit the
> heartbeat to the queue, and if the queue is occupied it may take a
> couple of intervals before the heartbeat preempts the workload and is
> submitted to HW. Once running on HW, completion is not instantaneous as
> it may have to first reset the current workload before it itself runs
> through the empty request and signals completion. As such, we know that
> the heartbeat must take at least the preempt reset timeout and before we
> have had a chance to reset the engine, we do not want to issue a global
> reset ourselves (simply so that we only try to do one reset at a time
> and not confuse ourselves by resetting twice and hitting an innocent.)
>
> So by taking into consideration that once running the request must take
> a finite amount of time, we can delay the final completion check to
> accommodate that and avoid checking too early (before we've had a chance
> to handle any engine resets required).
>
> v2: Attach a callback to flush the work immediately upon the heartbeat
> completion and insert the delay before the next.
>
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2853
> Suggested-by: CQ Tang <cq.tang@intel.com>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
> .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 95 ++++++++++++++++---
> drivers/gpu/drm/i915/gt/intel_engine_types.h | 1 +
> .../drm/i915/gt/selftest_engine_heartbeat.c | 65 ++++++-------
> drivers/gpu/drm/i915/gt/selftest_execlists.c | 5 +-
> 4 files changed, 117 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> index 0b062fad1837..209a477af412 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> @@ -20,6 +20,18 @@
> * issue a reset -- in the hope that restores progress.
> */
>
> +#define HEARTBEAT_COMPLETION 50u /* milliseconds */
> +
> +static long completion_timeout(const struct intel_engine_cs *engine)
> +{
> + long timeout = HEARTBEAT_COMPLETION;
> +
> + if (intel_engine_has_preempt_reset(engine))
> + timeout += READ_ONCE(engine->props.preempt_timeout_ms);
I was pondering that we don't add any slack here, but the slack is
already in the initial value.
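In other words, a toy model of the arithmetic (the 640ms preempt timeout
below is an assumed value for illustration):

```python
# Toy model of completion_timeout(): the 50ms HEARTBEAT_COMPLETION acts
# as the slack on top of the preempt reset timeout, so no extra slack
# needs to be added when the preempt timeout is folded in.
HEARTBEAT_COMPLETION_MS = 50

def completion_timeout_ms(preempt_timeout_ms, has_preempt_reset=True):
    timeout = HEARTBEAT_COMPLETION_MS
    if has_preempt_reset:
        timeout += preempt_timeout_ms
    return timeout

print(completion_timeout_ms(640))  # 690
```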
> +
> + return msecs_to_jiffies(timeout);
> +}
> +
> static bool next_heartbeat(struct intel_engine_cs *engine)
> {
> long delay;
> @@ -29,6 +41,26 @@ static bool next_heartbeat(struct intel_engine_cs *engine)
> return false;
>
> delay = msecs_to_jiffies_timeout(delay);
> +
> + /*
> + * Once we submit a heartbeat to the HW, we know that it will take
> + * at least a certain amount of time to complete. On a hanging system
> + * it will first have to wait for the preempt reset timeout, and
> + * then it will take some time for the reset to resume with the
> + * heartbeat and for it to complete. So once we have submitted the
> + * heartbeat to HW, we can wait a while longer before declaring the
> + * engine stuck and forcing a reset ourselves. If we do a reset
> + * and the engine is also doing a reset, it is possible that we
> + * reset the engine twice, harming an innocent.
> + *
> + * Before we have sumitted the heartbeat, we do not want to change
s/sumitted/submitted.
> + * the interval as we to promote the heartbeat and trigger preemption
s/we to/we want to/ ?
> + * in a deterministic time frame.
> + */
> + if (engine->heartbeat.systole &&
> + i915_request_is_active(engine->heartbeat.systole))
> + delay = max(delay, completion_timeout(engine));
I see no harm in just always switching to the max.
> +
> if (delay >= HZ)
> delay = round_jiffies_up_relative(delay);
> mod_delayed_work(system_highpri_wq, &engine->heartbeat.work, delay + 1);
> @@ -48,12 +80,49 @@ heartbeat_create(struct intel_context *ce, gfp_t gfp)
> return rq;
> }
>
> +static void defibrillator(struct dma_fence *f, struct dma_fence_cb *cb)
> +{
> + struct intel_engine_cs *engine =
> + container_of(cb, typeof(*engine), heartbeat.cb);
> +
> + if (READ_ONCE(engine->heartbeat.systole))
This particular spot is not a problem, but we do manipulate the systole
without a lock in heartbeat(). So I see a race in idle_pulse() where we
swap the systole.
-Mika
> + mod_delayed_work(system_highpri_wq, &engine->heartbeat.work, 0);
> +}
> +
> +static void
> +untrack_heartbeat(struct intel_engine_cs *engine)
> +{
> + struct i915_request *rq;
> +
> + rq = fetch_and_zero(&engine->heartbeat.systole);
> + if (!rq)
> + return;
> +
> + ENGINE_TRACE(engine, "heartbeat " RQ_FMT "completed\n", RQ_ARG(rq));
> +
> + dma_fence_remove_callback(&rq->fence, &engine->heartbeat.cb);
> + i915_request_put(rq);
> +}
> +
> +static void
> +track_heartbeat(struct intel_engine_cs *engine, struct i915_request *rq)
> +{
> + ENGINE_TRACE(engine, "heartbeat " RQ_FMT "started\n", RQ_ARG(rq));
> +
> + dma_fence_add_callback(&rq->fence,
> + &engine->heartbeat.cb,
> + defibrillator);
> + engine->heartbeat.systole = i915_request_get(rq);
> + if (!next_heartbeat(engine))
> + untrack_heartbeat(engine);
> +}
> +
> static void idle_pulse(struct intel_engine_cs *engine, struct i915_request *rq)
> {
> engine->wakeref_serial = READ_ONCE(engine->serial) + 1;
> i915_request_add_active_barriers(rq);
> if (!engine->heartbeat.systole && intel_engine_has_heartbeat(engine))
> - engine->heartbeat.systole = i915_request_get(rq);
> + track_heartbeat(engine, rq);
> }
>
> static void heartbeat_commit(struct i915_request *rq,
> @@ -106,13 +175,8 @@ static void heartbeat(struct work_struct *wrk)
> intel_engine_flush_scheduler(engine);
>
> rq = engine->heartbeat.systole;
> - if (rq && i915_request_completed(rq)) {
> - ENGINE_TRACE(engine,
> - "heartbeat " RQ_FMT "completed\n",
> - RQ_ARG(rq));
> - i915_request_put(rq);
> - engine->heartbeat.systole = NULL;
> - }
> + if (rq && i915_request_completed(rq))
> + untrack_heartbeat(engine);
>
> if (!intel_engine_pm_get_if_awake(engine))
> return;
> @@ -180,6 +244,11 @@ static void heartbeat(struct work_struct *wrk)
> goto out;
> }
>
> + /* Just completed one heartbeat, wait a tick before the next */
> + if (rq)
> + goto out;
> +
> + /* The engine is parking. We can rest until the next user */
> serial = READ_ONCE(engine->serial);
> if (engine->wakeref_serial == serial)
> goto out;
> @@ -198,14 +267,14 @@ static void heartbeat(struct work_struct *wrk)
> if (IS_ERR(rq))
> goto unlock;
>
> - ENGINE_TRACE(engine, "heartbeat " RQ_FMT "started\n", RQ_ARG(rq));
> heartbeat_commit(rq, &attr);
>
> unlock:
> mutex_unlock(&ce->timeline->mutex);
> out:
> + intel_engine_flush_scheduler(engine);
> if (!engine->i915->params.enable_hangcheck || !next_heartbeat(engine))
> - i915_request_put(fetch_and_zero(&engine->heartbeat.systole));
> + untrack_heartbeat(engine);
> intel_engine_pm_put(engine);
> }
>
> @@ -219,8 +288,10 @@ void intel_engine_unpark_heartbeat(struct intel_engine_cs *engine)
>
> void intel_engine_park_heartbeat(struct intel_engine_cs *engine)
> {
> - if (cancel_delayed_work(&engine->heartbeat.work))
> - i915_request_put(fetch_and_zero(&engine->heartbeat.systole));
> + /* completion may rearm work */
> + while (cancel_delayed_work(&engine->heartbeat.work))
> + ;
> + untrack_heartbeat(engine);
> }
>
> void intel_engine_init_heartbeat(struct intel_engine_cs *engine)
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 7efa6290cc3e..d27a44070cb1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -322,6 +322,7 @@ struct intel_engine_cs {
> struct {
> struct delayed_work work;
> struct i915_request *systole;
> + struct dma_fence_cb cb;
> unsigned long blocked;
> } heartbeat;
>
> diff --git a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> index b2c369317bf1..812c4a168b01 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_engine_heartbeat.c
> @@ -202,47 +202,44 @@ static int cmp_u32(const void *_a, const void *_b)
>
> static int __live_heartbeat_fast(struct intel_engine_cs *engine)
> {
> - const unsigned int error_threshold = max(20000u, jiffies_to_usecs(6));
> - struct intel_context *ce;
> - struct i915_request *rq;
> - ktime_t t0, t1;
> + const unsigned int error_threshold =
> + max(3 * HEARTBEAT_COMPLETION * 1000, jiffies_to_usecs(6));
> + struct intel_context *ce = engine->kernel_context;
> u32 times[5];
> int err;
> int i;
>
> - ce = intel_context_create(engine);
> - if (IS_ERR(ce))
> - return PTR_ERR(ce);
> -
> intel_engine_pm_get(engine);
>
> err = intel_engine_set_heartbeat(engine, 1);
> if (err)
> goto err_pm;
>
> + flush_delayed_work(&engine->heartbeat.work);
> + while (engine->heartbeat.systole)
> + intel_engine_park_heartbeat(engine);
> +
> for (i = 0; i < ARRAY_SIZE(times); i++) {
> - do {
> - /* Manufacture a tick */
> - intel_engine_park_heartbeat(engine);
> - GEM_BUG_ON(engine->heartbeat.systole);
> - engine->serial++; /* pretend we are not idle! */
> - intel_engine_unpark_heartbeat(engine);
> + struct i915_sched_attr attr = { .priority = I915_PRIORITY_MIN };
> + struct i915_request *rq;
> + ktime_t t0, t1;
>
> - flush_delayed_work(&engine->heartbeat.work);
> - if (!delayed_work_pending(&engine->heartbeat.work)) {
> - pr_err("%s: heartbeat %d did not start\n",
> - engine->name, i);
> - err = -EINVAL;
> - goto err_pm;
> - }
> + GEM_BUG_ON(READ_ONCE(engine->heartbeat.systole));
>
> - rcu_read_lock();
> - rq = READ_ONCE(engine->heartbeat.systole);
> - if (rq)
> - rq = i915_request_get_rcu(rq);
> - rcu_read_unlock();
> - } while (!rq);
> + /* Manufacture a tick */
> + mutex_lock(&ce->timeline->mutex);
> + rq = heartbeat_create(ce, GFP_KERNEL);
> + if (!IS_ERR(rq)) {
> + i915_request_get(rq);
> + heartbeat_commit(rq, &attr);
> + }
> + mutex_unlock(&ce->timeline->mutex);
> + if (IS_ERR(rq)) {
> + err = PTR_ERR(rq);
> + goto err_reset;
> + }
>
> + /* Time how long before the heartbeat monitor checks */
> t0 = ktime_get();
> while (rq == READ_ONCE(engine->heartbeat.systole))
> yield(); /* work is on the local cpu! */
> @@ -275,10 +272,10 @@ static int __live_heartbeat_fast(struct intel_engine_cs *engine)
> err = -EINVAL;
> }
>
> +err_reset:
> reset_heartbeat(engine);
> err_pm:
> intel_engine_pm_put(engine);
> - intel_context_put(ce);
> return err;
> }
>
> @@ -308,20 +305,16 @@ static int __live_heartbeat_off(struct intel_engine_cs *engine)
>
> intel_engine_pm_get(engine);
>
> + /* Kick once, so that we change an active heartbeat */
> engine->serial++;
> - flush_delayed_work(&engine->heartbeat.work);
> - if (!delayed_work_pending(&engine->heartbeat.work)) {
> - pr_err("%s: heartbeat not running\n",
> - engine->name);
> - err = -EINVAL;
> - goto err_pm;
> - }
> + intel_engine_unpark_heartbeat(engine);
>
> err = intel_engine_set_heartbeat(engine, 0);
> if (err)
> goto err_pm;
>
> - engine->serial++;
> + /* The next heartbeat work should cancel the heartbeat */
> + engine->serial++; /* pretend the engine is still active */
> flush_delayed_work(&engine->heartbeat.work);
> if (delayed_work_pending(&engine->heartbeat.work)) {
> pr_err("%s: heartbeat still running\n",
> diff --git a/drivers/gpu/drm/i915/gt/selftest_execlists.c b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> index f625c29023ea..04ded3a2d491 100644
> --- a/drivers/gpu/drm/i915/gt/selftest_execlists.c
> +++ b/drivers/gpu/drm/i915/gt/selftest_execlists.c
> @@ -2325,13 +2325,16 @@ static int __cancel_fail(struct live_preempt_cancel *arg)
> del_timer_sync(&engine->execlists.preempt);
> intel_engine_flush_scheduler(engine);
>
> + engine->props.preempt_timeout_ms = 0;
> cancel_reset_timeout(engine);
>
> - /* after failure, require heartbeats to reset device */
> + /* after failure, require fast heartbeats to reset device */
> intel_engine_set_heartbeat(engine, 1);
> err = wait_for_reset(engine, rq, HZ / 2);
> intel_engine_set_heartbeat(engine,
> engine->defaults.heartbeat_interval_ms);
> +
> + engine->props.preempt_timeout_ms = engine->defaults.preempt_timeout_ms;
> if (err) {
> pr_err("Cancelled inflight0 request did not reset\n");
> goto out;
> --
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx