From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Matthew Auld <matthew.auld@intel.com>
Subject: Re: [PATCH 2/9] drm/i915/gt: Close race between engine_park and intel_gt_retire_requests
Date: Wed, 20 Nov 2019 13:19:30 +0000 [thread overview]
Message-ID: <f4ce9baf-69fa-784d-c0d8-9f108996af42@linux.intel.com> (raw)
In-Reply-To: <20191120093302.3723715-2-chris@chris-wilson.co.uk>
On 20/11/2019 09:32, Chris Wilson wrote:
> The general concept was that intel_timeline.active_count was locked by
> the intel_timeline.mutex. The exception was for power management, where
> the engine->kernel_context->timeline could be manipulated under the
> global wakeref.mutex.
>
> This was quite solid, as we always manipulated the timeline only while
> we held an engine wakeref.
>
> And then we started retiring requests outside of struct_mutex, only
> using the timelines.active_list and the timeline->mutex. There we
> started manipulating intel_timeline.active_count outside of an engine
> wakeref, and so introduced a race between __engine_park() and
> intel_gt_retire_requests(), a race that could result in the
> engine->kernel_context not being added to the active timelines and so
> losing requests, which caused us to keep the system permanently powered
> up [and unloadable].
>
> The race would be easy to close if we could take the engine wakeref for
> the timeline before we retire -- except timelines are not bound to any
> engine and so we would need to keep all active engines awake. The
> alternative is to guard intel_timeline_enter/intel_timeline_exit for use
> outside of the timeline->mutex.
>
> Fixes: e5dadff4b093 ("drm/i915: Protect request retirement with timeline->mutex")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 8 ++---
> drivers/gpu/drm/i915/gt/intel_timeline.c | 34 +++++++++++++++----
> .../gpu/drm/i915/gt/intel_timeline_types.h | 2 +-
> 3 files changed, 32 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> index 25291e2af21e..1a005da8c588 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> @@ -49,8 +49,8 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
> continue;
>
> intel_timeline_get(tl);
> - GEM_BUG_ON(!tl->active_count);
> - tl->active_count++; /* pin the list element */
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + atomic_inc(&tl->active_count); /* pin the list element */
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> if (timeout > 0) {
> @@ -71,14 +71,14 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
>
> /* Resume iteration after dropping lock */
> list_safe_reset_next(tl, tn, link);
> - if (!--tl->active_count)
> + if (atomic_dec_and_test(&tl->active_count))
> list_del(&tl->link);
>
> mutex_unlock(&tl->mutex);
>
> /* Defer the final release to after the spinlock */
> if (refcount_dec_and_test(&tl->kref.refcount)) {
> - GEM_BUG_ON(tl->active_count);
> + GEM_BUG_ON(atomic_read(&tl->active_count));
> list_add(&tl->link, &free);
> }
> }
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 0e277835aad0..b35f12729983 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -334,15 +334,33 @@ void intel_timeline_enter(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /*
> + * Pretend we are serialised by the timeline->mutex.
> + *
> + * While generally true, there are a few exceptions to the rule
> + * for the engine->kernel_context being used to manage power
> + * transitions. As the engine_park may be called from under any
> + * timeline, it uses the power mutex as a global serialisation
> + * lock to prevent any other request entering its timeline.
> + *
> + * The rule is generally tl->mutex, otherwise engine->wakeref.mutex.
> + *
> + * However, intel_gt_retire_request() does not know which engine
> + * it is retiring along and so cannot partake in the engine-pm
> + * barrier, and there we use the tl->active_count as a means to
> + * pin the timeline in the active_list while the locks are dropped.
> + * Ergo, as that is outside of the engine-pm barrier, we need to
> + * use atomic to manipulate tl->active_count.
> + */
> lockdep_assert_held(&tl->mutex);
> -
> GEM_BUG_ON(!atomic_read(&tl->pin_count));
> - if (tl->active_count++)
> +
> + if (atomic_add_unless(&tl->active_count, 1, 0))
> return;
> - GEM_BUG_ON(!tl->active_count); /* overflow? */
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_add_tail(&tl->link, &timelines->active_list);
> + if (!atomic_fetch_inc(&tl->active_count))
> + list_add_tail(&tl->link, &timelines->active_list);
> spin_unlock_irqrestore(&timelines->lock, flags);
> }
>
> @@ -351,14 +369,16 @@ void intel_timeline_exit(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /* See intel_timeline_enter() */
> lockdep_assert_held(&tl->mutex);
>
> - GEM_BUG_ON(!tl->active_count);
> - if (--tl->active_count)
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + if (atomic_add_unless(&tl->active_count, -1, 1))
> return;
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_del(&tl->link);
> + if (atomic_dec_and_test(&tl->active_count))
> + list_del(&tl->link);
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> /*
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline_types.h b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> index 98d9ee166379..5244615ed1cb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> @@ -42,7 +42,7 @@ struct intel_timeline {
> * from the intel_context caller plus internal atomicity.
> */
> atomic_t pin_count;
> - unsigned int active_count;
> + atomic_t active_count;
>
> const u32 *hwsp_seqno;
> struct i915_vma *hwsp_ggtt;
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
WARNING: multiple messages have this Message-ID (diff)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>, intel-gfx@lists.freedesktop.org
Cc: Matthew Auld <matthew.auld@intel.com>
Subject: Re: [Intel-gfx] [PATCH 2/9] drm/i915/gt: Close race between engine_park and intel_gt_retire_requests
Date: Wed, 20 Nov 2019 13:19:30 +0000 [thread overview]
Message-ID: <f4ce9baf-69fa-784d-c0d8-9f108996af42@linux.intel.com> (raw)
Message-ID: <20191120131930.Ke0wrmi-8MgKiWJv0Bd1kIDblyjyCkXlB65UKYmKV4Y@z> (raw)
In-Reply-To: <20191120093302.3723715-2-chris@chris-wilson.co.uk>
On 20/11/2019 09:32, Chris Wilson wrote:
> The general concept was that intel_timeline.active_count was locked by
> the intel_timeline.mutex. The exception was for power management, where
> the engine->kernel_context->timeline could be manipulated under the
> global wakeref.mutex.
>
> This was quite solid, as we always manipulated the timeline only while
> we held an engine wakeref.
>
> And then we started retiring requests outside of struct_mutex, only
> using the timelines.active_list and the timeline->mutex. There we
> started manipulating intel_timeline.active_count outside of an engine
> wakeref, and so introduced a race between __engine_park() and
> intel_gt_retire_requests(), a race that could result in the
> engine->kernel_context not being added to the active timelines and so
> losing requests, which caused us to keep the system permanently powered
> up [and unloadable].
>
> The race would be easy to close if we could take the engine wakeref for
> the timeline before we retire -- except timelines are not bound to any
> engine and so we would need to keep all active engines awake. The
> alternative is to guard intel_timeline_enter/intel_timeline_exit for use
> outside of the timeline->mutex.
>
> Fixes: e5dadff4b093 ("drm/i915: Protect request retirement with timeline->mutex")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 8 ++---
> drivers/gpu/drm/i915/gt/intel_timeline.c | 34 +++++++++++++++----
> .../gpu/drm/i915/gt/intel_timeline_types.h | 2 +-
> 3 files changed, 32 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> index 25291e2af21e..1a005da8c588 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
> @@ -49,8 +49,8 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
> continue;
>
> intel_timeline_get(tl);
> - GEM_BUG_ON(!tl->active_count);
> - tl->active_count++; /* pin the list element */
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + atomic_inc(&tl->active_count); /* pin the list element */
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> if (timeout > 0) {
> @@ -71,14 +71,14 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout)
>
> /* Resume iteration after dropping lock */
> list_safe_reset_next(tl, tn, link);
> - if (!--tl->active_count)
> + if (atomic_dec_and_test(&tl->active_count))
> list_del(&tl->link);
>
> mutex_unlock(&tl->mutex);
>
> /* Defer the final release to after the spinlock */
> if (refcount_dec_and_test(&tl->kref.refcount)) {
> - GEM_BUG_ON(tl->active_count);
> + GEM_BUG_ON(atomic_read(&tl->active_count));
> list_add(&tl->link, &free);
> }
> }
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline.c b/drivers/gpu/drm/i915/gt/intel_timeline.c
> index 0e277835aad0..b35f12729983 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline.c
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline.c
> @@ -334,15 +334,33 @@ void intel_timeline_enter(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /*
> + * Pretend we are serialised by the timeline->mutex.
> + *
> + * While generally true, there are a few exceptions to the rule
> + * for the engine->kernel_context being used to manage power
> + * transitions. As the engine_park may be called from under any
> + * timeline, it uses the power mutex as a global serialisation
> + * lock to prevent any other request entering its timeline.
> + *
> + * The rule is generally tl->mutex, otherwise engine->wakeref.mutex.
> + *
> + * However, intel_gt_retire_request() does not know which engine
> + * it is retiring along and so cannot partake in the engine-pm
> + * barrier, and there we use the tl->active_count as a means to
> + * pin the timeline in the active_list while the locks are dropped.
> + * Ergo, as that is outside of the engine-pm barrier, we need to
> + * use atomic to manipulate tl->active_count.
> + */
> lockdep_assert_held(&tl->mutex);
> -
> GEM_BUG_ON(!atomic_read(&tl->pin_count));
> - if (tl->active_count++)
> +
> + if (atomic_add_unless(&tl->active_count, 1, 0))
> return;
> - GEM_BUG_ON(!tl->active_count); /* overflow? */
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_add_tail(&tl->link, &timelines->active_list);
> + if (!atomic_fetch_inc(&tl->active_count))
> + list_add_tail(&tl->link, &timelines->active_list);
> spin_unlock_irqrestore(&timelines->lock, flags);
> }
>
> @@ -351,14 +369,16 @@ void intel_timeline_exit(struct intel_timeline *tl)
> struct intel_gt_timelines *timelines = &tl->gt->timelines;
> unsigned long flags;
>
> + /* See intel_timeline_enter() */
> lockdep_assert_held(&tl->mutex);
>
> - GEM_BUG_ON(!tl->active_count);
> - if (--tl->active_count)
> + GEM_BUG_ON(!atomic_read(&tl->active_count));
> + if (atomic_add_unless(&tl->active_count, -1, 1))
> return;
>
> spin_lock_irqsave(&timelines->lock, flags);
> - list_del(&tl->link);
> + if (atomic_dec_and_test(&tl->active_count))
> + list_del(&tl->link);
> spin_unlock_irqrestore(&timelines->lock, flags);
>
> /*
> diff --git a/drivers/gpu/drm/i915/gt/intel_timeline_types.h b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> index 98d9ee166379..5244615ed1cb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_timeline_types.h
> @@ -42,7 +42,7 @@ struct intel_timeline {
> * from the intel_context caller plus internal atomicity.
> */
> atomic_t pin_count;
> - unsigned int active_count;
> + atomic_t active_count;
>
> const u32 *hwsp_seqno;
> struct i915_vma *hwsp_ggtt;
>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Regards,
Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
next prev parent reply other threads:[~2019-11-20 13:19 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-20 9:32 [PATCH 1/9] drm/i915/selftests: Take a ref to the request we wait upon Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 9:32 ` [PATCH 2/9] drm/i915/gt: Close race between engine_park and intel_gt_retire_requests Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 13:19 ` Tvrtko Ursulin [this message]
2019-11-20 13:19 ` Tvrtko Ursulin
2019-11-20 9:32 ` [PATCH 3/9] drm/i915/gt: Unlock engine-pm after queuing the kernel context switch Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 11:58 ` Tvrtko Ursulin
2019-11-20 11:58 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 12:07 ` Chris Wilson
2019-11-20 12:07 ` [Intel-gfx] " Chris Wilson
2019-11-20 12:40 ` Tvrtko Ursulin
2019-11-20 12:40 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 12:44 ` Chris Wilson
2019-11-20 12:44 ` [Intel-gfx] " Chris Wilson
2019-11-20 13:19 ` Tvrtko Ursulin
2019-11-20 13:19 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 9:32 ` [PATCH 4/9] drm/i915: Mark up the calling context for intel_wakeref_put() Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 12:46 ` Tvrtko Ursulin
2019-11-20 12:46 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 9:32 ` [PATCH 5/9] drm/i915/gt: Declare timeline.lock to be irq-free Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 9:32 ` [PATCH 6/9] drm/i915/selftests: Force bonded submission to overlap Chris Wilson
2019-11-20 9:32 ` [Intel-gfx] " Chris Wilson
2019-11-20 12:55 ` Tvrtko Ursulin
2019-11-20 12:55 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 12:59 ` Chris Wilson
2019-11-20 12:59 ` [Intel-gfx] " Chris Wilson
2019-11-20 13:18 ` Tvrtko Ursulin
2019-11-20 13:18 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 13:29 ` Chris Wilson
2019-11-20 13:29 ` [Intel-gfx] " Chris Wilson
2019-11-22 9:34 ` Tvrtko Ursulin
2019-11-22 9:34 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-22 10:03 ` Chris Wilson
2019-11-22 10:03 ` [Intel-gfx] " Chris Wilson
2019-11-22 10:43 ` [PATCH] " Chris Wilson
2019-11-22 10:43 ` [Intel-gfx] " Chris Wilson
2019-11-20 9:33 ` [PATCH 7/9] drm/i915/selftests: Flush the active callbacks Chris Wilson
2019-11-20 9:33 ` [Intel-gfx] " Chris Wilson
2019-11-20 9:33 ` [PATCH 8/9] drm/i915/selftests: Be explicit in ERR_PTR handling Chris Wilson
2019-11-20 9:33 ` [Intel-gfx] " Chris Wilson
2019-11-20 10:23 ` Matthew Auld
2019-11-20 10:23 ` [Intel-gfx] " Matthew Auld
2019-11-20 9:33 ` [PATCH 9/9] drm/i915/gt: Schedule request retirement when timeline idles Chris Wilson
2019-11-20 9:33 ` [Intel-gfx] " Chris Wilson
2019-11-20 13:16 ` Tvrtko Ursulin
2019-11-20 13:16 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 13:39 ` Chris Wilson
2019-11-20 13:39 ` [Intel-gfx] " Chris Wilson
2019-11-20 14:16 ` Tvrtko Ursulin
2019-11-20 14:16 ` [Intel-gfx] " Tvrtko Ursulin
2019-11-20 14:25 ` Chris Wilson
2019-11-20 14:25 ` [Intel-gfx] " Chris Wilson
2019-11-20 10:19 ` [PATCH 1/9] drm/i915/selftests: Take a ref to the request we wait upon Matthew Auld
2019-11-20 10:19 ` [Intel-gfx] " Matthew Auld
2019-11-20 10:25 ` Chris Wilson
2019-11-20 10:25 ` [Intel-gfx] " Chris Wilson
2019-11-20 10:27 ` [PATCH] " Chris Wilson
2019-11-20 10:27 ` [Intel-gfx] " Chris Wilson
2019-11-20 10:34 ` Matthew Auld
2019-11-20 10:34 ` [Intel-gfx] " Matthew Auld
2019-11-20 13:51 ` ✗ Fi.CI.CHECKPATCH: warning for series starting with drm/i915/selftests: Take a ref to the request we wait upon (rev2) Patchwork
2019-11-20 13:51 ` [Intel-gfx] " Patchwork
2019-11-20 14:19 ` ✗ Fi.CI.BAT: failure " Patchwork
2019-11-20 14:19 ` [Intel-gfx] " Patchwork
2019-11-20 14:20 ` Chris Wilson
2019-11-20 14:20 ` [Intel-gfx] " Chris Wilson
2019-11-22 12:33 ` ✗ Fi.CI.BUILD: failure for series starting with drm/i915/selftests: Take a ref to the request we wait upon (rev3) Patchwork
2019-11-22 12:33 ` [Intel-gfx] " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=f4ce9baf-69fa-784d-c0d8-9f108996af42@linux.intel.com \
--to=tvrtko.ursulin@linux.intel.com \
--cc=chris@chris-wilson.co.uk \
--cc=intel-gfx@lists.freedesktop.org \
--cc=matthew.auld@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.