Re: [Intel-gfx] [PATCH v3 1/4] drm/i915/guc: Limit scheduling properties to avoid overflow

Intel-GFX Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: John Harrison <john.c.harrison@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	<Intel-GFX@Lists.FreeDesktop.Org>
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH v3 1/4] drm/i915/guc: Limit scheduling properties to avoid overflow
Date: Wed, 9 Mar 2022 13:10:51 -0800	[thread overview]
Message-ID: <565b4a87-8bd2-ff1e-84e4-95b15238845a@intel.com> (raw)
In-Reply-To: <897489fb-e0d4-78bc-11d0-d7ec9c20ae07@linux.intel.com>

On 3/8/2022 01:43, Tvrtko Ursulin wrote:
> On 03/03/2022 22:37, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> GuC converts the pre-emption timeout and timeslice quantum values into
>> clock ticks internally. That significantly reduces the point of 32bit
>> overflow. On current platforms, worst case scenario is approximately
>> 110 seconds. Rather than allowing the user to set higher values and
>> then get confused by early timeouts, add limits when setting these
>> values.
>>
>> v2: Add helper functins for clamping (review feedback from Tvrtko).
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com> 
>> (v1)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b3a429a92c0d..8208164c25e7 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -2218,13 +2218,24 @@ static inline u32 
> get_children_join_value(struct intel_context *ce,
>  static void guc_context_policy_init(struct intel_engine_cs *engine,
>                                     struct guc_lrc_desc *desc)
>  {
> +       struct drm_device *drm = &engine->i915->drm;
> +
>         desc->policy_flags = 0;
>
>         if (engine->flags & I915_ENGINE_WANT_FORCED_PREEMPTION)
>                 desc->policy_flags |= 
> CONTEXT_POLICY_FLAG_PREEMPT_TO_IDLE;
>
>         /* NB: For both of these, zero means disabled. */
> +       if (overflows_type(engine->props.timeslice_duration_ms * 1000,
> +                          desc->execution_quantum))
> +               drm_warn_once(drm, "GuC interface cannot support %lums 
> timeslice!\n",
> + engine->props.timeslice_duration_ms);
>         desc->execution_quantum = engine->props.timeslice_duration_ms 
> * 1000;
> +
> +       if (overflows_type(engine->props.preempt_timeout_ms * 1000,
> +                          desc->preemption_timeout))
> +               drm_warn_once(drm, "GuC interface cannot support %lums 
> preemption timeout!\n",
> + engine->props.preempt_timeout_ms);
>         desc->preemption_timeout = engine->props.preempt_timeout_ms * 
> 1000;
>  }
As previously explained, this is wrong. If the check must be present 
then it should be a BUG_ON as it is indicative of an internal driver 
failure. There is already a top level helper function for ensuring all 
range checks are done and the value is valid. If that is broken then 
that is a bug and should have been caught in pre-merge testing or code 
review. It is not possible for a bad value to get beyond that helper 
function. That is the whole point of the helper. We do not double bag 
every other value check in the driver. Once you have passed input 
validation, the values are assumed to be correct. Otherwise we would 
have every other line of code be a value check! And if somehow a bad 
value did make it through, simply printing a once shot warning is 
pointless. You are still going to get undefined behaviour potentially 
leading to a totally broken system. E.g. your very big timeout has 
overflowed and become extremely small, thus no batch buffer can ever 
complete because they all get reset before they have even finished the 
context switch in. That is a fundamentally broken system.

John.


>
>
> With that:
>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Regards,
>
> Tvrtko
>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine.h      |  6 ++
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 69 +++++++++++++++++++++
>>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 25 +++++---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++
>>   4 files changed, 99 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h 
>> b/drivers/gpu/drm/i915/gt/intel_engine.h
>> index 1c0ab05c3c40..d7044c4e526e 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
>> @@ -351,4 +351,10 @@ intel_engine_get_hung_context(struct 
>> intel_engine_cs *engine)
>>       return engine->hung_ce;
>>   }
>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>> *engine, u64 value);
>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>> *engine, u64 value);
>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>> u64 value);
>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>> value);
>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>> *engine, u64 value);
>> +
>>   #endif /* _INTEL_RINGBUFFER_H_ */
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 7447411a5b26..22e70e4e007c 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -442,6 +442,26 @@ static int intel_engine_setup(struct intel_gt 
>> *gt, enum intel_engine_id id,
>>           engine->flags |= I915_ENGINE_HAS_EU_PRIORITY;
>>       }
>>   +    /* Cap properties according to any system limits */
>> +#define CLAMP_PROP(field) \
>> +    do { \
>> +        u64 clamp = intel_clamp_##field(engine, engine->props.field); \
>> +        if (clamp != engine->props.field) { \
>> +            drm_notice(&engine->i915->drm, \
>> +                   "Warning, clamping %s to %lld to prevent 
>> overflow\n", \
>> +                   #field, clamp); \
>> +            engine->props.field = clamp; \
>> +        } \
>> +    } while (0)
>> +
>> +    CLAMP_PROP(heartbeat_interval_ms);
>> +    CLAMP_PROP(max_busywait_duration_ns);
>> +    CLAMP_PROP(preempt_timeout_ms);
>> +    CLAMP_PROP(stop_timeout_ms);
>> +    CLAMP_PROP(timeslice_duration_ms);
>> +
>> +#undef CLAMP_PROP
>> +
>>       engine->defaults = engine->props; /* never to change again */
>>         engine->context_size = intel_engine_context_size(gt, 
>> engine->class);
>> @@ -464,6 +484,55 @@ static int intel_engine_setup(struct intel_gt 
>> *gt, enum intel_engine_id id,
>>       return 0;
>>   }
>>   +u64 intel_clamp_heartbeat_interval_ms(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_max_busywait_duration_ns(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    value = min(value, jiffies_to_nsecs(2));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_preempt_timeout_ms(struct intel_engine_cs *engine, 
>> u64 value)
>> +{
>> +    /*
>> +     * NB: The GuC API only supports 32bit values. However, the 
>> limit is further
>> +     * reduced due to internal calculations which would otherwise 
>> overflow.
>> +     */
>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>> +        value = min_t(u64, value, GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>> +
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_stop_timeout_ms(struct intel_engine_cs *engine, u64 
>> value)
>> +{
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>> +u64 intel_clamp_timeslice_duration_ms(struct intel_engine_cs 
>> *engine, u64 value)
>> +{
>> +    /*
>> +     * NB: The GuC API only supports 32bit values. However, the 
>> limit is further
>> +     * reduced due to internal calculations which would otherwise 
>> overflow.
>> +     */
>> +    if (intel_guc_submission_is_wanted(&engine->gt->uc.guc))
>> +        value = min_t(u64, value, GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>> +
>> +    value = min_t(u64, value, jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT));
>> +
>> +    return value;
>> +}
>> +
>>   static void __setup_engine_capabilities(struct intel_engine_cs 
>> *engine)
>>   {
>>       struct drm_i915_private *i915 = engine->i915;
>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c 
>> b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> index 967031056202..f2d9858d827c 100644
>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> @@ -144,7 +144,7 @@ max_spin_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>              const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -168,7 +168,8 @@ max_spin_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_nsecs(2))
>> +    clamped = intel_clamp_max_busywait_duration_ns(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.max_busywait_duration_ns, duration);
>> @@ -203,7 +204,7 @@ timeslice_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>           const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -218,7 +219,8 @@ timeslice_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_timeslice_duration_ms(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>> @@ -256,7 +258,7 @@ stop_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>          const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long duration;
>> +    unsigned long long duration, clamped;
>>       int err;
>>         /*
>> @@ -272,7 +274,8 @@ stop_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_stop_timeout_ms(engine, duration);
>> +    if (duration != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.stop_timeout_ms, duration);
>> @@ -306,7 +309,7 @@ preempt_timeout_store(struct kobject *kobj, 
>> struct kobj_attribute *attr,
>>                 const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long timeout;
>> +    unsigned long long timeout, clamped;
>>       int err;
>>         /*
>> @@ -322,7 +325,8 @@ preempt_timeout_store(struct kobject *kobj, 
>> struct kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_preempt_timeout_ms(engine, timeout);
>> +    if (timeout != clamped)
>>           return -EINVAL;
>>         WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>> @@ -362,7 +366,7 @@ heartbeat_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>           const char *buf, size_t count)
>>   {
>>       struct intel_engine_cs *engine = kobj_to_engine(kobj);
>> -    unsigned long long delay;
>> +    unsigned long long delay, clamped;
>>       int err;
>>         /*
>> @@ -379,7 +383,8 @@ heartbeat_store(struct kobject *kobj, struct 
>> kobj_attribute *attr,
>>       if (err)
>>           return err;
>>   -    if (delay >= jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>> +    clamped = intel_clamp_heartbeat_interval_ms(engine, delay);
>> +    if (delay != clamped)
>>           return -EINVAL;
>>         err = intel_engine_set_heartbeat(engine, delay);
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index 4b300b6cc0f9..a2d574f2fdd5 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -262,6 +262,15 @@ struct guc_lrc_desc {
>>     #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>>   +/*
>> + * GuC converts the timeout to clock ticks internally. Different 
>> platforms have
>> + * different GuC clocks. Thus, the maximum value before overflow is 
>> platform
>> + * dependent. Current worst case scenario is about 110s. So, limit 
>> to 100s to be
>> + * safe.
>> + */
>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS        (100 * 1000)
>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS    (100 * 1000)
>> +
>>   struct guc_policies {
>>       u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>       /* In micro seconds. How much time to allow before DPC 
>> processing is

next prev parent reply	other threads:[~2022-03-09 21:11 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-03 22:37 [Intel-gfx] [PATCH v3 0/4] Improve anti-pre-emption w/a for compute workloads John.C.Harrison
2022-03-03 22:37 ` [Intel-gfx] [PATCH v3 1/4] drm/i915/guc: Limit scheduling properties to avoid overflow John.C.Harrison
2022-03-08  9:43   ` Tvrtko Ursulin
2022-03-09 21:10     ` John Harrison [this message]
2022-03-03 22:37 ` [Intel-gfx] [PATCH v3 2/4] drm/i915: Fix compute pre-emption w/a to apply to compute engines John.C.Harrison
2022-03-03 23:16   ` Matt Roper
2022-03-03 22:37 ` [Intel-gfx] [PATCH v3 3/4] drm/i915: Make the heartbeat play nice with long pre-emption timeouts John.C.Harrison
2022-03-03 22:37 ` [Intel-gfx] [PATCH v3 4/4] drm/i915: Improve long running OCL w/a for GuC submission John.C.Harrison
2022-03-08  9:03   ` Mrozek, Michal
2022-03-08  9:41   ` Tvrtko Ursulin
2022-03-09 21:16     ` John Harrison
2022-03-10  9:27       ` Tvrtko Ursulin
2022-03-10 20:24         ` John Harrison
2022-03-11 10:07           ` Tvrtko Ursulin
2022-03-11 10:39             ` Tvrtko Ursulin
2022-03-04  0:55 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Improve anti-pre-emption w/a for compute workloads (rev4) Patchwork
2022-03-04  0:56 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2022-03-04  1:28 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-03-04 15:09 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2022-03-08 22:34 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for Improve anti-pre-emption w/a for compute workloads (rev6) Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=565b4a87-8bd2-ff1e-84e4-95b15238845a@intel.com \
    --to=john.c.harrison@intel.com \
    --cc=DRI-Devel@Lists.FreeDesktop.Org \
    --cc=Intel-GFX@Lists.FreeDesktop.Org \
    --cc=tvrtko.ursulin@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox