From: Tvrtko Ursulin
To: John.C.Harrison@Intel.com, Intel-GFX@Lists.FreeDesktop.Org
Cc: DRI-Devel@Lists.FreeDesktop.Org
Subject: Re: [Intel-gfx] [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow
Date: Tue, 22 Feb 2022 10:39:43 +0000
Message-ID: <1bc7257b-ddb6-1a5b-119f-3ef89cd5e634@linux.intel.com>
In-Reply-To: <0d0c5a79-1285-0830-3794-e9f0644811a5@linux.intel.com>
References: <20220218213307.1338478-1-John.C.Harrison@Intel.com>
 <20220218213307.1338478-2-John.C.Harrison@Intel.com>
 <0d0c5a79-1285-0830-3794-e9f0644811a5@linux.intel.com>

On 22/02/2022 09:52, Tvrtko Ursulin wrote:
>
> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>> From: John Harrison
>>
>> GuC converts the pre-emption timeout and timeslice quantum values into
>> clock ticks internally. That significantly reduces the point of 32bit
>> overflow. On current platforms, worst case scenario is approximately
>
> Where does 32-bit come from, the GuC side? We already use 64-bits, so
> that's something to fix to start with.

Yep...
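
As a sanity check on where the ~110s figure can come from, here is a
standalone back-of-the-envelope calculation. To be clear, this is only my
illustration and not from the patch: it assumes a roughly 38.4 MHz tick
reference and a 32-bit tick value on the GuC side, and the actual clock is
platform dependent:

	/* Standalone illustration, not driver code. */
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		const uint64_t tick_hz = 38400000ull;  /* assumed ~38.4 MHz reference */
		const uint64_t max_ticks = UINT32_MAX; /* assumed 32-bit tick value in GuC */

		/* Longest interval representable before the tick value wraps. */
		printf("~%llu ms before overflow\n",
		       (unsigned long long)(max_ticks * 1000 / tick_hz));
		return 0;
	}

That prints roughly 111848 ms, i.e. just under 112 seconds, which lines up
with the quoted ~110s worst case if the assumption about the clock holds.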
>
> ./gt/uc/intel_guc_fwif.h:       u32 execution_quantum;
>
> ./gt/uc/intel_guc_submission.c: desc->execution_quantum =
> engine->props.timeslice_duration_ms * 1000;
>
> ./gt/intel_engine_types.h:              unsigned long
> timeslice_duration_ms;
>
> timeslice_store/preempt_timeout_store:
> err = kstrtoull(buf, 0, &duration);
>
> So both kconfig and sysfs can already overflow GuC, not only because of
> tick conversion internally but because at the backend level nothing was
> done for assigning 64-bit into 32-bit. Or I failed to find where it is
> handled.
>
>> 110 seconds. Rather than allowing the user to set higher values and
>> then get confused by early timeouts, add limits when setting these
>> values.
>
> Btw who is reviewing GuC patches these days - things have somehow gotten
> pretty quiet in activity and I don't think that's due to an absence of
> stuff to improve or fix? Asking since I think I noticed a few already
> which you posted and then crickets on the mailing list.
>
>> Signed-off-by: John Harrison
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 15 +++++++++++++++
>>   drivers/gpu/drm/i915/gt/sysfs_engines.c     | 14 ++++++++++++++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++++++++
>>   3 files changed, 38 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index e53008b4dd05..2a1e9f36e6f5 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -389,6 +389,21 @@ static int intel_engine_setup(struct intel_gt
>> *gt, enum intel_engine_id id,
>>       if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
>>           engine->props.preempt_timeout_ms = 0;
>> +    /* Cap timeouts to prevent overflow inside GuC */
>> +    if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
>> +        if (engine->props.timeslice_duration_ms >
>> GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
>
> Hm "wanted".. There's been too much back and forth on the GuC load
> options over the years to keep track.. intel_engine_uses_guc sounds
> like it would work and read nicer.
>
> And limit to class instead of applying to all engines looks like a miss.

Sorry, the limit-to-class comment does not apply here, I confused this
with the last patch.

Regards,

Tvrtko

>
>> +            drm_info(&engine->i915->drm, "Warning, clamping timeslice
>> duration to %d to prevent possibly overflow\n",
>> +                 GUC_POLICY_MAX_EXEC_QUANTUM_MS);
>> +            engine->props.timeslice_duration_ms =
>> GUC_POLICY_MAX_EXEC_QUANTUM_MS;
>
> I am not sure logging such a message during driver load is useful. Sounds
> more like a confused driver which starts with one value and then
> overrides itself. I'd just silently set the value appropriate for the
> active backend. The preemption timeout kconfig text already documents
> the fact that timeouts can get overridden at runtime depending on
> platform+engine, so maybe just add the same text to the timeslice
> kconfig.
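
To illustrate the "just silently set it" option, something along these
lines is what I have in mind. Untested sketch only, using the existing
intel_engine_uses_guc() helper mentioned above and the GUC_POLICY_MAX_*
limits from this patch:

	/* In intel_engine_setup(): clamp silently, no log message. */
	if (intel_engine_uses_guc(engine)) {
		engine->props.timeslice_duration_ms =
			min_t(unsigned long,
			      engine->props.timeslice_duration_ms,
			      GUC_POLICY_MAX_EXEC_QUANTUM_MS);
		engine->props.preempt_timeout_ms =
			min_t(unsigned long,
			      engine->props.preempt_timeout_ms,
			      GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
	}

Plus the kconfig text update so the behaviour is documented rather than
logged.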
>
>> +        }
>> +
>> +        if (engine->props.preempt_timeout_ms >
>> GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +            drm_info(&engine->i915->drm, "Warning, clamping
>> pre-emption timeout to %d to prevent possibly overflow\n",
>> +                 GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
>> +            engine->props.preempt_timeout_ms =
>> GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +        }
>> +    }
>> +
>>       engine->defaults = engine->props; /* never to change again */
>>       engine->context_size = intel_engine_context_size(gt,
>> engine->class);
>> diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> index 967031056202..f57efe026474 100644
>> --- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> +++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
>> @@ -221,6 +221,13 @@ timeslice_store(struct kobject *kobj, struct
>> kobj_attribute *attr,
>>       if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>           return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        duration > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
>> +        duration = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
>> +        drm_info(&engine->i915->drm, "Warning, clamping timeslice
>> duration to %lld to prevent possibly overflow\n",
>> +             duration);
>> +    }
>
> I would suggest avoiding the duplicated clamping logic. Maybe hide all
> the backend logic in helpers then, like maybe:
>
>   d = intel_engine_validate_timeslice/preempt_timeout(engine, duration);
>   if (d != duration)
>     return -EINVAL;
>
> Returning -EINVAL would be equivalent to the existing behaviour:
>
>     if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>         return -EINVAL;
>
> That way userspace gets explicit notification and read-back is identical
> to the written-in value. From engine setup you can just call the helper
> silently.
>
>> +
>>       WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
>>       if (execlists_active(&engine->execlists))
>> @@ -325,6 +332,13 @@ preempt_timeout_store(struct kobject *kobj,
>> struct kobj_attribute *attr,
>>       if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
>>           return -EINVAL;
>> +    if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
>> +        timeout > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
>> +        timeout = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
>> +        drm_info(&engine->i915->drm, "Warning, clamping pre-emption
>> timeout to %lld to prevent possibly overflow\n",
>> +             timeout);
>> +    }
>> +
>>       WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
>>       if (READ_ONCE(engine->execlists.pending[0]))
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> index 6a4612a852e2..ad131092f8df 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
>> @@ -248,6 +248,15 @@ struct guc_lrc_desc {
>>   #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
>> +/*
>> + * GuC converts the timeout to clock ticks internally. Different
>> platforms have
>> + * different GuC clocks. Thus, the maximum value before overflow is
>> platform
>> + * dependent. Current worst case scenario is about 110s. So, limit to
>> 100s to be
>> + * safe.
>> + */
>> +#define GUC_POLICY_MAX_EXEC_QUANTUM_MS        (100 * 1000)
>> +#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS    (100 * 1000)
>
> Most important question - how will we know/notice if/when a new GuC
> arrives where these timeouts would still overflow? Can this be queried
> somehow at runtime, or where does the limit come from? How is GuC told
> about it? Is it just set in some field which silently allows too large
> values to break things?
>
> Regards,
>
> Tvrtko
>
>> +
>>   struct guc_policies {
>>       u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
>>       /* In micro seconds. How much time to allow before DPC
>> processing is
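
Circling back to the validate helper idea quoted above, this is roughly
the shape I mean. Untested sketch: the helper name is made up here, and
only the GUC_POLICY_MAX_* limits come from the patch:

	static u64 intel_engine_validate_timeslice(struct intel_engine_cs *engine,
						   u64 duration_ms)
	{
		/* Clamp to what the active backend can represent without overflow. */
		if (intel_engine_uses_guc(engine))
			duration_ms = min_t(u64, duration_ms,
					    GUC_POLICY_MAX_EXEC_QUANTUM_MS);

		return duration_ms;
	}

Then timeslice_store() can reject anything the backend would otherwise
truncate, so read-back stays identical to the written value:

	if (intel_engine_validate_timeslice(engine, duration) != duration)
		return -EINVAL;

while engine setup just assigns the clamped value silently. A matching
intel_engine_validate_preempt_timeout() would do the same against
GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS.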