Intel-XE Archive on lore.kernel.org
From: Tvrtko Ursulin <tursulin@ursulin.net>
To: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	"Umesh Nerlige Ramappa" <umesh.nerlige.ramappa@intel.com>,
	"Adrián Larumbe" <adrian.larumbe@collabora.com>,
	"Rob Clark" <robdclark@gmail.com>,
	"Alex Deucher" <alexander.deucher@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	"Maíra Canal" <mcanal@igalia.com>
Subject: Re: [PATCH v2 6/6] drm/xe/client: Print runtime to fdinfo
Date: Wed, 8 May 2024 09:23:17 +0100	[thread overview]
Message-ID: <88d6eeee-fa6f-4e5e-9304-22df8fb0f63c@ursulin.net> (raw)
In-Reply-To: <xwp77yi7y3e3f6eyqf3qqeawsv3nh4db4vwmok3pccdddnimce@n7rts73arupp>


On 07/05/2024 22:35, Lucas De Marchi wrote:
> On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>>
>> On 24/04/2024 00:56, Lucas De Marchi wrote:
>>> Print the accumulated runtime for a client when printing fdinfo.
>>> Each time a query is done, it first does 2 things:
>>>
>>> 1) loop through all the exec queues for the current client and
>>>    accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>>    that, being read from the context image.
>>>
>>> 2) Read a "GPU timestamp" that can be used for considering "how much GPU
>>>    time has passed" and that has the same unit/refclock as the one
>>>    recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>
>>> Since for all current platforms RING_TIMESTAMP follows the same
>>> refclock, just read it once, using any first engine.
>>>
>>> This is exported to userspace as 2 numbers in fdinfo:
>>>
>>>     drm-cycles-<class>: <RUNTIME>
>>>     drm-total-cycles-<class>: <TIMESTAMP>
>>>
>>> Userspace is expected to collect at least 2 samples, which allows it to
>>> compute the client engine busyness as:
>>>
>>>                 RUNTIME1 - RUNTIME0
>>>     busyness = ---------------------
>>>                       T1 - T0
>>>
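
As an illustration with made-up numbers (a minimal userspace sketch, not
part of the patch; both counters are in the same unspecified GPU clock
domain, so only the deltas matter):

    #include <stdio.h>

    int main(void)
    {
        /* two fdinfo samples, values invented for the example */
        unsigned long long runtime0 = 10000, total0 = 50000;
        unsigned long long runtime1 = 40000, total1 = 150000;

        double busyness = (double)(runtime1 - runtime0) /
                          (double)(total1 - total0);

        printf("busyness: %.1f%%\n", busyness * 100.0); /* prints 30.0% */
        return 0;
    }
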
>>> Another thing to point out is that userspace is expected to read any
>>> 2 samples every few seconds.  Given the update frequency of the
>>> counters involved and that CTX_TIMESTAMP is 32 bits, the counter for
>>> each exec_queue can wrap around (assuming 100% utilization) after ~200s.
>>> The wraparound is not perceived by userspace since the value is
>>> accumulated for all the exec_queues in a 64-bit counter, but the
>>> measurement will not be accurate if the samples are too far apart.
>>>
>>> This could be mitigated by adding a workqueue to accumulate the counters
>>> every so often, but that is additional complexity for something already
>>> done every few seconds by userspace tools like gputop (from igt), htop,
>>> nvtop, etc., none of which default to sampling intervals of a minute or
>>> more.
>>>
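
The wraparound point relies on plain unsigned arithmetic; as a sketch
(names hypothetical, the real helper in this series being
xe_exec_queue_update_runtime()):

    /*
     * Accumulate a 32-bit HW counter into a 64-bit SW counter. The
     * unsigned 32-bit subtraction stays correct across at most one
     * wraparound between two updates.
     */
    static void accumulate_runtime(u64 *sw_total, u32 *last_hw, u32 new_hw)
    {
        *sw_total += (u32)(new_hw - *last_hw);
        *last_hw = new_hw;
    }
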
>>> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>> ---
>>>  Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>>  Documentation/gpu/xe/index.rst              |   1 +
>>>  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>>  drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>>>  4 files changed, 162 insertions(+), 3 deletions(-)
>>>  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>>
>>> diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
>>> index 6dc299343b48..421766289b78 100644
>>> --- a/Documentation/gpu/drm-usage-stats.rst
>>> +++ b/Documentation/gpu/drm-usage-stats.rst
>>> @@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
>>>  was previously read, userspace is expected to stay with that larger previous
>>>  value until a monotonic update is seen.
>>> +- drm-total-cycles-<keystr>: <uint>
>>> +
>>> +Engine identifier string must be the same as the one specified in the
>>> +drm-cycles-<keystr> tag and shall contain the total number of cycles for
>>> +the given engine.
>>> +
>>> +This is a timestamp in an unspecified GPU unit that matches the update rate
>>> +of drm-cycles-<keystr>. For drivers that implement this interface, the
>>> +engine utilization can be calculated entirely in the GPU clock domain,
>>> +without considering the CPU sleep time between 2 samples.
>>
>> Two opens.
>>
>> 1)
>> Do we need to explicitly document that drm-total-cycles and drm-maxfreq
>> are mutually exclusive?
> 
> so userspace has a fallback mechanism to calculate utilization depending
> on what keys are available?

No, documenting all three at once does not make sense. Or at least is not 
expected. Or do you envisage someone might legitimately emit all three? I 
don't see what the semantics would be. When we have cycles+maxfreq the 
latter is in Hz, and when we have cycles+total it is unitless. All three?

>> 2)
>> Should drm-total-cycles, for now, be documented as driver specific?
> 
> you mean to call it xe-total-cycles?

Yes but it is not an ask, just an open.

>> I have added some more people to the cc who were involved with driver 
>> fdinfo implementations, in case they have an opinion.
>>
>> I would say potentially yes, and promote it to common if more than one 
>> driver would use it.
>>
>> For instance I see panfrost has the driver-specific drm-curfreq 
>> (although it isn't fully documented in panfrost.rst). And I have to 
>> say it is somewhat questionable to expose the current frequency per 
>> engine via fdinfo, but that's not my call.
> 
> aren't all the keys in Documentation/gpu/drm-usage-stats.rst optional,
> ones a driver may or may not implement? When you say driver-specific I'd
> think more of the ones not using "drm" as a prefix, e.g. amd-*.
> 
> I think drm-cycles + drm-total-cycles is just an alternative
> implementation for engine utilization, like drm-cycles + drm-maxfreq is
> already an alternative to drm-engine, one not implemented by e.g.
> amdgpu/i915.
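
As an aside, to make the "alternative implementations" point concrete: a
userspace tool could pick its method roughly like this (a sketch, with the
enum and availability flags invented for illustration):

    enum busy_method {
        BUSY_ENGINE_TIME,   /* drm-engine-<class>, nanoseconds */
        BUSY_CYCLES_TOTAL,  /* drm-cycles-<class> over drm-total-cycles-<class> */
        BUSY_CYCLES_FREQ,   /* drm-cycles-<class> with drm-maxfreq-<class> */
        BUSY_NONE,
    };

    static enum busy_method pick_method(bool has_engine, bool has_cycles,
                                        bool has_total_cycles)
    {
        if (has_engine)
            return BUSY_ENGINE_TIME;
        if (has_cycles && has_total_cycles)
            return BUSY_CYCLES_TOTAL;
        if (has_cycles)
            return BUSY_CYCLES_FREQ;
        return BUSY_NONE;
    }
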
> 
> I will submit a new version of the entire patch series to get the ball
> rolling, but let's keep this open for now.
> 
> <...>
> 
>>> +static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>> +{
>>> +    struct xe_file *xef = file->driver_priv;
>>> +    struct xe_device *xe = xef->xe;
>>> +    struct xe_gt *gt;
>>> +    struct xe_hw_engine *hwe;
>>> +    struct xe_exec_queue *q;
>>> +    unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
>>> +    u64 gpu_timestamp, engine_mask = 0;
>>> +    bool gpu_stamp = false;
>>> +
>>> +    xe_pm_runtime_get(xe);
>>> +
>>> +    /* Accumulate all the exec queues from this client */
>>> +    mutex_lock(&xef->exec_queue.lock);
>>> +    xa_for_each(&xef->exec_queue.xa, i, q)
>>> +        xe_exec_queue_update_runtime(q);
>>> +    mutex_unlock(&xef->exec_queue.lock);
>>> +
>>> +
>>> +    /* Calculate capacity of each engine class */
>>> +    BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>> +    for_each_gt(gt, xe, id_gt)
>>> +        engine_mask |= gt->info.engine_mask;
>>> +    for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>> +        capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>
>> FWIW the above two loops are static, so capacity could be stored in 
>> struct xe_device.
> 
> yes, but just creating a cache in xe of something derived from gt is not
> something to consider lightly, particularly considering the small number
> of xe->info.gt_count we have. For something that runs only when someone
> cats the fdinfo, this doesn't seem terrible.
> 
>>
>>> +
>>> +    /*
>>> +     * Iterate over all engines, printing the accumulated
>>> +     * runtime for this client, per engine class
>>> +     */
>>> +    for_each_gt(gt, xe, id_gt) {
>>> +        xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>> +        for_each_hw_engine(hwe, gt, id_hwe) {
>>> +            const char *class_name;
>>> +
>>> +            if (!capacity[hwe->class])
>>> +                continue;
>>> +
>>> +            /*
>>> +             * Use any (first) engine to have a timestamp to be used every
>>> +             * time
>>> +             */
>>> +            if (!gpu_stamp) {
>>> +                gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>> +                gpu_stamp = true;
>>> +            }
>>> +
>>> +            class_name = xe_hw_engine_class_to_str(hwe->class);
>>> +
>>> +            drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>> +                   class_name, xef->runtime[hwe->class]);
>>> +            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>> +                   class_name, gpu_timestamp);
>>> +
>>> +            if (capacity[hwe->class] > 1)
>>> +                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>> +                       class_name, capacity[hwe->class]);
>>> +
>>> +            /* engine class already handled, skip next iterations */
>>> +            capacity[hwe->class] = 0;
>>> +        }
>>> +        xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>> +    }
>>
>> More FWIW and AFAICT, could just walk the "list" of classes instead of 
> 
> xe_force_wake_get() is per gt, so the alternative would be... loop
> through the gts to get all forcewakes, loop through all engine classes,
> loop again through all gts to put the forcewake. And we also need to
> consider that an engine class may not be available in all GTs... example:
> vcs/vecs in MTL and later, so we need to track it globally across GTs
> anyway.

Forcewake is only needed once for the gpu_timestamp, no? At least I 
don't see any other potential hardware access in the loop. Hence I 
thought if you could have a known engine to get the timestamp outside 
the loop, you could then run a flat loop (over classes) avoiding the 
per-gt fw dance. Your choice ofc.
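
For illustration, something along these lines is what I had in mind
(rough and untested; xe_any_hw_engine() is a hypothetical helper that
does not exist today):

    static void show_runtime_flat(struct drm_printer *p, struct xe_file *xef,
                                  const unsigned long *capacity)
    {
        /* hypothetical: return any engine present on the device */
        struct xe_hw_engine *hwe = xe_any_hw_engine(xef->xe);
        u64 gpu_timestamp;
        int class;

        /* forcewake held only around the single timestamp read */
        xe_force_wake_get(gt_to_fw(hwe->gt), XE_FW_GT);
        gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
        xe_force_wake_put(gt_to_fw(hwe->gt), XE_FW_GT);

        for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
            const char *name = xe_hw_engine_class_to_str(class);

            if (!capacity[class])
                continue;

            drm_printf(p, "drm-cycles-%s:\t%llu\n",
                       name, xef->runtime[class]);
            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
                       name, gpu_timestamp);
            if (capacity[class] > 1)
                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
                           name, capacity[class]);
        }
    }
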

Regards,

Tvrtko

> 
>> the nested loop with skipping already visited classes. Assuming in xe 
>> there is an easy way to get a known-present engine for the 
>> gpu_timestamp, it would be flatter and less code.
> 
> from the device we can get either tile or gt. To work with the
> hw_engines we have to look inside the gt.
> 
> Lucas De Marchi
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +
>>> +    xe_pm_runtime_put(xe);
>>> +}
>>> +
>>>  /**
>>>   * xe_drm_client_fdinfo() - Callback for fdinfo interface
>>>   * @p: The drm_printer ptr
>>> @@ -192,5 +327,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>>>  void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
>>>  {
>>>      show_meminfo(p, file);
>>> +    show_runtime(p, file);
>>>  }
>>>  #endif
