From: Boris Brezillon <boris.brezillon@collabora.com>
To: Steven Price <steven.price@arm.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>,
Emil Velikov <emil.l.velikov@gmail.com>,
dri-devel@lists.freedesktop.org, Rob Herring <robh+dt@kernel.org>,
Mark Janes <mark.a.janes@intel.com>,
kernel@collabora.com, Alyssa Rosenzweig <alyssa@rosenzweig.io>
Subject: Re: [PATCH 0/3] drm/panfrost: Expose HW counters to userspace
Date: Mon, 13 May 2019 15:39:52 +0200 [thread overview]
Message-ID: <20190513153952.68adf596@collabora.com> (raw)
In-Reply-To: <028546a0-546a-b788-567a-a45623d16f5e@arm.com>
On Mon, 13 May 2019 13:48:08 +0100
Steven Price <steven.price@arm.com> wrote:
> On 12/05/2019 14:38, Boris Brezillon wrote:
> > On Sat, 11 May 2019 15:32:20 -0700
> > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >
> >> Hi all,
> >>
> >> As Steven Price explained, the "GPU top" kbase approach is often more
> >> useful and accurate than per-draw timing.
> >>
> >> For a 3D game inside a GPU-accelerated desktop, the games' counters
> >> *should* include desktop overhead. This external overhead does affect
> >> the game's performance, especially if the contexts are competing for
> >> resources like memory bandwidth. An isolated sample is easy to achieve
> >> running only the app of interest; in ideal conditions (zero-copy
> >> fullscreen), desktop interference is negligible.
> >>
> >> For driver developers, the system-wide measurements are preferable,
> >> painting a complete system picture and avoiding disruptions. There is no
> >> risk of confusion, as the driver developers understand how the counters
> >> are exposed. Further, benchmarks rendering direct to a GBM surface are
> >> available (glmark2-es2-drm), eliminating interference even with poor
> >> desktop performance.
> >>
> >> For app developers, the confusion of multi-context interference is
> >> unfortunate. Nevertheless, if enabling counters were to slow down an
> >> app, the confusion could be worse. Consider second-order changes in the
> >> app's performance characteristics due to slowdown: if techniques like
> >> dynamic resolution scaling are employed, the counters' results can be
> >> invalid. Likewise, even if the lower-performance counters are
> >> appropriate for purely graphical workloads, complex apps with variable
> >> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> >> confound counters. Low-overhead system-wide measurements mitigate these
> >> concerns.
> >
> > I'd just like to point out that dumping counters the way
> > mali_kbase/gator does likely has an impact on perfs (at least on some
> > GPUs) because of the clean+invalidate-cache that happens before (or
> > after, I don't remember) each dump. IIUC and this cache is actually
>
> The clean is required after the dump to ensure that the CPU can see the
> dumped counters (otherwise they could be stuck in the GPU's cache).
Just to make it clear, I was not questioning the need for a cache clean
here, I was simply pointing out that dumping counters is not free even
when we don't add those serialization points. Note that dumping
counters also has an impact on the cache-miss counter of future dumps,
but there's nothing we can do about that no matter the solution we go
for.
> Note
> that if you don't immediately need to observe the values the clean can
> be deferred. E.g. if each dump you target a different memory region then
> you could perform a flush only after several dumps.
Correct. Having a pool of dump buffers might be an option if we need to
minimize the impact of dumps at some point.
>
> > global and not a per address space thing (which would be possible if the
> > cache lines contain a tag attaching them to a specific address space),
> > that means all jobs running when the clean+invalidate happens will have
> > extra cache misses after each dump. Of course, that's not as invasive as
> > the full serialization that happens with my solution, but it's not free
> > either.
>
> Cache cleans (at the L2 cache level) are global. There are L1 caches
> (and TLBs) but they are not relevant to counter dumping.
Okay.
>
> In practice cache cleans are not that expensive most of the time. As you
> note - mali_kbase doesn't bother to avoid them when doing counter capture.
>
> It should also be unnecessary to "invalidate" - a clean should be
> sufficient. The invalidate is only required if e.g. MMU page tables have
> been modified as well.
I already dropped the invalidate in my patch as we only have a global
address space right now and the buffer used to store perfcnt values is
always the same. We'll have to rework that when support for per-context
address space is added.
> That would reduce the affect on currently running
> jobs. But I notice that mali_kbase doesn't appear to use the "clean
> only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being
> broken though.
'clean-only' seems to work fine on T860 at least.
>
> >>
> >> As Rob Clark suggested, system-wide counters could be exposed via a
> >> semi-standardized interface, perhaps within debugfs/sysfs. The interface
> >> could not be completely standard, as the list of counters exposed varies
> >> substantially by vendor and model. Nevertheless, the mechanics of
> >> discovering, enabling, reading, and disabling counters can be
> >> standardized, as can a small set of universally meaningful counters like
> >> total GPU utilization. This would permit a vendor-independent GPU top
> >> app as suggested, as is I believe currently possible with
> >> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
> >>
> >> It looks like this discussion is dormant. Could we try to get this
> >> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> >> to diagnose without access to the counters, so I'm eager for a mainline
> >> solution to be implemented.
> >
> > I spent a bit of time thinking about it and looking at different
> > solutions.
> >
> > debugfs/sysfs might not be the best solution, especially if we think
> > about the multi-user case (several instances of GPU perfmon tool
> > running in parallel), if we want it to work properly we need a way to
> > instantiate several perf monitors and let the driver add values to all
> > active perfmons everytime a dump happens (no matter who triggered the
> > dump). That's exactly what mali_kbase/gator does BTW. That's achievable
> > through debugs if we consider exposing a knob to instantiate such
> > perfmon instances, but that also means risking perfmon leaks if the
> > user does not take care of killing the perfmon it created when it's done
> > with it (or when it crashes). It might also prove hard to expose that to
> > non-root users in a secure way.
>
> Note that mali_kbase only has to maintain multiple sets of values
> because it provides backwards compatibility to an old version of the
> driver. You could maintain one set of counter values and simply ensure
> they are big enough not to wrap (or handle wrapping, but that is ugly
> and unlikely to happen). Very early versions of kbase simply exposed the
> hardware functionality (dump counters and reset) and therefore faced the
> issue that there could only ever be one user of counters at a time. This
> was "fixed" by emulating the existence of multiple interfaces (vinstr -
> "virtual" instrumentation). If I was redesigning it from scratch then
> I'd definitely remove the "reset" part of the interface and leave it for
> the clients to latch the first value (of each counter) and subtract that
> from subsequent counter reads.
Noted.
>
> Sadly the hardware doesn't help here and some (most?) GPU revisions will
> saturate the counters rather than wrapping. Hence the need to frequently
> poll and reset the counters, even if you only want the 'start'/'end' values.
Not sure wrapping would be better, as you can't tell how many times
you've reached the max val.
BTW, counters are reset every time we do a dump, right? If that's the
case, there's no risk to have end_val < start_val and the only risk is
to get inaccurate values when counter_val > U32_MAX, but reaching
U32_MAX already means that this event occurred a lot during the
monitoring period.
>
> Having only one set of counters might simplify the potential for "leaks"
> - although you still want to ensure that if nobody cares you stop
> collecting counters...
Handling that with a debugfs-based iface implies implementing some
heuristic like 'counters haven't been dumped for a long time, let's
stop monitoring them', and I'm not a big fan of such things.
This being said, I think I'll go for a simple debugfs-based iface to
unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
can agree on explicitly marking it as "unstable" so that when we settle
on a generic interface to expose such counters we can get rid of the
old one.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
next prev parent reply other threads:[~2019-05-13 13:39 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-04-04 15:20 [PATCH 0/3] drm/panfrost: Expose HW counters to userspace Boris Brezillon
2019-04-04 15:20 ` [PATCH 1/3] drm/panfrost: Move gpu_{write, read}() macros to panfrost_regs.h Boris Brezillon
2019-04-04 15:20 ` [PATCH 2/3] drm/panfrost: Expose HW counters to userspace Boris Brezillon
2019-04-04 15:41 ` Alyssa Rosenzweig
2019-04-04 18:17 ` Boris Brezillon
2019-04-04 22:40 ` Alyssa Rosenzweig
2019-04-05 15:36 ` Eric Anholt
2019-04-05 16:17 ` Alyssa Rosenzweig
2019-04-04 15:20 ` [PATCH 3/3] panfrost/drm: Define T860 perf counters Boris Brezillon
2019-04-05 15:20 ` [PATCH 0/3] drm/panfrost: Expose HW counters to userspace Steven Price
2019-04-05 16:33 ` Alyssa Rosenzweig
2019-04-05 17:40 ` Boris Brezillon
2019-04-05 17:43 ` Alyssa Rosenzweig
2019-04-30 12:42 ` Boris Brezillon
2019-04-30 13:10 ` Rob Clark
2019-04-30 15:49 ` Jordan Crouse
2019-05-12 13:40 ` Boris Brezillon
2019-05-13 15:00 ` Jordan Crouse
2019-05-01 17:12 ` Eric Anholt
2019-05-12 13:17 ` Boris Brezillon
2019-05-11 22:32 ` Alyssa Rosenzweig
2019-05-12 13:38 ` Boris Brezillon
2019-05-13 12:48 ` Steven Price
2019-05-13 13:39 ` Boris Brezillon [this message]
2019-05-13 14:13 ` Steven Price
2019-05-13 14:56 ` Alyssa Rosenzweig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190513153952.68adf596@collabora.com \
--to=boris.brezillon@collabora.com \
--cc=alyssa@rosenzweig.io \
--cc=dri-devel@lists.freedesktop.org \
--cc=emil.l.velikov@gmail.com \
--cc=kernel@collabora.com \
--cc=mark.a.janes@intel.com \
--cc=narmstrong@baylibre.com \
--cc=robh+dt@kernel.org \
--cc=steven.price@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.