Re: [PATCH 0/3] drm/panfrost: Expose HW counters to userspace

From: Boris Brezillon <boris.brezillon@collabora.com>
To: Steven Price <steven.price@arm.com>
Cc: Neil Armstrong <narmstrong@baylibre.com>,
	Emil Velikov <emil.l.velikov@gmail.com>,
	dri-devel@lists.freedesktop.org, Rob Herring <robh+dt@kernel.org>,
	Mark Janes <mark.a.janes@intel.com>,
	kernel@collabora.com, Alyssa Rosenzweig <alyssa@rosenzweig.io>
Subject: Re: [PATCH 0/3] drm/panfrost: Expose HW counters to userspace
Date: Mon, 13 May 2019 15:39:52 +0200	[thread overview]
Message-ID: <20190513153952.68adf596@collabora.com> (raw)
In-Reply-To: <028546a0-546a-b788-567a-a45623d16f5e@arm.com>

On Mon, 13 May 2019 13:48:08 +0100
Steven Price <steven.price@arm.com> wrote:

> On 12/05/2019 14:38, Boris Brezillon wrote:
> > On Sat, 11 May 2019 15:32:20 -0700
> > Alyssa Rosenzweig <alyssa@rosenzweig.io> wrote:
> >   
> >> Hi all,
> >>
> >> As Steven Price explained, the "GPU top" kbase approach is often more
> >> useful and accurate than per-draw timing. 
> >>
> >> For a 3D game inside a GPU-accelerated desktop, the games' counters
> >> *should* include desktop overhead. This external overhead does affect
> >> the game's performance, especially if the contexts are competing for
> >> resources like memory bandwidth. An isolated sample is easy to achieve
> >> running only the app of interest; in ideal conditions (zero-copy
> >> fullscreen), desktop interference is negligible. 
> >>
> >> For driver developers, the system-wide measurements are preferable,
> >> painting a complete system picture and avoiding disruptions. There is no
> >> risk of confusion, as the driver developers understand how the counters
> >> are exposed. Further, benchmarks rendering direct to a GBM surface are
> >> available (glmark2-es2-drm), eliminating interference even with poor
> >> desktop performance.
> >>
> >> For app developers, the confusion of multi-context interference is
> >> unfortunate. Nevertheless, if enabling counters were to slow down an
> >> app, the confusion could be worse. Consider second-order changes in the
> >> app's performance characteristics due to slowdown: if techniques like
> >> dynamic resolution scaling are employed, the counters' results can be
> >> invalid.  Likewise, even if the lower-performance counters are
> >> appropriate for purely graphical workloads, complex apps with variable
> >> CPU overhead (e.g. from an FPS-dependent physics engine) can further
> >> confound counters. Low-overhead system-wide measurements mitigate these
> >> concerns.  
> > 
> > I'd just like to point out that dumping counters the way
> > mali_kbase/gator does likely has an impact on perfs (at least on some
> > GPUs) because of the clean+invalidate-cache that happens before (or
> > after, I don't remember) each dump. IIUC and this cache is actually  
> 
> The clean is required after the dump to ensure that the CPU can see the
> dumped counters (otherwise they could be stuck in the GPU's cache).

Just to make it clear, I was not questioning the need for a cache clean
here, I was simply pointing out that dumping counters is not free even
when we don't add those serialization points. Note that dumping
counters also has an impact on the cache-miss counter of future dumps,
but there's nothing we can do about that no matter the solution we go
for.

> Note
> that if you don't immediately need to observe the values the clean can
> be deferred. E.g. if each dump you target a different memory region then
> you could perform a flush only after several dumps.

Correct. Having a pool of dump buffers might be an option if we need to
minimize the impact of dumps at some point.

> 
> > global and not a per address space thing (which would be possible if the
> > cache lines contain a tag attaching them to a specific address space),
> > that means all jobs running when the clean+invalidate happens will have
> > extra cache misses after each dump. Of course, that's not as invasive as
> > the full serialization that happens with my solution, but it's not free
> > either.  
> 
> Cache cleans (at the L2 cache level) are global. There are L1 caches
> (and TLBs) but they are not relevant to counter dumping.

Okay.

> 
> In practice cache cleans are not that expensive most of the time. As you
> note - mali_kbase doesn't bother to avoid them when doing counter capture.
> 
> It should also be unnecessary to "invalidate" - a clean should be
> sufficient. The invalidate is only required if e.g. MMU page tables have
> been modified as well.

I already dropped the invalidate in my patch as we only have a global
address space right now and the buffer used to store perfcnt values is
always the same. We'll have to rework that when support for per-context
address space is added.

> That would reduce the affect on currently running
> jobs. But I notice that mali_kbase doesn't appear to use the "clean
> only" (GPU_COMMAND_CLEAN_CACHES) option. I'm not aware of it being
> broken though.

'clean-only' seems to work fine on T860 at least.

> 
> >>
> >> As Rob Clark suggested, system-wide counters could be exposed via a
> >> semi-standardized interface, perhaps within debugfs/sysfs. The interface
> >> could not be completely standard, as the list of counters exposed varies
> >> substantially by vendor and model. Nevertheless, the mechanics of
> >> discovering, enabling, reading, and disabling counters can be
> >> standardized, as can a small set of universally meaningful counters like
> >> total GPU utilization. This would permit a vendor-independent GPU top
> >> app as suggested, as is I believe currently possible with
> >> vendor-specific downstream kernels (e.g. via Gator/Streamline for Mali)
> >>
> >> It looks like this discussion is dormant. Could we try to get this
> >> sorted? For Panfrost, I'm hitting GPU-side bottlenecks that I'm unable
> >> to diagnose without access to the counters, so I'm eager for a mainline
> >> solution to be implemented.  
> > 
> > I spent a bit of time thinking about it and looking at different
> > solutions.
> > 
> > debugfs/sysfs might not be the best solution, especially if we think
> > about the multi-user case (several instances of GPU perfmon tool
> > running in parallel), if we want it to work properly we need a way to
> > instantiate several perf monitors and let the driver add values to all
> > active perfmons everytime a dump happens (no matter who triggered the
> > dump). That's exactly what mali_kbase/gator does BTW. That's achievable
> > through debugs if we consider exposing a knob to instantiate such
> > perfmon instances, but that also means risking perfmon leaks if the
> > user does not take care of killing the perfmon it created when it's done
> > with it (or when it crashes). It might also prove hard to expose that to
> > non-root users in a secure way.  
> 
> Note that mali_kbase only has to maintain multiple sets of values
> because it provides backwards compatibility to an old version of the
> driver. You could maintain one set of counter values and simply ensure
> they are big enough not to wrap (or handle wrapping, but that is ugly
> and unlikely to happen). Very early versions of kbase simply exposed the
> hardware functionality (dump counters and reset) and therefore faced the
> issue that there could only ever be one user of counters at a time. This
> was "fixed" by emulating the existence of multiple interfaces (vinstr -
> "virtual" instrumentation). If I was redesigning it from scratch then
> I'd definitely remove the "reset" part of the interface and leave it for
> the clients to latch the first value (of each counter) and subtract that
> from subsequent counter reads.

Noted.

> 
> Sadly the hardware doesn't help here and some (most?) GPU revisions will
> saturate the counters rather than wrapping. Hence the need to frequently
> poll and reset the counters, even if you only want the 'start'/'end' values.

Not sure wrapping would be better, as you can't tell how many times
you've reached the max val.

BTW, counters are reset every time we do a dump, right? If that's the
case, there's no risk to have end_val < start_val and the only risk is
to get inaccurate values when counter_val > U32_MAX, but reaching
U32_MAX already means that this event occurred a lot during the
monitoring period.

> 
> Having only one set of counters might simplify the potential for "leaks"
> - although you still want to ensure that if nobody cares you stop
> collecting counters...

Handling that with a debugfs-based iface implies implementing some
heuristic like 'counters haven't been dumped for a long time, let's
stop monitoring them', and I'm not a big fan of such things.

This being said, I think I'll go for a simple debugfs-based iface to
unblock Alyssa. debugfs is not part of the stable-ABI and I guess we
can agree on explicitly marking it as "unstable" so that when we settle
on a generic interface to expose such counters we can get rid of the
old one.
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel