From: Peter Zijlstra <peterz@infradead.org>
To: Robert Bragg <robert@sixbynine.org>
Cc: Mark Rutland <mark.rutland@arm.com>,
Matt Fleming <matt.fleming@intel.com>,
dri-devel@lists.freedesktop.org, David Airlie <airlied@linux.ie>,
intel-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Sourab Gupta <sourab.gupta@intel.com>,
linux-api@vger.kernel.org, Zheng Yan <zheng.z.yan@intel.com>,
Daniel Vetter <daniel.vetter@intel.com>,
Ingo Molnar <mingo@kernel.org>
Subject: Re: [RFC 0/6] Non perf based Gen Graphics OA unit driver
Date: Fri, 16 Oct 2015 11:43:06 +0200 [thread overview]
Message-ID: <20151016094306.GA3604@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <1443537549-6905-1-git-send-email-robert@sixbynine.org>
On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
>
> To review this work I think it will be relevant to have a good
> general familiarity with Gen graphics (e.g. thinking about the OA
> unit's interaction with the command streamer and execlist
> scheduling) as well as our userspace architecture and how we're
> consuming OA data within Mesa to implement the
> INTEL_performance_query extension.
>
> On the flip side here, its necessary to understand the perf
> userspace interface (for most this is hidden by tools so the details
> aren't common knowledge) as well as the internal design, considering
> that the PMU we're looking at seems to break several current design
> assumptions. I can only claim a limited familiarity with perf's
> design, just as a result of this work.
Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)
> - The current OA PMU driver breaks some significant design assumptions.
>
> Existing perf pmus are used for profiling work on a cpu and we're
> introducing the idea of _IS_DEVICE pmus with different security
> implications, the need to fake cpu-related data (such as user/kernel
> registers) to fit with perf's current design, and adding _DEVICE
> records as a way to forward device-specific status records.
There are more devices with counters on than GPUs, so I think it might
make sense to look at extending perf to better deal with this.
> The OA unit writes reports of counters into a circular buffer,
> without involvement from the CPU, making our PMU driver the first of
> a kind.
Agreed, this is somewhat 'odd' from where we are today.
> Perf supports groups of counters and allows those to be read via
> transactions internally but transactions currently seem designed to
> be explicitly initiated from the cpu (say in response to a userspace
> read()) and while we could pull a report out of the OA buffer we
> can't trigger a report from the cpu on demand.
>
> Related to being report based; the OA counters are configured in HW
> as a set while perf generally expects counter configurations to be
> orthogonal. Although counters can be associated with a group leader
> as they are opened, there's no clear precedent for being able to
> provide group-wide configuration attributes and no obvious solution
> as yet that's expected to be acceptable to upstream and meets our
> userspace needs.
I'm not entirely sure what you mean with group-wide configuration
attributes; could you elaborate?
> We currently avoid using perf's grouping feature
> and forward OA reports to userspace via perf's 'raw' sample field.
> This suits our userspace well considering how coupled the counters
> are when dealing with normalizing. It would be inconvenient to split
> counters up into separate events, only to require userspace to
> recombine them.
So IF you were using a group, a single read from the leader can return
you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
do that recombine.
Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
> Related to counter orthogonality; we can't time share the OA unit,
> while event scheduling is a central design idea within perf for
> allowing userspace to open + enable more events than can be
> configured in HW at any one time.
So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.
That said; earlier today I saw:
https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7
where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).
So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.
> The OA unit is not designed to
> allow re-configuration while in use. We can't reconfigure the OA
> unit without loosing internal OA unit state which we can't access
> explicitly to save and restore. Reconfiguring the OA unit is also
> relatively slow, involving ~100 register writes. From userspace Mesa
> also depends on a stable OA configuration when emitting
> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
> disabled while there are outstanding MI_RPC commands lest we hang
> the command streamer.
Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.
> - We may be making some technical compromises a.t.m for the sake of
> using perf.
>
> perf_event_open() requires events to either relate to a pid or a
> specific cpu core, while our device pmu relates to neither. Events
> opened with a pid will be automatically enabled/disabled according
> to the scheduling of that process - so not appropriate for us.
Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it; although I've no clue what form
that would need at this time.
> When
> an event is related to a cpu id, perf ensures pmu methods will be
> invoked via an inter process interrupt on that core. To avoid
> invasive changes our userspace opens OA perf events for a specific
> cpu.
Some of that might still make sense in the sense that GPUs are subject
to the NUMA topology of machines. I would think you would want most
such things to be done on the node the device is attached to.
Granted, this might not be a concern for Intel graphics, but it might be
relevant for some of the discrete GPUs.
> - I'm not confident our use case benefits much from building on perf:
>
> We aren't using existing perf based tooling with our PMU. Existing
> tools typically assume you're profiling work running on a cpu, e.g.
> expecting samples to be associated with instruction pointers and
> user/kernel registers and aiming to represent metrics in relation
> to application source code. We're forwarding fake register values
> and userspace needs needs to know how to decode the raw OA reports
> before anything can be reported to a user.
>
> With the buffering done by the OA unit I don't think we currently
> benefit from perf's mmapped circular buffer interface. We already
> have a decoupled producer and consumer and since we have to copy out
> of the OA buffer, it would work well for us to hide that copy in
> a simpler read() based interface.
>
>
> - Logistically it might be more practical to contain this to the
> graphics stack.
>
> It seems fair to consider that if we can't see a very compelling
> benefit to building on perf, then containing this work to
> drivers/gpu/drm/i915 may simplify the review process as well as
> future maintenance and development.
> Peter; I wonder if you would tend to agree too that it could make sense
> for us to go with our own interface here?
Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.
While I can certainly see the pain points and why you would rather not
deal with them. I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
WARNING: multiple messages have this Message-ID (diff)
From: Peter Zijlstra <peterz@infradead.org>
To: Robert Bragg <robert@sixbynine.org>
Cc: intel-gfx@lists.freedesktop.org,
Daniel Vetter <daniel.vetter@intel.com>,
Chris Wilson <chris@chris-wilson.co.uk>,
Sourab Gupta <sourab.gupta@intel.com>,
Zhenyu Wang <zhenyuw@linux.intel.com>,
Jani Nikula <jani.nikula@linux.intel.com>,
David Airlie <airlied@linux.ie>, Ingo Molnar <mingo@kernel.org>,
Kan Liang <kan.liang@intel.com>,
Alexander Shishkin <alexander.shishkin@linux.intel.com>,
Zheng Yan <zheng.z.yan@intel.com>,
Mark Rutland <mark.rutland@arm.com>,
Matt Fleming <matt.fleming@intel.com>,
dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org,
linux-api@vger.kernel.org
Subject: Re: [RFC 0/6] Non perf based Gen Graphics OA unit driver
Date: Fri, 16 Oct 2015 11:43:06 +0200 [thread overview]
Message-ID: <20151016094306.GA3604@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <1443537549-6905-1-git-send-email-robert@sixbynine.org>
On Tue, Sep 29, 2015 at 03:39:03PM +0100, Robert Bragg wrote:
> - We're bridging two complex architectures
>
> To review this work I think it will be relevant to have a good
> general familiarity with Gen graphics (e.g. thinking about the OA
> unit's interaction with the command streamer and execlist
> scheduling) as well as our userspace architecture and how we're
> consuming OA data within Mesa to implement the
> INTEL_performance_query extension.
>
> On the flip side here, its necessary to understand the perf
> userspace interface (for most this is hidden by tools so the details
> aren't common knowledge) as well as the internal design, considering
> that the PMU we're looking at seems to break several current design
> assumptions. I can only claim a limited familiarity with perf's
> design, just as a result of this work.
Right; but a little effort and patience on both sides should get us
there I think. At worst we'll both learn something new ;-)
> - The current OA PMU driver breaks some significant design assumptions.
>
> Existing perf pmus are used for profiling work on a cpu and we're
> introducing the idea of _IS_DEVICE pmus with different security
> implications, the need to fake cpu-related data (such as user/kernel
> registers) to fit with perf's current design, and adding _DEVICE
> records as a way to forward device-specific status records.
There are more devices with counters on than GPUs, so I think it might
make sense to look at extending perf to better deal with this.
> The OA unit writes reports of counters into a circular buffer,
> without involvement from the CPU, making our PMU driver the first of
> a kind.
Agreed, this is somewhat 'odd' from where we are today.
> Perf supports groups of counters and allows those to be read via
> transactions internally but transactions currently seem designed to
> be explicitly initiated from the cpu (say in response to a userspace
> read()) and while we could pull a report out of the OA buffer we
> can't trigger a report from the cpu on demand.
>
> Related to being report based; the OA counters are configured in HW
> as a set while perf generally expects counter configurations to be
> orthogonal. Although counters can be associated with a group leader
> as they are opened, there's no clear precedent for being able to
> provide group-wide configuration attributes and no obvious solution
> as yet that's expected to be acceptable to upstream and meets our
> userspace needs.
I'm not entirely sure what you mean with group-wide configuration
attributes; could you elaborate?
> We currently avoid using perf's grouping feature
> and forward OA reports to userspace via perf's 'raw' sample field.
> This suits our userspace well considering how coupled the counters
> are when dealing with normalizing. It would be inconvenient to split
> counters up into separate events, only to require userspace to
> recombine them.
So IF you were using a group, a single read from the leader can return
you a vector of all values (PERF_FORMAT_GROUP), this avoids having to
do that recombine.
Another option would be to view the arrival of an OA vector in the
datastream as an 'event' and generate a PERF_RECORD_READ in the perf
buffer (which again can use the GROUP vector format).
> Related to counter orthogonality; we can't time share the OA unit,
> while event scheduling is a central design idea within perf for
> allowing userspace to open + enable more events than can be
> configured in HW at any one time.
So we have other PMUs that cannot do this; Gen OA would not be unique in
this. Intel PT for example only allows a single active event.
That said; earlier today I saw:
https://www.youtube.com/watch?v=9J3BQcAeHpI&list=PLe6I3NKr-I4J2oLGXhGOeBMEjh8h10jT3&index=7
where exactly this feature was mentioned as not fitting well into the
existing GPU performance interfaces (GL_AMD_performance_monitor /
GL_INTEL_performance_query).
So there is hardware (Nvidia) out there that does support this. Also
mentioned was that this hardware has global and local counters, where
the local ones are specific to a rendering context. That is not unlike
the per-cpu / per-task stuff perf does.
> The OA unit is not designed to
> allow re-configuration while in use. We can't reconfigure the OA
> unit without loosing internal OA unit state which we can't access
> explicitly to save and restore. Reconfiguring the OA unit is also
> relatively slow, involving ~100 register writes. From userspace Mesa
> also depends on a stable OA configuration when emitting
> MI_REPORT_PERF_COUNT commands and importantly the OA unit can't be
> disabled while there are outstanding MI_RPC commands lest we hang
> the command streamer.
Right; see the PERF_PMU_CAP_EXCLUSIVE stuff.
> - We may be making some technical compromises a.t.m for the sake of
> using perf.
>
> perf_event_open() requires events to either relate to a pid or a
> specific cpu core, while our device pmu relates to neither. Events
> opened with a pid will be automatically enabled/disabled according
> to the scheduling of that process - so not appropriate for us.
Right; the traditional cpu/pid mapping doesn't work well for devices;
but maybe, with some work, we can create something like that
global/local render context from it; although I've no clue what form
that would need at this time.
> When
> an event is related to a cpu id, perf ensures pmu methods will be
> invoked via an inter process interrupt on that core. To avoid
> invasive changes our userspace opens OA perf events for a specific
> cpu.
Some of that might still make sense in the sense that GPUs are subject
to the NUMA topology of machines. I would think you would want most
such things to be done on the node the device is attached to.
Granted, this might not be a concern for Intel graphics, but it might be
relevant for some of the discrete GPUs.
> - I'm not confident our use case benefits much from building on perf:
>
> We aren't using existing perf based tooling with our PMU. Existing
> tools typically assume you're profiling work running on a cpu, e.g.
> expecting samples to be associated with instruction pointers and
> user/kernel registers and aiming to represent metrics in relation
> to application source code. We're forwarding fake register values
> and userspace needs needs to know how to decode the raw OA reports
> before anything can be reported to a user.
>
> With the buffering done by the OA unit I don't think we currently
> benefit from perf's mmapped circular buffer interface. We already
> have a decoupled producer and consumer and since we have to copy out
> of the OA buffer, it would work well for us to hide that copy in
> a simpler read() based interface.
>
>
> - Logistically it might be more practical to contain this to the
> graphics stack.
>
> It seems fair to consider that if we can't see a very compelling
> benefit to building on perf, then containing this work to
> drivers/gpu/drm/i915 may simplify the review process as well as
> future maintenance and development.
> Peter; I wonder if you would tend to agree too that it could make sense
> for us to go with our own interface here?
Sorry this took so long; this wanted a well considered response and
those tend to get delayed in light of 'urgent' stuff.
While I can certainly see the pain points and why you would rather not
deal with them. I think it would make Linux a better place if we could
manage to come up with a generic interface that would work for 'all'
GPUs (and possibly more devices).
next prev parent reply other threads:[~2015-10-16 9:43 UTC|newest]
Thread overview: 35+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-09-29 14:39 [RFC 0/6] Non perf based Gen Graphics OA unit driver Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-29 14:39 ` [RFC 1/6] drm/i915: Add i915 perf infrastructure Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-29 14:39 ` [RFC 2/6] drm/i915: rename OACONTROL GEN7_OACONTROL Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-29 14:39 ` [RFC 3/6] drm/i915: Add static '3D' Haswell OA unit config Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-29 14:39 ` [RFC 4/6] drm/i915: Add i915 perf event for Haswell OA unit Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-29 14:55 ` [Intel-gfx] " kbuild test robot
2015-09-29 14:55 ` kbuild test robot
2015-09-29 15:18 ` Peter Zijlstra
2015-09-29 15:18 ` Peter Zijlstra
2015-09-29 23:19 ` [kbuild-all] " Fengguang Wu
2015-09-29 23:19 ` [kbuild-all] [Intel-gfx] " Fengguang Wu
2015-09-29 14:39 ` [RFC 6/6] drm/i915: add oa_event_min_timer_exponent sysctl Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-30 3:23 ` [RFC 0/6] Non perf based Gen Graphics OA unit driver Zhenyu Wang
2015-09-30 3:23 ` Zhenyu Wang
[not found] ` <1443537549-6905-1-git-send-email-robert-St23OQVBDYPNLxjTenLetw@public.gmane.org>
2015-09-29 14:39 ` [RFC 5/6] drm/i915: Add dev.i915.perf_event_paranoid sysctl option Robert Bragg
2015-09-29 14:39 ` Robert Bragg
2015-09-30 8:30 ` [RFC 0/6] Non perf based Gen Graphics OA unit driver Chris Wilson
2015-09-30 8:30 ` Chris Wilson
2015-09-30 13:36 ` Robert Bragg
2015-10-16 9:43 ` Peter Zijlstra [this message]
2015-10-16 9:43 ` Peter Zijlstra
2015-10-16 10:02 ` Ingo Molnar
2015-10-16 10:02 ` Ingo Molnar
2015-10-16 10:33 ` Peter Zijlstra
2015-10-16 10:33 ` Peter Zijlstra
[not found] ` <20151016103345.GS3816-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2015-10-16 12:08 ` Robert Bragg
2015-10-16 12:08 ` Robert Bragg
2015-10-20 20:16 ` Robert Bragg
2015-10-20 20:16 ` Robert Bragg
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151016094306.GA3604@twins.programming.kicks-ass.net \
--to=peterz@infradead.org \
--cc=airlied@linux.ie \
--cc=alexander.shishkin@linux.intel.com \
--cc=daniel.vetter@intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=intel-gfx@lists.freedesktop.org \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=mark.rutland@arm.com \
--cc=matt.fleming@intel.com \
--cc=mingo@kernel.org \
--cc=robert@sixbynine.org \
--cc=sourab.gupta@intel.com \
--cc=zheng.z.yan@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.