public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Lin Ming <ming.m.lin@intel.com>
To: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>, LKML <linux-kernel@vger.kernel.org>,
	Andi Kleen <andi@firstfloor.org>,
	Paul Mackerras <paulus@samba.org>,
	Stephane Eranian <eranian@googlemail.com>,
	Frederic Weisbecker <fweisbec@gmail.com>,
	Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>,
	Dan Terpstra <terpstra@eecs.utk.edu>,
	Philip Mucci <mucci@eecs.utk.edu>,
	Maynard Johnson <mpjohn@us.ibm.com>, Carl Love <cel@us.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Arnaldo Carvalho de Melo <acme@redhat.com>,
	Masami Hiramatsu <mhiramat@redhat.com>
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units
Date: Tue, 30 Mar 2010 15:42:11 +0800	[thread overview]
Message-ID: <1269934931.8575.6.camel@minggr.sh.intel.com> (raw)
In-Reply-To: <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>

Hi, Corey

How is this going now? Are you still working on this?
I'd like to help to add support for uncore, test, write code or anything
else.

Thanks,
Lin Ming

> 
> -----
> Intro
> -----
> One subject that hasn't been addressed since the introduction of
> perf_events in the Linux kernel is that of support for "uncore" or
> "nest" unit events.  Uncore is the term used by the Intel engineers
> for their off-core units but are still on the same die as the cores,
> and "nest" means exactly the same thing for IBM Power processor
> engineers.  I will use the term uncore for brevity and because it's in
> common parlance, but the issues and design possibilities below are
> relevant to both.  I will also broaden the term by stating that uncore
> will also refer to PMUs that are completely off of the processor chip
> altogether.
> 
> Contents
> --------
> 1. Why support PMUs in uncore units?  Is there anything interesting to look at?
> 2. How do uncore events differ from core events?
> 3. Why does a CPU need to be assigned to manage a particular uncore
> unit's events?
> 4. How do you encode uncore events?
> 5. How do you address a particular uncore PMU?
> 6. Event rotation issues with uncore PMUs
> 7. Other issues?
> 8. Feedback?
> 
> ----
> 1. Why support PMUs in uncore units?  Is there anything interesting to look at?
> ----
> 
> Today, many x86 chips contain uncore units, and we think that it's
> likely that the trend will continue, as more devices - I/O, memory
> interfaces, shared caches, accelerators, etc. - are integrated onto
> multi-core chips.  As these devices become more sophisticated and more
> workload is diverted off-core, engineers and performance analysts are
> going to want to look at what's happening in these units so that they
> can find bottlenecks.
> 
> In addition, we think that even off-chip I/O and interconnect devices
> are likely to gain PMUs because engineers will want to find
> bottlenecks in their massively parallel systems.
> 
> ----
> 2. How do uncore events differ from core events?
> ----
> 
> The main difference is that uncore events are mostly likely not going
> to be tied to a particular Linux task, or even a CPU context.  Uncore
> units are resources that are in some sense system-wide, though, they
> may not really be accessible system-wide in some architectures.  In
> the case of accelerators and I/O devices, it's likely they will run
> asynchronously from the cores, and thus keeping track of events on a
> per-task basis doesn't make a lot of sense.  The other existing mode
> in perf_events is a per-CPU context, and it turns out that this mode
> does match up with uncore units well, though the choice of which CPU
> to use to manage that uncore unit is going to need to be
> arch-dependent and may involve other issues as well, such as
> minimizing access latency between the uncore unit and the CPU which is
> managing it.
> 
> ----
> 3. Why does a CPU need to be assigned to manage a particular uncore
> unit's events?
> ----
> 
> * The control registers of the uncore unit's PMU need to be read and
> written, and that may be possible only from a subset of processors in
> the system.
> * A processor is needed to rotate the event list on the uncore unit on
> every tick for the purposes of event scheduling.
> * Because of access latency issues, we may want the CPU to be close in
> locality to the PMU.
> 
> It seems like a good idea to let the kernel decide which CPU to use to
> monitor a particular uncore event, based on the location of the uncore
> unit, and possibly current system load balance.  The user will not
> want to have to figure out this detailed information.
> 
> ----
> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the
> perf_event_attr struct using the existing PERF_TYPE_RAW encoding.  64
> bits are available in the config field, and that may be sufficient to
> support events on most systems. However, due to  the proliferation and
> added complexity of PMUs we envision, we might want to add another
> 64-bit config (perhaps call it config_extra or config2) field to
> encode any extra attributes that might be needed.  The exact encoding
> used, just as for the current encoding for core events, will be on a
> per-arch and possibly per-system basis.
> 
> ----
> 5. How do you address a particular uncore PMU?
> ----
> 
> This one is going to be very system- and arch-dependent, but it seems
> fairly clear that we need some sort of addressing scheme that can be
> system/arch-defined by the kernel.
> 
> From a hierarchical perspective, here's an example of possible uncore
> PMU locations in a large system:
> 
> 1) Per-core - units that are shared between all hardware threads in a core
> 2) Per-node - units that are shared between all cores in a node
> 3) Per-chip - units that are shared between all nodes in a chip
> 4) Per-blade - units that are shared between all chips on a blade
> 5) Per-rack - units that are shared between all blades in a rack
> 
> Addressing option 1)
> 
> Reuse the cpu argument: cpu would be interpreted differently if an
> uncore unit is specified (via the perf_event_attr struct's config
> field).
> 
> For the hypothetical system described above, we'd want to have an
> address that contains enough address bits for each of the above.  For
> example:
> 
> bits   field
> ------ -----
> 3..0   PMU number 0-15  /* specifies which of several identical PMUs
> being addressed */
> 7..4   core id 0-15
> 8..8   node id 0-1
> 11..9  chip id 0-7
> 16..12 blade id 0-31
> 23..17 rack id 0-128
> 
> These fields would be exposed via
> /usr/include/linux/perf_events_uncore_addr.h (for example).  How you
> actually assign these numbers to actual hardware is, again,
> system-design dependent, and may be influenced by the use of a
> hypervisor, or other software which allocates resources available to
> the system dynamically.
> 
> How does the user discover the mapping between the hardware made
> available to the system and the addresses shown above?  Again, this is
> system-dependent, and probably outside the scope of this proposal.  In
> other words, I don't know how to do this in a general way, though I
> could probably put something together for a particular system.
> 
> Addressing Option 2)
> 
> Have the kernel create nodes for each uncore PMU in
> /sys/devices/system or other pseudo file system, such as the existing
> /proc/device-tree on Power systems. /sys/devices/system or
> /proc/device-tree could be explored by the
> user tool, and the user could then specify the path of the requested
> PMU via a string which the kernel could interpret.  To be overly
> simplistic, something like
> "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1".  If we settled on
> a common tree root to use, we could specify only the relative path
> name, "blade4/cpu0/vectorcopro1".
> 
> One way to provide this extra "PMU path" argument to the
> sys_perf_event_open() would be to add a bit to the flags argument says
> we're adding a PMU path string onto the end of the argument list.
> 
> This path-string-based addressing option seems to more flexible in the
> long run, and does not have as serious of an issue in mapping PMUs to
> user space; the kernel essentially exposes to user space all of the
> available PMUs for the current partition.   This might create more
> work for the kernel side, but should make the system more transparent
> for user-space tools.  Another system- or at least arch-dependent tool
> would have to be written for user space to help users navigate the
> device tree to find the PMU they want to use.  I don't think it would
> make sense to build that capability into perf, because the software
> would be arch- or system-dependent.
> 
> It could be argued that we should use a common user space tree to
> represent PMUs for all architectures and systems, so that the
> arch-independent perf code would be able to display available uncore
> PMUs.  That may be a goal that's very hard to achieve because of the
> wide variation in architectures.  Any thoughts on that?
> 
> ----
> 6. Event rotation issues with uncore PMUs
> ----
> 
> Currently, the perf_events code rotates the set of events assigned to
> a CPU or task on every system tick, so that event scheduling
> collisions on a PMU are mitigated.  This turns out to cause problems
> for uncore units for two reasons - inefficiency and CPU load.
> 
> a) Rotation of a set of events across more than one PMU causes
> inefficient rotation.
> 
> Consider the following event list; the letter designates the PMU and
> the number is the event number on that PMU.
> A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5
> 
> after one rotation, you can see that the event list will be:
> 
> C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4
> 
> and then
> 
> C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3
> 
> Notice how the relative positions for the A and B PMU events haven't
> changed even after two (or even five) rotations, so they will schedule
> the events in the same order for some time.  This will skew the
> multiplexing so that some events will be scheduled much less often
> than they should or could be.
> 
> What we'd like to have happen is that events for each PMU be rotated
> in their own lists.  For example, before rotation:
> 
> A1 A2 A3
> B1 B2 B3 B4
> C1 C2 C3 C4 C5
> 
> After rotation:
> 
> A3 A1 A2
> B2 B3 B4 B1
> C2 C3 C4 C5 C1
> 
> We've got some ideas about how to make this happen, using either
> separate lists, or placing them on separate CPUs.
> 
> b) Access to some PMU uncore units may be quite slow due to the
> interconnect that is used.  This can place a burden on the CPU if it
> is done every system tick.
> 
> This can be addressed by keeping a counter, on a per-PMU context basis
> that reduces the rate of event rotations.  Setting the rotation period
> to three, for example, would cause event rotations in that context to
> happen on every third tick, instead of every tick.  We think that the
> kernel could measure the amount of time it is taking to do a rotate,
> and then dynamically decrease the rotation rate if it's taking too
> long; "rotation rate throttling" in other words.
> 
> ----
> 7. Other issues?
> ----
> 
> This section left blank for now.
> 
> ----
> 8. Feedback?
> ----
> 
> I'd appreciate any feedback you might have on this topic.  You can
> contact me directly at the email address below, or better yet, reply
> to LKML.
> 
> --
> Regards,
> 
> - Corey
> 
> Corey Ashford
> Software Engineer
> IBM Linux Technology Center, Linux Toolchain
> Beaverton, OR
> 503-578-3507
> cjashfor@us.ibm.com
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


  parent reply	other threads:[~2010-03-30  8:00 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-01-19 19:41 [RFC] perf_events: support for uncore a.k.a. nest units Corey Ashford
2010-01-20  0:44 ` Andi Kleen
2010-01-20  1:49   ` Corey Ashford
2010-01-20  9:35     ` Andi Kleen
2010-01-20 19:28       ` Corey Ashford
2010-01-20 13:34 ` Peter Zijlstra
2010-01-20 21:33   ` Peter Zijlstra
2010-01-20 23:23     ` Corey Ashford
2010-01-21  7:21       ` Ingo Molnar
2010-01-21 19:13         ` Corey Ashford
2010-01-21 19:28           ` Corey Ashford
2010-01-27 10:28             ` Ingo Molnar
2010-01-27 19:50               ` Corey Ashford
2010-01-28 10:57                 ` Peter Zijlstra
2010-01-28 18:00                   ` Corey Ashford
2010-01-28 19:06                     ` Peter Zijlstra
2010-01-28 19:44                       ` Corey Ashford
2010-01-28 22:08                       ` Corey Ashford
2010-01-29  9:52                         ` Peter Zijlstra
2010-01-29 23:05                           ` Corey Ashford
2010-01-30  8:42                             ` Peter Zijlstra
2010-02-01 19:39                               ` Corey Ashford
2010-02-01 19:54                                 ` Peter Zijlstra
2010-01-21  8:36       ` Peter Zijlstra
2010-01-21  8:47     ` stephane eranian
2010-01-21  8:59       ` Peter Zijlstra
2010-01-21  9:16         ` stephane eranian
2010-01-21  9:43         ` stephane eranian
     [not found] ` <d3f22a1003290213x7d7904an59d50eb6a8616133@mail.gmail.com>
2010-03-30  7:42   ` Lin Ming [this message]
2010-03-30 16:49     ` Corey Ashford
2010-03-30 17:15       ` Peter Zijlstra
2010-03-30 22:12         ` Corey Ashford
2010-03-31 14:01           ` Peter Zijlstra
2010-03-31 14:13             ` stephane eranian
2010-03-31 15:49             ` Maynard Johnson
2010-03-31 17:50             ` Corey Ashford
2010-04-15 21:16         ` Gary.Mohr
2010-04-16 13:24           ` Peter Zijlstra
2010-04-19  9:08             ` Lin Ming
2010-04-19  9:27               ` Peter Zijlstra
2010-04-20 11:55             ` Lin Ming
2010-04-20 12:03               ` Peter Zijlstra
2010-04-21  8:08                 ` Lin Ming
2010-04-21  8:32                   ` stephane eranian
2010-04-21  8:39                     ` Lin Ming
2010-04-21  8:44                       ` stephane eranian
2010-04-21  9:42                         ` Lin Ming
2010-04-21  9:57                           ` Peter Zijlstra
2010-04-21 22:12                             ` Lin Ming
2010-04-21 14:22                               ` Peter Zijlstra
2010-04-21 22:38                                 ` Lin Ming
2010-04-21 14:53                                   ` Peter Zijlstra
2010-03-30 21:28       ` stephane eranian
2010-03-30 23:11         ` Corey Ashford
2010-03-31 13:43           ` stephane eranian

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1269934931.8575.6.camel@minggr.sh.intel.com \
    --to=ming.m.lin@intel.com \
    --cc=acme@redhat.com \
    --cc=andi@firstfloor.org \
    --cc=cel@us.ibm.com \
    --cc=cjashfor@linux.vnet.ibm.com \
    --cc=eranian@googlemail.com \
    --cc=fweisbec@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhiramat@redhat.com \
    --cc=mingo@elte.hu \
    --cc=mpjohn@us.ibm.com \
    --cc=mucci@eecs.utk.edu \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=terpstra@eecs.utk.edu \
    --cc=xiaoguangrong@cn.fujitsu.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox