From: eranian@googlemail.com
To: linux-kernel@vger.kernel.org
Subject: [patch 23/24] perfmon: kernel documentation
Date: Tue, 25 Nov 2008 13:36:44 -0800 (PST)
Message-ID: <492c6fec.0405560a.726a.253d@mx.google.com>
This patch adds the perfmon interface documentation text file
under Documentation.
Signed-off-by: Stephane Eranian <eranian@gmail.com>
--
Index: o3/Documentation/perfmon.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/Documentation/perfmon.txt 2008-10-16 12:25:49.000000000 +0200
@@ -0,0 +1,206 @@
+ The perfmon hardware monitoring interface
+ ------------------------------------------
+ Stephane Eranian
+ <eranian@gmail.com>
+
+I/ Introduction
+
+ The perfmon interface provides access to the hardware performance counters
+ of major processors. Nowadays, all processors implement some flavor of
+ performance counters which capture micro-architectural level information
+ such as the number of elapsed cycles, number of cache misses, and so on.
+
+ The interface is implemented as a set of new system calls and a set of
+ config files in /sys.
+
+ It is possible to monitor a single thread or a CPU. In either mode,
+ applications can count or sample. System-wide monitoring is supported by
+ running a monitoring session on each CPU. The interface supports event-based
+ sampling where the sampling period is expressed as the number of occurrences
+ of an event, instead of just a timeout. This approach provides better
+ granularity and flexibility.
+
+ For performance reasons, it is possible to use a kernel-level sampling buffer
+ to minimize the overhead incurred by sampling. The format of the buffer,
+ what is recorded, how it is recorded, and how it is exported to users is
+ controlled by a kernel module called a sampling format. The current
+ implementation comes with a default format but it is possible to create
+ additional formats. There is a kernel registration interface for formats.
+ Each format is identified by a simple string which a tool can pass when a
+ monitoring session is created.
+
+ The interface also provides support for event sets and multiplexing to work
+ around hardware limitations in the number of available counters or in how
+ events can be combined. Each set defines as many counters as the hardware
+ can support. The kernel then multiplexes the sets. The interface supports
+ time-based switching but also overflow-based switching, i.e., after n
+ overflows of designated counters.
+
+ Applications never manipulate the actual performance counter registers.
+ Instead they see a logical Performance Monitoring Unit (PMU) composed of a
+ set of config registers (PMC) and a set of data registers (PMD). Note that
+ PMDs are not necessarily counters; they can be buffers. The logical PMU is
+ then mapped onto the actual PMU using a mapping table which is implemented
+ as a kernel module. The mapping is chosen once for each new processor. It is
+ visible in /sys/kernel/perfmon/pmu_desc. The kernel module is automatically
+ loaded on first use.
+
+ A monitoring session is uniquely identified by a file descriptor obtained
+ when the session is created. File sharing semantics apply to accesses to the
+ session inside a process. A session is never inherited across fork. The file
+ descriptor can be used to receive notifications when a counter overflows or
+ when the sampling buffer is full. It is possible to use poll/select on the
+ descriptor to wait for notifications from multiple sessions. Similarly, the
+ descriptor supports asynchronous notifications via SIGIO.
+
+ Counters are always exported as being 64 bits wide regardless of what the
+ underlying hardware implements.
+
+II/ Kernel compilation
+
+ To enable perfmon, you need to enable CONFIG_PERFMON and also some of the
+ model-specific PMU modules.
+
+III/ OProfile interactions
+
+ The set of features offered by perfmon is rich enough to support migrating
+ OProfile on top of it. That means that PMU programming and low-level
+ interrupt handling could be done by perfmon. The OProfile sampling buffer
+ management code in the kernel as well as how samples are exported to users
+ could remain through the use of a sampling format. This is how OProfile
+ works on Itanium.
+
+ The current interactions with OProfile are:
+ - On X86: Both subsystems can be compiled into the same kernel. There
+ is enforced mutual exclusion between the two subsystems. When
+ there is an OProfile session, no perfmon session can exist
+ and vice-versa.
+
+ - On IA-64: OProfile works on top of perfmon. OProfile being a
+ system-wide monitoring tool, the regular per-thread vs.
+ system-wide session restrictions apply.
+
+ - On PPC: no integration yet. Only one subsystem can be enabled.
+ - On MIPS: no integration yet. Only one subsystem can be enabled.
+
+IV/ User tools
+
+ We have released a simple monitoring tool to demonstrate the features of
+ the interface. The tool is called pfmon and it comes with a simple helper
+ library called libpfm. The library comes with a set of examples to show
+ how to use the kernel interface. Visit http://perfmon2.sf.net for details.
+
+ There may be other tools available for perfmon.
+
+V/ How to program?
+
+ The best way to learn how to program perfmon is to take a look at the
+ source code for the examples in libpfm. The source code is available from:
+
+ http://perfmon2.sf.net
+
+VI/ System calls overview
+
+ In this section, we describe the state of the interface as submitted to the
+ kernel. There are more extensions available, and we will update the section
+ as they get implemented in the upstream kernel.
+
+ The interface is implemented by the following system calls:
+
+ * int pfm_create(int flags, pfarg_sinfo_t *s);
+
+ This function creates a perfmon per-thread session.
+ The flags parameter is currently unused and must be set to 0.
+
+ Upon return and if s is not NULL, the kernel returns the list of available
+ PMC and PMD registers. Tools should not assume they have access to the
+ entire PMU; it may be shared with other kernel subsystems, e.g., on X86
+ with the NMI watchdog timer.
+
+ The function returns the file descriptor identifying the session.
+
+ * int pfm_write(int fd, int flags, int type, void *d, size_t sz);
+
+ This function is used to write PMU registers for the session identified
+ by fd.
+
+ The flags parameter is currently unused and must be set to 0.
+
+ The type reflects the type of registers to write and determines the type
+ of the d parameter. The following types are defined:
+
+ - PFM_RW_PMC: write PMC registers, expect pfarg_pmr_t pointer for d
+ - PFM_RW_PMD: write PMD registers, expect pfarg_pmr_t pointer for d
+
+ The type parameter is not a bitmask; only one type can be passed per call.
+
+ The sz parameter gives the size in bytes of the vector passed in d.
+
+ * int pfm_read(int fd, int flags, int type, void *d, size_t sz);
+
+ This function is used to read PMU registers for the session identified
+ by fd.
+
+ The values read are returned to the caller in the vector of elements
+ passed in d.
+
+ The flags parameter is currently unused and must be set to 0.
+
+ The type reflects the type of registers to read and determines the type
+ of the d parameter. The following types are supported:
+
+ - PFM_RW_PMD: read PMD registers, expect pfarg_pmr_t pointer for d
+
+ The type parameter is not a bitmask; only one type can be passed per call.
+
+ Reading of PMC registers is not allowed.
+
+ The sz parameter gives the size in bytes of the vector passed in d.
+
+
+ * int pfm_attach(int fd, int flags, int target);
+
+ This function is used to attach and detach the session to and from a
+ thread.
+
+ To attach, the thread is identified by target, which must have the
+ value returned by gettid() (not pthread_self()). For a single-threaded
+ process, that value is equal to the value returned by getpid().
+
+ To detach, the special target PFM_NO_TARGET must be passed.
+
+ The flags parameter is currently unused and must be set to 0.
+
+ The session is always attached as stopped, i.e., with monitoring
+ inactive. Monitoring is always stopped as a consequence of detaching.
+
+ * int pfm_set_state(int fd, int flags, int state);
+
+ This function is used to set the running state of the session. The state to
+ go to is indicated by state.
+
+ The following states are defined; only one can be specified at a time:
+
+ - PFM_ST_START: start monitoring
+ - PFM_ST_STOP: stop monitoring
+
+ The flags parameter is currently unused and must be set to 0.
+
+ * int close(int fd);
+
+ To destroy a session, the regular close() system call is used.
+
+
+VII/ /sys interface overview
+
+ Refer to Documentation/ABI/testing/sysfs-perfmon* for a detailed
+ description of the sysfs interface of perfmon2.
+
+VIII/ debugfs interface overview
+
+ Refer to Documentation/perfmon-debugfs.txt for a detailed description of the
+ debug and statistics interface of perfmon.
+
+IX/ Documentation
+
+ Visit http://perfmon2.sf.net
Index: o3/Documentation/ABI/testing/sysfs-perfmon
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/Documentation/ABI/testing/sysfs-perfmon 2008-10-16 12:25:18.000000000 +0200
@@ -0,0 +1,42 @@
+What: /sys/kernel/perfmon
+Date: Oct 2008
+KernelVersion: 2.6.27
+Contact: eranian@gmail.com
+
+Description: Provides the configuration interface for the perfmon subsystem.
+ The tree contains information about the detected hardware,
+ current state of the subsystem as well as some configuration
+ parameters.
+
+ The tree consists of the following entries:
+
+ /sys/kernel/perfmon/debug (read-write):
+
+ Enable perfmon debugging output. The traces are rate-limited
+ to avoid flooding the console. It is possible to change the
+ throttling via /proc/sys/kernel/printk_ratelimit.
+
+ The value is interpreted as a bitmask. Each bit enables a
+ particular type of debug message. Refer to the file
+ include/linux/perfmon_kern.h for more information.
+
+ /sys/kernel/perfmon/task_group (read-write):
+
+ User group allowed to create a per-thread context (session).
+ -1 means any group.
+
+ /sys/kernel/perfmon/task_sessions_count (read-only):
+
+ Number of per-thread contexts (sessions) currently attached
+ to threads.
+
+ /sys/kernel/perfmon/version (read-only):
+
+ Perfmon interface revision number.
+
+ /sys/kernel/perfmon/arg_mem_max (read-write):
+
+ Maximum size of vector arguments expressed in bytes.
+ It can be modified but must be at least a page.
+ Default: PAGE_SIZE
+
Index: o3/Documentation/ABI/testing/sysfs-perfmon-pmu
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ o3/Documentation/ABI/testing/sysfs-perfmon-pmu 2008-10-16 12:25:04.000000000 +0200
@@ -0,0 +1,48 @@
+What: /sys/kernel/perfmon/pmu_desc
+Date: Nov 2007
+KernelVersion: 2.6.24
+Contact: eranian@gmail.com
+
+Description: Provides information about the active PMU description
+ module. The module contains the mapping of the actual
+ performance counter registers onto the logical PMU exposed by
+ perfmon. There is at most one PMU description module loaded
+ at any time.
+
+ The sysfs PMU tree provides a description of the mapping for
+ each register. There is one subdir per config and data register,
+ along with an entry for the name of the PMU model.
+
+ The entries are as follows:
+
+ /sys/kernel/perfmon/pmu_desc/model (read-only):
+
+ Name of the PMU model in clear text, zero terminated.
+
+ Then, each logical PMU register, XX, gets a subtree with the
+ following entries:
+
+ /sys/kernel/perfmon/pmu_desc/pm*XX/addr (read-only):
+
+ The physical address or index of the actual underlying hardware
+ register. On Itanium, it corresponds to the index. On X86
+ processors, it is the actual MSR address.
+
+ /sys/kernel/perfmon/pmu_desc/pm*XX/dfl_val (read-only):
+
+ The default value of the register in hexadecimal.
+
+ /sys/kernel/perfmon/pmu_desc/pm*XX/name (read-only):
+
+ The name of the hardware register.
+
+ /sys/kernel/perfmon/pmu_desc/pm*XX/rsvd_msk (read-only):
+
+ Bitmask of reserved bits, i.e., bits which cannot be changed
+ by applications. When a bit is set, it means the corresponding
+ bit in the actual register is reserved.
+
+ /sys/kernel/perfmon/pmu_desc/pm*XX/width (read-only):
+
+ The width in bits of the register. This field is only
+ relevant for counter registers.
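
The pmu_desc entries described above are small text files, so a tool can
read them with ordinary file I/O. What follows is a minimal sketch in C, not
code taken from perfmon or libpfm: the helper name read_sysfs_line() is
hypothetical, and it assumes each attribute fits on a single line.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of perfmon or libpfm): read the first
 * line of a text attribute file into buf and strip the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty.
 */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (buf == NULL || len == 0)
        return -1;

    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* Cut the string at the first newline, if any. */
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A tool could, for example, call read_sysfs_line() on
/sys/kernel/perfmon/pmu_desc/model to obtain the name of the PMU model, and
treat a return value of -1 as "no PMU description module is loaded or
perfmon is not enabled".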
--