From: Arnaldo Carvalho de Melo <acme@kernel.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: linux-kernel@vger.kernel.org,
Adrian Hunter <adrian.hunter@intel.com>,
Andi Kleen <ak@linux.intel.com>,
Mathieu Poirier <mathieu.poirier@linaro.org>,
Pawel Moll <pawel.moll@arm.com>,
Stephane Eranian <eranian@google.com>,
Arnaldo Carvalho de Melo <acme@redhat.com>
Subject: [PATCH 14/18] perf: Add PERF_RECORD_SWITCH to indicate context switches
Date: Thu, 23 Jul 2015 22:58:27 -0300 [thread overview]
Message-ID: <1437703111-4930-15-git-send-email-acme@kernel.org> (raw)
In-Reply-To: <1437703111-4930-1-git-send-email-acme@kernel.org>
From: Adrian Hunter <adrian.hunter@intel.com>
There are already two events for context switches, namely the tracepoint
sched:sched_switch and the software event context_switches.
Unfortunately neither are suitable for use by non-privileged users for
the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
context switch.
Tracepoints are no good at all for non-privileged users because they
need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.
On the other hand, kernel software events need either CAP_SYS_ADMIN or
/proc/sys/kernel/perf_event_paranoid <= 1.
Now many distributions do default perf_event_paranoid to 1 making
context_switches a contender, except it has another problem (which is
also shared with sched:sched_switch) which is that it happens before
perf schedules events out instead of after perf schedules events in.
Whereas a privileged user can see all the events anyway, a
non-privileged user only sees events for their own processes, in other
words they see when their process was scheduled out not when it was
scheduled in. That presents two problems to use the event:
1. the information comes too late, so tools have to look ahead in the
event stream to find out what the current state is
2. if they are unlucky tracing might have stopped before the
context-switches event is recorded.
This new PERF_RECORD_SWITCH event does not have those problems
and it also has a couple of other small advantages.
It is easier to use because it is an auxiliary event (like mmap, comm
and task events) which can be enabled by setting a single bit. It is
smaller than sched:sched_switch and easier to parse.
To make the event useful for privileged users also, if the
context is cpu-wide then the event record will be
PERF_RECORD_SWITCH_CPU_WIDE which is the same as
PERF_RECORD_SWITCH except it also provides the next or
previous pid/tid.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
include/uapi/linux/perf_event.h | 31 +++++++++++-
kernel/events/core.c | 103 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 133 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d97f84c080da..022d0acf7df0 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -330,7 +330,8 @@ struct perf_event_attr {
mmap2 : 1, /* include mmap with inode data */
comm_exec : 1, /* flag comm events that are due to an exec */
use_clockid : 1, /* use @clockid for time fields */
- __reserved_1 : 38;
+ context_switch : 1, /* context switch data */
+ __reserved_1 : 37;
union {
__u32 wakeup_events; /* wakeup every n events */
@@ -572,9 +573,11 @@ struct perf_event_mmap_page {
/*
* PERF_RECORD_MISC_MMAP_DATA and PERF_RECORD_MISC_COMM_EXEC are used on
* different events so can reuse the same bit position.
+ * Ditto PERF_RECORD_MISC_SWITCH_OUT.
*/
#define PERF_RECORD_MISC_MMAP_DATA (1 << 13)
#define PERF_RECORD_MISC_COMM_EXEC (1 << 13)
+#define PERF_RECORD_MISC_SWITCH_OUT (1 << 13)
/*
* Indicates that the content of PERF_SAMPLE_IP points to
* the actual instruction that triggered the event. See also
@@ -818,6 +821,32 @@ enum perf_event_type {
*/
PERF_RECORD_LOST_SAMPLES = 13,
+ /*
+ * Records a context switch in or out (flagged by
+ * PERF_RECORD_MISC_SWITCH_OUT). See also
+ * PERF_RECORD_SWITCH_CPU_WIDE.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_SWITCH = 14,
+
+ /*
+ * CPU-wide version of PERF_RECORD_SWITCH with next_prev_pid and
+ * next_prev_tid that are the next (switching out) or previous
+ * (switching in) pid/tid.
+ *
+ * struct {
+ * struct perf_event_header header;
+ * u32 next_prev_pid;
+ * u32 next_prev_tid;
+ * struct sample_id sample_id;
+ * };
+ */
+ PERF_RECORD_SWITCH_CPU_WIDE = 15,
+
PERF_RECORD_MAX, /* non-ABI */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae3419b99..ce21143c0d9e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -163,6 +163,7 @@ static atomic_t nr_mmap_events __read_mostly;
static atomic_t nr_comm_events __read_mostly;
static atomic_t nr_task_events __read_mostly;
static atomic_t nr_freq_events __read_mostly;
+static atomic_t nr_switch_events __read_mostly;
static LIST_HEAD(pmus);
static DEFINE_MUTEX(pmus_lock);
@@ -2619,6 +2620,9 @@ static void perf_pmu_sched_task(struct task_struct *prev,
local_irq_restore(flags);
}
+static void perf_event_switch(struct task_struct *task,
+ struct task_struct *next_prev, bool sched_in);
+
#define for_each_task_context_nr(ctxn) \
for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
@@ -2641,6 +2645,9 @@ void __perf_event_task_sched_out(struct task_struct *task,
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(task, next, false);
+ if (atomic_read(&nr_switch_events))
+ perf_event_switch(task, next, false);
+
for_each_task_context_nr(ctxn)
perf_event_context_sched_out(task, ctxn, next);
@@ -2831,6 +2838,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
perf_cgroup_sched_in(prev, task);
+ if (atomic_read(&nr_switch_events))
+ perf_event_switch(task, prev, true);
+
if (__this_cpu_read(perf_sched_cb_usages))
perf_pmu_sched_task(prev, task, true);
}
@@ -3454,6 +3464,10 @@ static void unaccount_event(struct perf_event *event)
atomic_dec(&nr_task_events);
if (event->attr.freq)
atomic_dec(&nr_freq_events);
+ if (event->attr.context_switch) {
+ static_key_slow_dec_deferred(&perf_sched_events);
+ atomic_dec(&nr_switch_events);
+ }
if (is_cgroup_event(event))
static_key_slow_dec_deferred(&perf_sched_events);
if (has_branch_stack(event))
@@ -5982,6 +5996,91 @@ void perf_log_lost_samples(struct perf_event *event, u64 lost)
}
/*
+ * context_switch tracking
+ */
+
+struct perf_switch_event {
+ struct task_struct *task;
+ struct task_struct *next_prev;
+
+ struct {
+ struct perf_event_header header;
+ u32 next_prev_pid;
+ u32 next_prev_tid;
+ } event_id;
+};
+
+static int perf_event_switch_match(struct perf_event *event)
+{
+ return event->attr.context_switch;
+}
+
+static void perf_event_switch_output(struct perf_event *event, void *data)
+{
+ struct perf_switch_event *se = data;
+ struct perf_output_handle handle;
+ struct perf_sample_data sample;
+ int ret;
+
+ if (!perf_event_switch_match(event))
+ return;
+
+ /* Only CPU-wide events are allowed to see next/prev pid/tid */
+ if (event->ctx->task) {
+ se->event_id.header.type = PERF_RECORD_SWITCH;
+ se->event_id.header.size = sizeof(se->event_id.header);
+ } else {
+ se->event_id.header.type = PERF_RECORD_SWITCH_CPU_WIDE;
+ se->event_id.header.size = sizeof(se->event_id);
+ se->event_id.next_prev_pid =
+ perf_event_pid(event, se->next_prev);
+ se->event_id.next_prev_tid =
+ perf_event_tid(event, se->next_prev);
+ }
+
+ perf_event_header__init_id(&se->event_id.header, &sample, event);
+
+ ret = perf_output_begin(&handle, event, se->event_id.header.size);
+ if (ret)
+ return;
+
+ if (event->ctx->task)
+ perf_output_put(&handle, se->event_id.header);
+ else
+ perf_output_put(&handle, se->event_id);
+
+ perf_event__output_id_sample(event, &handle, &sample);
+
+ perf_output_end(&handle);
+}
+
+static void perf_event_switch(struct task_struct *task,
+ struct task_struct *next_prev, bool sched_in)
+{
+ struct perf_switch_event switch_event;
+
+ /* N.B. caller checks nr_switch_events != 0 */
+
+ switch_event = (struct perf_switch_event){
+ .task = task,
+ .next_prev = next_prev,
+ .event_id = {
+ .header = {
+ /* .type */
+ .misc = sched_in ? 0 : PERF_RECORD_MISC_SWITCH_OUT,
+ /* .size */
+ },
+ /* .next_prev_pid */
+ /* .next_prev_tid */
+ },
+ };
+
+ perf_event_aux(perf_event_switch_output,
+ &switch_event,
+ NULL);
+}
+
+/*
* IRQ throttle logging
*/
@@ -7479,6 +7578,10 @@ static void account_event(struct perf_event *event)
if (atomic_inc_return(&nr_freq_events) == 1)
tick_nohz_full_kick_all();
}
+ if (event->attr.context_switch) {
+ atomic_inc(&nr_switch_events);
+ static_key_slow_inc(&perf_sched_events.key);
+ }
if (has_branch_stack(event))
static_key_slow_inc(&perf_sched_events.key);
if (is_cgroup_event(event))
--
2.1.0
next prev parent reply other threads:[~2015-07-24 2:05 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-24 1:58 [GIT PULL 00/18] perf/core improvements and fixes Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 01/18] perf test: Check for refcnt in thread_map test Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 02/18] perf evlist: Force perf_evlist__set_maps to propagate maps through events Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 03/18] perf evlist: Use bool instead of target argument in propagate_maps() Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 04/18] perf evlist: Tolerate NULL maps in propagate_maps Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 05/18] perf header: Use argv style storage for cmdline feature data Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 06/18] perf symbols: Add front end cache for DSO symbol lookup Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 07/18] perf symbols: Introduce map__is_(kernel,kmodule)() Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 08/18] tools lib traceevent: Allow setting an alternative symbol resolver Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 09/18] perf symbols: Provide libtraceevent callback to resolve kernel symbols Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 10/18] perf trace: Provide libtracevent with a kernel symbol resolver Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 11/18] perf script: Switch from perf.data's kallsyms to perf's " Arnaldo Carvalho de Melo
2015-08-03 17:41 ` Jiri Olsa
2015-08-03 19:00 ` Arnaldo Carvalho de Melo
2015-08-03 19:27 ` Arnaldo Carvalho de Melo
2015-08-03 20:12 ` Jiri Olsa
2015-07-24 1:58 ` [PATCH 12/18] perf tools: Stop reading the kallsyms data from perf.data Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 13/18] perf tools: Stop copying kallsyms into the perf.data file header Arnaldo Carvalho de Melo
2015-07-24 1:58 ` Arnaldo Carvalho de Melo [this message]
2015-07-24 1:58 ` [PATCH 15/18] perf tools: Add new PERF_RECORD_SWITCH event Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 16/18] perf record: Add option --switch-events to select PERF_RECORD_SWITCH events Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 17/18] perf script: Don't assume evsel position of tracking events Arnaldo Carvalho de Melo
2015-07-24 1:58 ` [PATCH 18/18] perf script: Add option --show-switch-events Arnaldo Carvalho de Melo
2015-07-27 15:58 ` [GIT PULL 00/18] perf/core improvements and fixes Ingo Molnar
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1437703111-4930-15-git-send-email-acme@kernel.org \
--to=acme@kernel.org \
--cc=acme@redhat.com \
--cc=adrian.hunter@intel.com \
--cc=ak@linux.intel.com \
--cc=eranian@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mathieu.poirier@linaro.org \
--cc=mingo@kernel.org \
--cc=pawel.moll@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.