* [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2]
@ 2009-05-22 4:17 Paul Mackerras
2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
` (2 more replies)
0 siblings, 3 replies; 32+ messages in thread
From: Paul Mackerras @ 2009-05-22 4:17 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner
This replaces the struct perf_counter_context in the task_struct with
a pointer to a dynamically allocated perf_counter_context struct. The
main reason for doing this is to allow us to transfer a
perf_counter_context from one task to another when we do lazy PMU
switching in a later patch.
This has a few side-benefits: the task_struct becomes a little smaller,
we save some memory because only tasks that have perf_counters attached
get a perf_counter_context allocated for them, and we can remove the
inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
end up recompiling nearly everything whenever perf_counter.h changes.
The perf_counter_context structures are reference-counted and freed
when the last reference is dropped. A context can have references
from its task and the counters on its task. Counters can outlive the
task so it is possible that a context will be freed well after its
task has exited.
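
In sketch form, the two refcounting helpers added below and the rule they
implement look like this (condensed from the diff; the counter itself is
still freed via RCU as before):

	static void get_ctx(struct perf_counter_context *ctx)
	{
		atomic_inc(&ctx->refcount);
	}

	static void put_ctx(struct perf_counter_context *ctx)
	{
		if (atomic_dec_and_test(&ctx->refcount))
			kfree(ctx);
	}

	/*
	 * One reference belongs to the task (taken when the context is
	 * allocated, dropped in perf_counter_exit_task) and one to each
	 * counter attached to the context (taken in perf_counter_alloc,
	 * dropped from free_counter_rcu), which is why a context can be
	 * freed well after its task has exited.
	 */
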
Contexts are allocated on fork if the parent had a context, or
otherwise the first time that a per-task counter is created on a task.
In the latter case, we set the context pointer in the task struct
locklessly using an atomic compare-and-exchange operation in case we
raced with some other task in creating a context for the subject task.
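
The resulting allocation path in find_get_context looks roughly like this
(simplified from the hunk below; the ptrace permission check and error
handling are omitted):

	ctx = task->perf_counter_ctxp;
	if (!ctx) {
		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
		__perf_counter_init_context(ctx, task);	/* refcount starts at 1 */
		/*
		 * Make sure other cpus see the initialized fields before
		 * they can see task->perf_counter_ctxp pointing at them.
		 */
		smp_wmb();
		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
		if (tctx) {
			/* Lost the race: use the context the winner installed. */
			kfree(ctx);
			ctx = tctx;
		}
	}
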
This also removes the task pointer from the perf_counter struct. The
task pointer was not used anywhere and would make it harder to move a
context from one task to another. Anything that needed to know which
task a counter was attached to was already using counter->ctx->task.
The __perf_counter_init_context function moves up in perf_counter.c
so that it can be called from find_get_context, and now initializes
the refcount, but is otherwise unchanged.
We were potentially calling list_del_counter twice: once from
__perf_counter_exit_task when the task exits and once from
__perf_counter_remove_from_context when the counter's fd gets closed.
This adds a check in list_del_counter so it doesn't do anything if
the counter has already been removed from the lists.
Since perf_counter_task_sched_in doesn't do anything if the task doesn't
have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
__perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
the case where the current task adds the first counter to itself and
thus creates a context for itself.
This also adds similar code to __perf_counter_enable to handle the
analogous situation, which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.
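
Both functions end up with the same guard at the top; in sketch form
(taken from the two hunks below):

	if (ctx->task && cpuctx->task_ctx != ctx) {
		/* Not installed on this cpu ... */
		if (cpuctx->task_ctx || ctx->task != current)
			return;		/* ... and not our problem here */
		/* current's freshly created context: install it now */
		cpuctx->task_ctx = ctx;
	}
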
Signed-off-by: Paul Mackerras <paulus@samba.org>
---
arch/x86/kernel/apic/apic.c | 1 +
include/linux/init_task.h | 13 ---
include/linux/perf_counter.h | 4 +-
include/linux/sched.h | 6 +-
kernel/exit.c | 3 +-
kernel/fork.c | 1 +
kernel/perf_counter.c | 218 ++++++++++++++++++++++++++----------------
7 files changed, 145 insertions(+), 101 deletions(-)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e9021a9..b4f6440 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -14,6 +14,7 @@
* Mikael Pettersson : PM converted to driver model.
*/
+#include <linux/perf_counter.h>
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/acpi_pmtmr.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 503afaa..d87247d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -108,18 +108,6 @@ extern struct group_info init_groups;
extern struct cred init_cred;
-#ifdef CONFIG_PERF_COUNTERS
-# define INIT_PERF_COUNTERS(tsk) \
- .perf_counter_ctx.counter_list = \
- LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list), \
- .perf_counter_ctx.event_list = \
- LIST_HEAD_INIT(tsk.perf_counter_ctx.event_list), \
- .perf_counter_ctx.lock = \
- __SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock),
-#else
-# define INIT_PERF_COUNTERS(tsk)
-#endif
-
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -183,7 +171,6 @@ extern struct cred init_cred;
}, \
.dirties = INIT_PROP_LOCAL_SINGLE(dirties), \
INIT_IDS \
- INIT_PERF_COUNTERS(tsk) \
INIT_TRACE_IRQFLAGS \
INIT_LOCKDEP \
INIT_FTRACE_GRAPH \
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index f612941..0713090 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -449,7 +449,6 @@ struct perf_counter {
struct hw_perf_counter hw;
struct perf_counter_context *ctx;
- struct task_struct *task;
struct file *filp;
struct perf_counter *parent;
@@ -498,7 +497,6 @@ struct perf_counter {
* Used as a container for task counters and CPU counters as well:
*/
struct perf_counter_context {
-#ifdef CONFIG_PERF_COUNTERS
/*
* Protect the states of the counters in the list,
* nr_active, and the list:
@@ -516,6 +514,7 @@ struct perf_counter_context {
int nr_counters;
int nr_active;
int is_active;
+ atomic_t refcount;
struct task_struct *task;
/*
@@ -523,7 +522,6 @@ struct perf_counter_context {
*/
u64 time;
u64 timestamp;
-#endif
};
/**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff59d12..9714d45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,7 +71,6 @@ struct sched_param {
#include <linux/path.h>
#include <linux/compiler.h>
#include <linux/completion.h>
-#include <linux/perf_counter.h>
#include <linux/pid.h>
#include <linux/percpu.h>
#include <linux/topology.h>
@@ -99,6 +98,7 @@ struct robust_list_head;
struct bio;
struct bts_tracer;
struct fs_struct;
+struct perf_counter_context;
/*
* List of flags we want to share for kernel threads,
@@ -1387,7 +1387,9 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
#endif
- struct perf_counter_context perf_counter_ctx;
+#ifdef CONFIG_PERF_COUNTERS
+ struct perf_counter_context *perf_counter_ctxp;
+#endif
#ifdef CONFIG_NUMA
struct mempolicy *mempolicy;
short il_next;
diff --git a/kernel/exit.c b/kernel/exit.c
index f9dfedd..99ad406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
#include <linux/tracehook.h>
#include <linux/fs_struct.h>
#include <linux/init_task.h>
+#include <linux/perf_counter.h>
#include <trace/sched.h>
#include <asm/uaccess.h>
@@ -159,7 +160,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
#ifdef CONFIG_PERF_COUNTERS
- WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));
+ WARN_ON_ONCE(tsk->perf_counter_ctxp);
#endif
trace_sched_process_free(tsk);
put_task_struct(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index d32fef4..e72a09f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -63,6 +63,7 @@
#include <linux/fs_struct.h>
#include <trace/sched.h>
#include <linux/magic.h>
+#include <linux/perf_counter.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 08584c1..06ea3ea 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -97,6 +97,17 @@ void perf_enable(void)
hw_perf_enable();
}
+static void get_ctx(struct perf_counter_context *ctx)
+{
+ atomic_inc(&ctx->refcount);
+}
+
+static void put_ctx(struct perf_counter_context *ctx)
+{
+ if (atomic_dec_and_test(&ctx->refcount))
+ kfree(ctx);
+}
+
static void
list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
{
@@ -118,11 +129,17 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
ctx->nr_counters++;
}
+/*
+ * Remove a counter from the lists for its context.
+ * Must be called with counter->mutex and ctx->mutex held.
+ */
static void
list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
{
struct perf_counter *sibling, *tmp;
+ if (list_empty(&counter->list_entry))
+ return;
ctx->nr_counters--;
list_del_init(&counter->list_entry);
@@ -216,8 +233,6 @@ static void __perf_counter_remove_from_context(void *info)
counter_sched_out(counter, cpuctx, ctx);
- counter->task = NULL;
-
list_del_counter(counter, ctx);
if (!ctx->task) {
@@ -279,7 +294,6 @@ retry:
*/
if (!list_empty(&counter->list_entry)) {
list_del_counter(counter, ctx);
- counter->task = NULL;
}
spin_unlock_irq(&ctx->lock);
}
@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
* If this is a task context, we need to check whether it is
* the current task context of this cpu. If not it has been
* scheduled out before the smp call arrived.
+ * Or possibly this is the right context but it isn't
+ * on this cpu because it had no counters.
*/
- if (ctx->task && cpuctx->task_ctx != ctx)
- return;
+ if (ctx->task && cpuctx->task_ctx != ctx) {
+ if (cpuctx->task_ctx || ctx->task != current)
+ return;
+ cpuctx->task_ctx = ctx;
+ }
spin_lock_irqsave(&ctx->lock, flags);
+ ctx->is_active = 1;
update_context_time(ctx);
/*
@@ -653,7 +673,6 @@ perf_install_in_context(struct perf_counter_context *ctx,
return;
}
- counter->task = task;
retry:
task_oncpu_function_call(task, __perf_install_in_context,
counter);
@@ -693,10 +712,14 @@ static void __perf_counter_enable(void *info)
* If this is a per-task counter, need to check whether this
* counter's task is the current task on this cpu.
*/
- if (ctx->task && cpuctx->task_ctx != ctx)
- return;
+ if (ctx->task && cpuctx->task_ctx != ctx) {
+ if (cpuctx->task_ctx || ctx->task != current)
+ return;
+ cpuctx->task_ctx = ctx;
+ }
spin_lock_irqsave(&ctx->lock, flags);
+ ctx->is_active = 1;
update_context_time(ctx);
counter->prev_state = counter->state;
@@ -852,10 +875,10 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
void perf_counter_task_sched_out(struct task_struct *task, int cpu)
{
struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
- struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter_context *ctx = task->perf_counter_ctxp;
struct pt_regs *regs;
- if (likely(!cpuctx->task_ctx))
+ if (likely(!ctx || !cpuctx->task_ctx))
return;
update_context_time(ctx);
@@ -871,6 +894,8 @@ static void __perf_counter_task_sched_out(struct perf_counter_context *ctx)
{
struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+ if (!cpuctx->task_ctx)
+ return;
__perf_counter_sched_out(ctx, cpuctx);
cpuctx->task_ctx = NULL;
}
@@ -969,8 +994,10 @@ __perf_counter_sched_in(struct perf_counter_context *ctx,
void perf_counter_task_sched_in(struct task_struct *task, int cpu)
{
struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
- struct perf_counter_context *ctx = &task->perf_counter_ctx;
+ struct perf_counter_context *ctx = task->perf_counter_ctxp;
+ if (likely(!ctx))
+ return;
__perf_counter_sched_in(ctx, cpuctx, cpu);
cpuctx->task_ctx = ctx;
}
@@ -985,11 +1012,11 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
int perf_counter_task_disable(void)
{
struct task_struct *curr = current;
- struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter_context *ctx = curr->perf_counter_ctxp;
struct perf_counter *counter;
unsigned long flags;
- if (likely(!ctx->nr_counters))
+ if (!ctx || !ctx->nr_counters)
return 0;
local_irq_save(flags);
@@ -1020,12 +1047,12 @@ int perf_counter_task_disable(void)
int perf_counter_task_enable(void)
{
struct task_struct *curr = current;
- struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+ struct perf_counter_context *ctx = curr->perf_counter_ctxp;
struct perf_counter *counter;
unsigned long flags;
int cpu;
- if (likely(!ctx->nr_counters))
+ if (!ctx || !ctx->nr_counters)
return 0;
local_irq_save(flags);
@@ -1128,19 +1155,23 @@ void perf_counter_task_tick(struct task_struct *curr, int cpu)
return;
cpuctx = &per_cpu(perf_cpu_context, cpu);
- ctx = &curr->perf_counter_ctx;
+ ctx = curr->perf_counter_ctxp;
perf_adjust_freq(&cpuctx->ctx);
- perf_adjust_freq(ctx);
+ if (ctx)
+ perf_adjust_freq(ctx);
perf_counter_cpu_sched_out(cpuctx);
- __perf_counter_task_sched_out(ctx);
+ if (ctx)
+ __perf_counter_task_sched_out(ctx);
rotate_ctx(&cpuctx->ctx);
- rotate_ctx(ctx);
+ if (ctx)
+ rotate_ctx(ctx);
perf_counter_cpu_sched_in(cpuctx, cpu);
- perf_counter_task_sched_in(curr, cpu);
+ if (ctx)
+ perf_counter_task_sched_in(curr, cpu);
}
/*
@@ -1176,6 +1207,22 @@ static u64 perf_counter_read(struct perf_counter *counter)
return atomic64_read(&counter->count);
}
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+ struct task_struct *task)
+{
+ memset(ctx, 0, sizeof(*ctx));
+ spin_lock_init(&ctx->lock);
+ mutex_init(&ctx->mutex);
+ INIT_LIST_HEAD(&ctx->counter_list);
+ INIT_LIST_HEAD(&ctx->event_list);
+ atomic_set(&ctx->refcount, 1);
+ ctx->task = task;
+}
+
static void put_context(struct perf_counter_context *ctx)
{
if (ctx->task)
@@ -1186,6 +1233,7 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
{
struct perf_cpu_context *cpuctx;
struct perf_counter_context *ctx;
+ struct perf_counter_context *tctx;
struct task_struct *task;
/*
@@ -1225,15 +1273,36 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
if (!task)
return ERR_PTR(-ESRCH);
- ctx = &task->perf_counter_ctx;
- ctx->task = task;
-
/* Reuse ptrace permission checks for now. */
if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
- put_context(ctx);
+ put_task_struct(task);
return ERR_PTR(-EACCES);
}
+ ctx = task->perf_counter_ctxp;
+ if (!ctx) {
+ ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+ if (!ctx) {
+ put_task_struct(task);
+ return ERR_PTR(-ENOMEM);
+ }
+ __perf_counter_init_context(ctx, task);
+ /*
+ * Make sure other cpus see correct values for *ctx
+ * once task->perf_counter_ctxp is visible to them.
+ */
+ smp_wmb();
+ tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
+ if (tctx) {
+ /*
+ * We raced with some other task; use
+ * the context they set.
+ */
+ kfree(ctx);
+ ctx = tctx;
+ }
+ }
+
return ctx;
}
@@ -1242,6 +1311,7 @@ static void free_counter_rcu(struct rcu_head *head)
struct perf_counter *counter;
counter = container_of(head, struct perf_counter, rcu_head);
+ put_ctx(counter->ctx);
kfree(counter);
}
@@ -2247,7 +2317,7 @@ static void perf_counter_comm_event(struct perf_comm_event *comm_event)
perf_counter_comm_ctx(&cpuctx->ctx, comm_event);
put_cpu_var(perf_cpu_context);
- perf_counter_comm_ctx(&current->perf_counter_ctx, comm_event);
+ perf_counter_comm_ctx(current->perf_counter_ctxp, comm_event);
}
void perf_counter_comm(struct task_struct *task)
@@ -2256,7 +2326,9 @@ void perf_counter_comm(struct task_struct *task)
if (!atomic_read(&nr_comm_tracking))
return;
-
+ if (!current->perf_counter_ctxp)
+ return;
+
comm_event = (struct perf_comm_event){
.task = task,
.event = {
@@ -2372,7 +2444,7 @@ got_name:
perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
put_cpu_var(perf_cpu_context);
- perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+ perf_counter_mmap_ctx(current->perf_counter_ctxp, mmap_event);
kfree(buf);
}
@@ -2384,6 +2456,8 @@ void perf_counter_mmap(unsigned long addr, unsigned long len,
if (!atomic_read(&nr_mmap_tracking))
return;
+ if (!current->perf_counter_ctxp)
+ return;
mmap_event = (struct perf_mmap_event){
.file = file,
@@ -2985,6 +3059,7 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
counter->group_leader = group_leader;
counter->pmu = NULL;
counter->ctx = ctx;
+ get_ctx(ctx);
counter->state = PERF_COUNTER_STATE_INACTIVE;
if (hw_event->disabled)
@@ -3150,21 +3225,6 @@ err_put_context:
}
/*
- * Initialize the perf_counter context in a task_struct:
- */
-static void
-__perf_counter_init_context(struct perf_counter_context *ctx,
- struct task_struct *task)
-{
- memset(ctx, 0, sizeof(*ctx));
- spin_lock_init(&ctx->lock);
- mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->counter_list);
- INIT_LIST_HEAD(&ctx->event_list);
- ctx->task = task;
-}
-
-/*
* inherit a counter from parent task to child task:
*/
static struct perf_counter *
@@ -3195,7 +3255,6 @@ inherit_counter(struct perf_counter *parent_counter,
/*
* Link it up in the child's context:
*/
- child_counter->task = child;
add_counter_to_ctx(child_counter, child_ctx);
child_counter->parent = parent_counter;
@@ -3294,40 +3353,15 @@ __perf_counter_exit_task(struct task_struct *child,
struct perf_counter *parent_counter;
/*
- * If we do not self-reap then we have to wait for the
- * child task to unschedule (it will happen for sure),
- * so that its counter is at its final count. (This
- * condition triggers rarely - child tasks usually get
- * off their CPU before the parent has a chance to
- * get this far into the reaping action)
+ * Protect against concurrent operations on child_counter
+ * due its fd getting closed, etc.
*/
- if (child != current) {
- wait_task_inactive(child, 0);
- update_counter_times(child_counter);
- list_del_counter(child_counter, child_ctx);
- } else {
- struct perf_cpu_context *cpuctx;
- unsigned long flags;
-
- /*
- * Disable and unlink this counter.
- *
- * Be careful about zapping the list - IRQ/NMI context
- * could still be processing it:
- */
- local_irq_save(flags);
- perf_disable();
-
- cpuctx = &__get_cpu_var(perf_cpu_context);
+ mutex_lock(&child_counter->mutex);
- group_sched_out(child_counter, cpuctx, child_ctx);
- update_counter_times(child_counter);
+ update_counter_times(child_counter);
+ list_del_counter(child_counter, child_ctx);
- list_del_counter(child_counter, child_ctx);
-
- perf_enable();
- local_irq_restore(flags);
- }
+ mutex_unlock(&child_counter->mutex);
parent_counter = child_counter->parent;
/*
@@ -3346,19 +3380,29 @@ __perf_counter_exit_task(struct task_struct *child,
*
* Note: we may be running in child context, but the PID is not hashed
* anymore so new counters will not be added.
+ * (XXX not sure that is true when we get called from flush_old_exec.
+ * -- paulus)
*/
void perf_counter_exit_task(struct task_struct *child)
{
struct perf_counter *child_counter, *tmp;
struct perf_counter_context *child_ctx;
+ unsigned long flags;
WARN_ON_ONCE(child != current);
- child_ctx = &child->perf_counter_ctx;
+ child_ctx = child->perf_counter_ctxp;
- if (likely(!child_ctx->nr_counters))
+ if (likely(!child_ctx))
return;
+ local_irq_save(flags);
+ __perf_counter_task_sched_out(child_ctx);
+ child->perf_counter_ctxp = NULL;
+ local_irq_restore(flags);
+
+ mutex_lock(&child_ctx->mutex);
+
again:
list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
list_entry)
@@ -3371,6 +3415,10 @@ again:
*/
if (!list_empty(&child_ctx->counter_list))
goto again;
+
+ mutex_unlock(&child_ctx->mutex);
+
+ put_ctx(child_ctx);
}
/*
@@ -3382,19 +3430,25 @@ void perf_counter_init_task(struct task_struct *child)
struct perf_counter *counter;
struct task_struct *parent = current;
- child_ctx = &child->perf_counter_ctx;
- parent_ctx = &parent->perf_counter_ctx;
-
- __perf_counter_init_context(child_ctx, child);
+ child->perf_counter_ctxp = NULL;
/*
* This is executed from the parent task context, so inherit
- * counters that have been marked for cloning:
+ * counters that have been marked for cloning.
+ * First allocate and initialize a context for the child.
*/
- if (likely(!parent_ctx->nr_counters))
+ child_ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+ if (!child_ctx)
+ return;
+
+ parent_ctx = parent->perf_counter_ctxp;
+ if (likely(!parent_ctx || !parent_ctx->nr_counters))
return;
+ __perf_counter_init_context(child_ctx, child);
+ child->perf_counter_ctxp = child_ctx;
+
/*
* Lock the parent list. No need to lock the child - not PID
* hashed yet and not running, so nobody can access it.
--
1.6.0.4
^ permalink raw reply related [flat|nested] 32+ messages in thread

* [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-22 4:27 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner

When monitoring a process and its descendants with a set of inherited
counters, we can often get the situation in a context switch where
both the old (outgoing) and new (incoming) process have the same set
of counters, and their values are ultimately going to be added together.
In that situation it doesn't matter which set of counters are used to
count the activity for the new process, so there is really no need to
go through the process of reading the hardware counters and updating
the old task's counters and then setting up the PMU for the new task.

This optimizes the context switch in this situation. Instead of
scheduling out the perf_counter_context for the old task and
scheduling in the new context, we simply transfer the old context to
the new task and keep using it without interruption. The new context
gets transferred to the old task. This means that both tasks still
have a valid perf_counter_context, so no special case is introduced
when the old task gets scheduled in again, either on this CPU or
another CPU.

The equivalence of contexts is detected by keeping a pointer in each
cloned context pointing to the context it was cloned from. To cope
with the situation where a context is changed by adding or removing
counters after it has been cloned, we also keep a generation number on
each context which is incremented every time a context is changed.
When a context is cloned we take a copy of the parent's generation
number, and two cloned contexts are equivalent only if they have the
same parent and the same generation number. In order that the parent
context pointer remains valid (and is not reused), we increment the
parent context's reference count for each context cloned from it.

Since we don't have individual fds for the counters in a cloned
context, the only thing that can make two clones of a given parent
different after they have been cloned is enabling or disabling all
counters with prctl. To account for this, we keep a count of the
number of enabled counters in each context. Two contexts must have
the same number of enabled counters to be considered equivalent.

Here are some measurements of the context switch time as measured with
the lat_ctx benchmark from lmbench, comparing the times obtained with
and without this patch series:

                -----Unmodified-----    With this patch series
Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S

2 processes:
Average         3.44    6.45    11.24   3.12    3.39    3.60
St dev          0.04    0.04    0.13    0.05    0.17    0.19

8 processes:
Average         6.45    8.79    14.00   5.57    6.23    7.57
St dev          1.27    1.04    0.88    1.42    1.46    1.42

32 processes:
Average         5.56    8.43    13.78   5.28    5.55    7.15
St dev          0.41    0.47    0.53    0.54    0.57    0.81

The numbers are the mean and standard deviation of 20 runs of lat_ctx.
The "none" columns are lat_ctx run directly without any counters.
The "2 HW" columns are with lat_ctx run under perfstat, counting cycles
and instructions. The "4H+4S" columns are lat_ctx run under perfstat
with 4 hardware counters and 4 software counters (cycles, instructions,
cache references, cache misses, task clock, context switch, cpu
migrations, and page faults).

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 include/linux/perf_counter.h |   12 ++++-
 kernel/perf_counter.c        |  109 ++++++++++++++++++++++++++++++++++++-----
 kernel/sched.c               |    2 +-
 3 files changed, 107 insertions(+), 16 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 0713090..4cae01a 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -513,6 +513,7 @@ struct perf_counter_context {
struct list_head event_list;
int nr_counters;
int nr_active;
+ int nr_enabled;
int is_active;
atomic_t refcount;
struct task_struct *task;
@@ -522,6 +523,14 @@ struct perf_counter_context {
*/
u64 time;
u64 timestamp;
+
+ /*
+ * These fields let us detect when two contexts have both
+ * been cloned (inherited) from a common ancestor.
+ */
+ struct perf_counter_context *parent_ctx;
+ u32 parent_gen;
+ u32 generation;
};
/**
@@ -552,7 +561,8 @@ extern int perf_max_counters;
extern const struct pmu *hw_perf_counter_init(struct perf_counter *counter);
extern void perf_counter_task_sched_in(struct task_struct *task, int cpu);
-extern void perf_counter_task_sched_out(struct task_struct *task, int cpu);
+extern void perf_counter_task_sched_out(struct task_struct *task,
+ struct task_struct *next, int cpu);
extern void perf_counter_task_tick(struct task_struct *task, int cpu);
extern void perf_counter_init_task(struct task_struct *child);
extern void perf_counter_exit_task(struct task_struct *child);
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 06ea3ea..c100554 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -104,8 +104,11 @@ static void get_ctx(struct perf_counter_context *ctx)
static void put_ctx(struct perf_counter_context *ctx)
{
- if (atomic_dec_and_test(&ctx->refcount))
+ if (atomic_dec_and_test(&ctx->refcount)) {
+ if (ctx->parent_ctx)
+ put_ctx(ctx->parent_ctx);
kfree(ctx);
+ }
}
static void
@@ -127,6 +130,8 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
list_add_rcu(&counter->event_entry, &ctx->event_list);
ctx->nr_counters++;
+ if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ ctx->nr_enabled++;
}
/*
@@ -141,6 +146,8 @@ list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
if (list_empty(&counter->list_entry))
return;
ctx->nr_counters--;
+ if (counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ ctx->nr_enabled--;
list_del_init(&counter->list_entry);
list_del_rcu(&counter->event_entry);
@@ -204,6 +211,22 @@ group_sched_out(struct perf_counter *group_counter,
}
/*
+ * Mark this context as not being a clone of another.
+ * Called when counters are added to or removed from this context.
+ * We also increment our generation number so that anything that
+ * was cloned from this context before this will not match anything
+ * cloned from this context after this.
+ */
+static void unclone_ctx(struct perf_counter_context *ctx)
+{
+ ++ctx->generation;
+ if (!ctx->parent_ctx)
+ return;
+ put_ctx(ctx->parent_ctx);
+ ctx->parent_ctx = NULL;
+}
+
+/*
* Cross CPU call to remove a performance counter
*
* We disable the counter on the hardware level first. After that we
@@ -263,6 +286,7 @@ static void perf_counter_remove_from_context(struct perf_counter *counter)
struct perf_counter_context *ctx = counter->ctx;
struct task_struct *task = ctx->task;
+ unclone_ctx(ctx);
if (!task) {
/*
* Per cpu counters are removed via an smp call and
@@ -378,6 +402,7 @@ static void __perf_counter_disable(void *info)
else
counter_sched_out(counter, cpuctx, ctx);
counter->state = PERF_COUNTER_STATE_OFF;
+ ctx->nr_enabled--;
}
spin_unlock_irqrestore(&ctx->lock, flags);
@@ -419,6 +444,7 @@ static void perf_counter_disable(struct perf_counter *counter)
if (counter->state == PERF_COUNTER_STATE_INACTIVE) {
update_counter_times(counter);
counter->state = PERF_COUNTER_STATE_OFF;
+ ctx->nr_enabled--;
}
spin_unlock_irq(&ctx->lock);
@@ -727,6 +753,7 @@ static void __perf_counter_enable(void *info)
goto unlock;
counter->state = PERF_COUNTER_STATE_INACTIVE;
counter->tstamp_enabled = ctx->time - counter->total_time_enabled;
+ ctx->nr_enabled++;
/*
* If the counter is in a group and isn't the group leader,
@@ -817,6 +844,7 @@ static void perf_counter_enable(struct perf_counter *counter)
counter->state = PERF_COUNTER_STATE_INACTIVE;
counter->tstamp_enabled = ctx->time - counter->total_time_enabled;
+ ctx->nr_enabled++;
}
out:
spin_unlock_irq(&ctx->lock);
@@ -862,6 +890,25 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
}
/*
+ * Test whether two contexts are equivalent, i.e. whether they
+ * have both been cloned from the same version of the same context
+ * and they both have the same number of enabled counters.
+ * If the number of enabled counters is the same, then the set
+ * of enabled counters should be the same, because these are both
+ * inherited contexts, therefore we can't access individual counters
+ * in them directly with an fd; we can only enable/disable all
+ * counters via prctl, or enable/disable all counters in a family
+ * via ioctl, which will have the same effect on both contexts.
+ */
+static int context_equiv(struct perf_counter_context *ctx1,
+ struct perf_counter_context *ctx2)
+{
+ return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx
+ && ctx1->parent_gen == ctx2->parent_gen
+ && ctx1->nr_enabled == ctx2->nr_enabled;
+}
+
+/*
* Called from scheduler to remove the counters of the current task,
* with interrupts disabled.
*
@@ -872,10 +919,12 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
* accessing the counter control register. If a NMI hits, then it will
* not restart the counter.
*/
-void perf_counter_task_sched_out(struct task_struct *task, int cpu)
+void perf_counter_task_sched_out(struct task_struct *task,
+ struct task_struct *next, int cpu)
{
struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
struct perf_counter_context *ctx = task->perf_counter_ctxp;
+ struct perf_counter_context *next_ctx;
struct pt_regs *regs;
if (likely(!ctx || !cpuctx->task_ctx))
@@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu)
regs = task_pt_regs(task);
perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0);
+
+ next_ctx = next->perf_counter_ctxp;
+ if (next_ctx && context_equiv(ctx, next_ctx)) {
+ task->perf_counter_ctxp = next_ctx;
+ next->perf_counter_ctxp = ctx;
+ ctx->task = next;
+ next_ctx->task = task;
+ return;
+ }
+
__perf_counter_sched_out(ctx, cpuctx);
cpuctx->task_ctx = NULL;
@@ -998,6 +1057,8 @@ void perf_counter_task_sched_in(struct task_struct *task, int cpu)
if (likely(!ctx))
return;
+ if (cpuctx->task_ctx == ctx)
+ return;
__perf_counter_sched_in(ctx, cpuctx, cpu);
cpuctx->task_ctx = ctx;
}
@@ -3253,6 +3314,16 @@ inherit_counter(struct perf_counter *parent_counter,
return child_counter;
/*
+ * Make the child state follow the state of the parent counter,
+ * not its hw_event.disabled bit. We hold the parent's mutex,
+ * so we won't race with perf_counter_{en,dis}able_family.
+ */
+ if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
+ child_counter->state = PERF_COUNTER_STATE_INACTIVE;
+ else
+ child_counter->state = PERF_COUNTER_STATE_OFF;
+
+ /*
* Link it up in the child's context:
*/
add_counter_to_ctx(child_counter, child_ctx);
@@ -3277,16 +3348,6 @@ inherit_counter(struct perf_counter *parent_counter,
mutex_lock(&parent_counter->mutex);
list_add_tail(&child_counter->child_list, &parent_counter->child_list);
- /*
- * Make the child state follow the state of the parent counter,
- * not its hw_event.disabled bit. We hold the parent's mutex,
- * so we won't race with perf_counter_{en,dis}able_family.
- */
- if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE)
- child_counter->state = PERF_COUNTER_STATE_INACTIVE;
- else
- child_counter->state = PERF_COUNTER_STATE_OFF;
-
mutex_unlock(&parent_counter->mutex);
return child_counter;
@@ -3429,6 +3490,7 @@ void perf_counter_init_task(struct task_struct *child)
struct perf_counter_context *child_ctx, *parent_ctx;
struct perf_counter *counter;
struct task_struct *parent = current;
+ int inherited_all = 1;
child->perf_counter_ctxp = NULL;
@@ -3463,12 +3525,31 @@ void perf_counter_init_task(struct task_struct *child)
if (counter != counter->group_leader)
continue;
- if (!counter->hw_event.inherit)
+ if (!counter->hw_event.inherit) {
+ inherited_all = 0;
continue;
+ }
if (inherit_group(counter, parent,
- parent_ctx, child, child_ctx))
+ parent_ctx, child, child_ctx)) {
+ inherited_all = 0;
break;
+ }
+ }
+
+ if (inherited_all) {
+ /*
+ * Mark the child context as a clone of the parent
+ * context, or of whatever the parent is a clone of.
+ */
+ if (parent_ctx->parent_ctx) {
+ child_ctx->parent_ctx = parent_ctx->parent_ctx;
+ child_ctx->parent_gen = parent_ctx->parent_gen;
+ } else {
+ child_ctx->parent_ctx = parent_ctx;
+ child_ctx->parent_gen = parent_ctx->generation;
+ }
+ get_ctx(child_ctx->parent_ctx);
}
mutex_unlock(&parent_ctx->mutex);
diff --git a/kernel/sched.c b/kernel/sched.c
index 419a39d..4c0d58b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5091,7 +5091,7 @@ need_resched_nonpreemptible:
if (likely(prev != next)) {
sched_info_switch(prev, next);
- perf_counter_task_sched_out(prev, cpu);
+ perf_counter_task_sched_out(prev, next, cpu);
rq->nr_switches++;
rq->curr = next;
--
1.6.0.4

^ permalink raw reply related [flat|nested] 32+ messages in thread
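
The core of the optimization is the swap in perf_counter_task_sched_out();
condensed from the patch above, the fast path amounts to:

	next_ctx = next->perf_counter_ctxp;
	if (next_ctx && context_equiv(ctx, next_ctx)) {
		/*
		 * Same inherited counter set on both sides: trade the
		 * contexts between prev and next and leave the PMU alone.
		 */
		task->perf_counter_ctxp = next_ctx;
		next->perf_counter_ctxp = ctx;
		ctx->task = next;
		next_ctx->task = task;
		return;
	}

	/* Otherwise fall back to the full PMU switch. */
	__perf_counter_sched_out(ctx, cpuctx);
	cpuctx->task_ctx = NULL;
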
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 8:16 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> Since we don't have individual fds for the counters in a cloned
> context, the only thing that can make two clones of a given parent
> different after they have been cloned is enabling or disabling all
> counters with prctl. To account for this, we keep a count of the
> number of enabled counters in each context. Two contexts must have
> the same number of enabled counters to be considered equivalent.

Curious point that.. so prctl() can disable counters it doesn't own.

Shouldn't we instead fix that?

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-22 9:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:
> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > Since we don't have individual fds for the counters in a cloned
> > context, the only thing that can make two clones of a given parent
> > different after they have been cloned is enabling or disabling all
> > counters with prctl. To account for this, we keep a count of the
> > number of enabled counters in each context. Two contexts must have
> > the same number of enabled counters to be considered equivalent.
> 
> Curious point that.. so prctl() can disable counters it doesn't own.
> 
> Shouldn't we instead fix that?

Well, prctl enables/disables the counters that are counting on the
current process, regardless of who or what created them. I always
thought that was a little strange; maybe it is useful to be able to
disable all the counters that any other process might have put on to
you, but I can't think of a scenario where you'd really need to do
that, particularly since the disable is a one-shot operation, and
doesn't prevent new (enabled) counters being attached to you.

On the other hand, what does "all the counters I own" mean? Does it
mean all the ones that I have fds open for? Or does it mean all the
ones that I created? Either way we don't have a good way to enumerate
them.

Paul.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 10:08 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 19:56 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
> 
> > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > Since we don't have individual fds for the counters in a cloned
> > > context, the only thing that can make two clones of a given parent
> > > different after they have been cloned is enabling or disabling all
> > > counters with prctl. To account for this, we keep a count of the
> > > number of enabled counters in each context. Two contexts must have
> > > the same number of enabled counters to be considered equivalent.
> > 
> > Curious point that.. so prctl() can disable counters it doesn't own.
> > 
> > Shouldn't we instead fix that?
> 
> Well, prctl enables/disables the counters that are counting on the
> current process, regardless of who or what created them. I always
> thought that was a little strange; maybe it is useful to be able to
> disable all the counters that any other process might have put on to
> you, but I can't think of a scenario where you'd really need to do
> that, particularly since the disable is a one-shot operation, and
> doesn't prevent new (enabled) counters being attached to you.
> 
> On the other hand, what does "all the counters I own" mean? Does it
> mean all the ones that I have fds open for? Or does it mean all the
> ones that I created? Either way we don't have a good way to enumerate
> them.

I'm for all counters you created (ie have a fd for). Being able to
disable counters others created on you just sounds wrong.

If we can settle on a semantic, I'm sure we can implement it :-)

Ingo, Corey, any opinions?

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Ingo Molnar @ 2009-05-23 12:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 19:56 +1000, Paul Mackerras wrote:
> > Peter Zijlstra writes:
> > 
> > > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > > Since we don't have individual fds for the counters in a cloned
> > > > context, the only thing that can make two clones of a given parent
> > > > different after they have been cloned is enabling or disabling all
> > > > counters with prctl. To account for this, we keep a count of the
> > > > number of enabled counters in each context. Two contexts must have
> > > > the same number of enabled counters to be considered equivalent.
> > > 
> > > Curious point that.. so prctl() can disable counters it doesn't own.
> > > 
> > > Shouldn't we instead fix that?
> > 
> > Well, prctl enables/disables the counters that are counting on the
> > current process, regardless of who or what created them. I always
> > thought that was a little strange; maybe it is useful to be able to
> > disable all the counters that any other process might have put on to
> > you, but I can't think of a scenario where you'd really need to do
> > that, particularly since the disable is a one-shot operation, and
> > doesn't prevent new (enabled) counters being attached to you.
> > 
> > On the other hand, what does "all the counters I own" mean?
> > Does it mean all the ones that I have fds open for? Or does it
> > mean all the ones that I created? Either way we don't have a
> > good way to enumerate them.
> 
> I'm for all counters you created (ie have a fd for). Being able to
> disable counters others created on you just sounds wrong.
> 
> If we can settle on a semantic, I'm sure we can implement it :-)
> 
> Ingo, Corey, any opinions?

It indeed doesnt sound correct that we can disable counters others
created on us - especially if they are in a different (higher
privileged) security context than us.

OTOH, enabling/disabling counters in specific functions of a library
might be a valid use-case. So perhaps make this an attribute:
..transparent or so, with perf stat defaulting on it to be
transparent (i.e. not child context disable-able).

	Ingo

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-23 13:06 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

On Sat, 2009-05-23 at 14:38 +0200, Ingo Molnar wrote:
> > I'm for all counters you created (ie have a fd for). Being able to
> > disable counters others created on you just sounds wrong.
> > 
> > If we can settle on a semantic, I'm sure we can implement it :-)
> > 
> > Ingo, Corey, any opinions?
> 
> It indeed doesnt sound correct that we can disable counters others
> created on us - especially if they are in a different (higher
> privileged) security context than us.
> 
> OTOH, enabling/disabling counters in specific functions of a library
> might be a valid use-case. So perhaps make this an attribute:
> ..transparent or so, with perf stat defaulting on it to be
> transparent (i.e. not child context disable-able).

I'm not sure that's something we want to do.

Furthermore, if we do want it, the current implementation is not
sufficient, because, as Paul noted, we can attach a new counter right
after the disable.

I really think such limitations should come from whatever security
policy there is on attaching counters. Eg. using selinux, label gnupg
as non-countable so that you simply cannot attach (incl inherit)
counters to it.

Allowing such hooks into libraries will destroy transparency for
developers and I don't think that's something we'd want to promote.

I'll implement my suggestion so we can take it from there.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-24 23:55 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner

Ingo Molnar writes:
> OTOH, enabling/disabling counters in specific functions of a library
> might be a valid use-case.

If that's the use case, wouldn't we want the "enable" to restore the
previous state, rather than enabling all counters?

Paul.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 8:32 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> 
> The equivalence of contexts is detected by keeping a pointer in
> each cloned context pointing to the context it was cloned from.
> To cope with the situation where a context is changed by adding
> or removing counters after it has been cloned, we also keep a
> generation number on each context which is incremented every time
> a context is changed. When a context is cloned we take a copy
> of the parent's generation number, and two cloned contexts are
> equivalent only if they have the same parent and the same
> generation number. In order that the parent context pointer
> remains valid (and is not reused), we increment the parent
> context's reference count for each context cloned from it.
> +	u32			generation;

Suppose someone writes a malicious proglet that inherits the counters,
puts the child to sleep, does 2^32 mods on the counter set, and then
wakes up the child.

Would that merely corrupt the results, or make the kernel explode?

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Ingo Molnar @ 2009-05-22 8:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> 
> > The equivalence of contexts is detected by keeping a pointer in
> > each cloned context pointing to the context it was cloned from.
> > To cope with the situation where a context is changed by adding
> > or removing counters after it has been cloned, we also keep a
> > generation number on each context which is incremented every time
> > a context is changed. When a context is cloned we take a copy
> > of the parent's generation number, and two cloned contexts are
> > equivalent only if they have the same parent and the same
> > generation number. In order that the parent context pointer
> > remains valid (and is not reused), we increment the parent
> > context's reference count for each context cloned from it.
> 
> > +	u32			generation;
> 
> Suppose someone writes a malicious proglet that inherits the
> counters, puts the child to sleep, does 2^32 mods on the counter
> set, and then wakes up the child.
> 
> Would that merely corrupt the results, or make the kernel explode?

i think it should be an u64.

	Ingo

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 9:02 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 10:57 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> > > The equivalence of contexts is detected by keeping a pointer in
> > > each cloned context pointing to the context it was cloned from.
> > > To cope with the situation where a context is changed by adding
> > > or removing counters after it has been cloned, we also keep a
> > > generation number on each context which is incremented every time
> > > a context is changed. When a context is cloned we take a copy
> > > of the parent's generation number, and two cloned contexts are
> > > equivalent only if they have the same parent and the same
> > > generation number. In order that the parent context pointer
> > > remains valid (and is not reused), we increment the parent
> > > context's reference count for each context cloned from it.
> > > 
> > > +	u32			generation;
> > 
> > Suppose someone writes a malicious proglet that inherits the
> > counters, puts the child to sleep, does 2^32 mods on the counter
> > set, and then wakes up the child.
> > 
> > Would that merely corrupt the results, or make the kernel explode?
> 
> i think it should be an u64.

So as to make the attack infeasible by death of the attacker and
burnout of the solar-system? :-)

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Ingo Molnar @ 2009-05-22 10:14 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Paul Mackerras, linux-kernel, Corey Ashford, Thomas Gleixner

* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Fri, 2009-05-22 at 10:57 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > 
> > > On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > > 
> > > > The equivalence of contexts is detected by keeping a pointer in
> > > > each cloned context pointing to the context it was cloned from.
> > > > To cope with the situation where a context is changed by adding
> > > > or removing counters after it has been cloned, we also keep a
> > > > generation number on each context which is incremented every time
> > > > a context is changed. When a context is cloned we take a copy
> > > > of the parent's generation number, and two cloned contexts are
> > > > equivalent only if they have the same parent and the same
> > > > generation number. In order that the parent context pointer
> > > > remains valid (and is not reused), we increment the parent
> > > > context's reference count for each context cloned from it.
> > > > 
> > > > +	u32			generation;
> > > 
> > > Suppose someone writes a malicious proglet that inherits the
> > > counters, puts the child to sleep, does 2^32 mods on the counter
> > > set, and then wakes up the child.
> > > 
> > > Would that merely corrupt the results, or make the kernel explode?
> > 
> > i think it should be an u64.
> 
> So as to make the attack infeasible by death of the attacker and
> burnout of the solar-system? :-)

Yeah. We only have to make it for the next 1 billion years anyway -
then due to the sun being ~50% brighter the planet as we know it will
be scorched up anyway. (Unless some entity puts it under massive space
based shades until then, to preserve the ancient biosphere for
nostalgic or museologic reasons or so.)

	Ingo

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-22 9:29 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:
> Suppose someone writes a malicious proglet that inherits the counters,
> puts the child to sleep, does 2^32 mods on the counter set, and then
> wakes up the child.
> 
> Would that merely corrupt the results, or make the kernel explode?

You'd have to do something like create some counters on yourself, fork,
do the 2^32 counter creations and deletions, then do another fork.
All that would happen is that some of the counters would count on one
of the child processes when they should be counting on the other.

But it would be easy and cheap to make the generation count be 64 bits,
and then it won't overflow in my lifetime at least, and after that I
don't care. :) So I agree with Ingo that we should just do that.

Paul.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 9:22 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> 
>                 -----Unmodified-----    With this patch series
> Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> 
> 2 processes:
> Average         3.44    6.45    11.24   3.12    3.39    3.60
> St dev          0.04    0.04    0.13    0.05    0.17    0.19
> 
> 8 processes:
> Average         6.45    8.79    14.00   5.57    6.23    7.57
> St dev          1.27    1.04    0.88    1.42    1.46    1.42
> 
> 32 processes:
> Average         5.56    8.43    13.78   5.28    5.55    7.15
> St dev          0.41    0.47    0.53    0.54    0.57    0.81

Any clues as to why the time is still dependent on the number of
counters in the context? The approach seems to be O(1) in that it
does a simple counter context swap on sched_out and nothing on sched_in.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Peter Zijlstra @ 2009-05-22 9:42 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

On Fri, 2009-05-22 at 11:22 +0200, Peter Zijlstra wrote:
> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> >                 -----Unmodified-----    With this patch series
> > Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> > 
> > 2 processes:
> > Average         3.44    6.45    11.24   3.12    3.39    3.60
> > St dev          0.04    0.04    0.13    0.05    0.17    0.19
> > 
> > 8 processes:
> > Average         6.45    8.79    14.00   5.57    6.23    7.57
> > St dev          1.27    1.04    0.88    1.42    1.46    1.42
> > 
> > 32 processes:
> > Average         5.56    8.43    13.78   5.28    5.55    7.15
> > St dev          0.41    0.47    0.53    0.54    0.57    0.81
> 
> Any clues as to why the time is still dependent on the number of
> counters in the context? The approach seems to be O(1) in that it
> does a simple counter context swap on sched_out and nothing on sched_in.

Ah, should I read this as 'n' lmbench instances on a single cpu? So that
we get multiple inheritance sets mixed and have to switch between them?

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-22 10:07 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:
> Ah, should I read this as 'n' lmbench instances on a single cpu? So that
> we get multiple inheritance sets mixed and have to switch between them?

No, it's one lat_ctx instance, which creates N processes connected in
a ring with pipes, and passes a token around some number of times and
measures the overall time to do that. It subtracts off the time it
takes to do all the pipe reads and writes in a single process to get
an estimate of how much the context switches are costing.

Paul.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts
From: Paul Mackerras @ 2009-05-22 10:05 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner

Peter Zijlstra writes:
> On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote:
> > 
> >                 -----Unmodified-----    With this patch series
> > Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S
> > 
> > 2 processes:
> > Average         3.44    6.45    11.24   3.12    3.39    3.60
> > St dev          0.04    0.04    0.13    0.05    0.17    0.19
> > 
> > 8 processes:
> > Average         6.45    8.79    14.00   5.57    6.23    7.57
> > St dev          1.27    1.04    0.88    1.42    1.46    1.42
> > 
> > 32 processes:
> > Average         5.56    8.43    13.78   5.28    5.55    7.15
> > St dev          0.41    0.47    0.53    0.54    0.57    0.81
> 
> Any clues as to why the time is still dependent on the number of
> counters in the context? The approach seems to be O(1) in that it
> does a simple counter context swap on sched_out and nothing on sched_in.

Only the context switches between the lat_ctx processes will be
optimized; switches between them and other processes will still do
the full PMU switch. The CPU would be switching to other processes
from time to time during the run (e.g. to run various kernel daemons,
which seem to continue to proliferate) so some fraction of the context
switches would be as expensive as before. Probably I should measure
what that fraction is, but it's Friday night and I'm feeling lazy. :)

Paul.

^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras ` (2 preceding siblings ...) 2009-05-22 9:22 ` Peter Zijlstra @ 2009-05-22 10:11 ` Ingo Molnar 2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras ` (2 subsequent siblings) 6 siblings, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-05-22 10:11 UTC (permalink / raw) To: Paul Mackerras Cc: Peter Zijlstra, linux-kernel, Corey Ashford, Thomas Gleixner * Paul Mackerras <paulus@samba.org> wrote: > When monitoring a process and its descendants with a set of > inherited counters, we can often get the situation in a context > switch where both the old (outgoing) and new (incoming) process > have the same set of counters, and their values are ultimately > going to be added together. In that situation it doesn't matter > which set of counters are used to count the activity for the new > process, so there is really no need to go through the process of > reading the hardware counters and updating the old task's counters > and then setting up the PMU for the new task. > > This optimizes the context switch in this situation. Instead of > scheduling out the perf_counter_context for the old task and > scheduling in the new context, we simply transfer the old context > to the new task and keep using it without interruption. The new > context gets transferred to the old task. This means that both > tasks still have a valid perf_counter_context, so no special case > is introduced when the old task gets scheduled in again, either on > this CPU or another CPU. > > The equivalence of contexts is detected by keeping a pointer in > each cloned context pointing to the context it was cloned from. To > cope with the situation where a context is changed by adding or > removing counters after it has been cloned, we also keep a > generation number on each context which is incremented every time > a context is changed. When a context is cloned we take a copy of > the parent's generation number, and two cloned contexts are > equivalent only if they have the same parent and the same > generation number. In order that the parent context pointer > remains valid (and is not reused), we increment the parent > context's reference count for each context cloned from it. > > Since we don't have individual fds for the counters in a cloned > context, the only thing that can make two clones of a given parent > different after they have been cloned is enabling or disabling all > counters with prctl. To account for this, we keep a count of the > number of enabled counters in each context. Two contexts must have > the same number of enabled counters to be considered equivalent. > > Here are some measurements of the context switch time as measured with > the lat_ctx benchmark from lmbench, comparing the times obtained with > and without this patch series: > > -----Unmodified----- With this patch series > Counters: none 2 HW 4H+4S none 2 HW 4H+4S > > 2 processes: > Average 3.44 6.45 11.24 3.12 3.39 3.60 > St dev 0.04 0.04 0.13 0.05 0.17 0.19 > > 8 processes: > Average 6.45 8.79 14.00 5.57 6.23 7.57 > St dev 1.27 1.04 0.88 1.42 1.46 1.42 > > 32 processes: > Average 5.56 8.43 13.78 5.28 5.55 7.15 > St dev 0.41 0.47 0.53 0.54 0.57 0.81 > > The numbers are the mean and standard deviation of 20 runs of > lat_ctx. The "none" columns are lat_ctx run directly without any > counters. 
The "2 HW" columns are with lat_ctx run under perfstat, > counting cycles and instructions. The "4H+4S" columns are lat_ctx > run under perfstat with 4 hardware counters and 4 software > counters (cycles, instructions, cache references, cache misses, > task clock, context switch, cpu migrations, and page faults). > > Signed-off-by: Paul Mackerras <paulus@samba.org> > --- > include/linux/perf_counter.h | 12 ++++- > kernel/perf_counter.c | 109 ++++++++++++++++++++++++++++++++++++----- > kernel/sched.c | 2 +- > 3 files changed, 107 insertions(+), 16 deletions(-) Impressive! I'm wondering where the sensitivity of lat_ctx on the number of counters comes from. I'd expect there to be constant (and very low) overhead. It could be measurement noise - lat_ctx is very sensitive on L2 layout and memory allocation patterns - those are very hard to eliminate and are not measured via the stddev numbers. (many of those effects are per bootup specific, and bootups dont randomize them - so there's no easy way to measure their statistical impact.) Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras ` (3 preceding siblings ...) 2009-05-22 10:11 ` Ingo Molnar @ 2009-05-22 10:27 ` tip-bot for Paul Mackerras 2009-05-24 11:33 ` Ingo Molnar 2009-05-22 10:36 ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar 2009-05-22 13:46 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra 6 siblings, 1 reply; 32+ messages in thread From: tip-bot for Paul Mackerras @ 2009-05-22 10:27 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti, tglx, cjashfor, mingo Commit-ID: 564c2b210add41df9a3a5aaa365c1d97cff6110d Gitweb: http://git.kernel.org/tip/564c2b210add41df9a3a5aaa365c1d97cff6110d Author: Paul Mackerras <paulus@samba.org> AuthorDate: Fri, 22 May 2009 14:27:22 +1000 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Fri, 22 May 2009 12:18:20 +0200 perf_counter: Optimize context switch between identical inherited contexts When monitoring a process and its descendants with a set of inherited counters, we can often get the situation in a context switch where both the old (outgoing) and new (incoming) process have the same set of counters, and their values are ultimately going to be added together. In that situation it doesn't matter which set of counters are used to count the activity for the new process, so there is really no need to go through the process of reading the hardware counters and updating the old task's counters and then setting up the PMU for the new task. This optimizes the context switch in this situation. Instead of scheduling out the perf_counter_context for the old task and scheduling in the new context, we simply transfer the old context to the new task and keep using it without interruption. The new context gets transferred to the old task. This means that both tasks still have a valid perf_counter_context, so no special case is introduced when the old task gets scheduled in again, either on this CPU or another CPU. The equivalence of contexts is detected by keeping a pointer in each cloned context pointing to the context it was cloned from. To cope with the situation where a context is changed by adding or removing counters after it has been cloned, we also keep a generation number on each context which is incremented every time a context is changed. When a context is cloned we take a copy of the parent's generation number, and two cloned contexts are equivalent only if they have the same parent and the same generation number. In order that the parent context pointer remains valid (and is not reused), we increment the parent context's reference count for each context cloned from it. Since we don't have individual fds for the counters in a cloned context, the only thing that can make two clones of a given parent different after they have been cloned is enabling or disabling all counters with prctl. To account for this, we keep a count of the number of enabled counters in each context. Two contexts must have the same number of enabled counters to be considered equivalent. 
Here are some measurements of the context switch time as measured with the lat_ctx benchmark from lmbench, comparing the times obtained with and without this patch series: -----Unmodified----- With this patch series Counters: none 2 HW 4H+4S none 2 HW 4H+4S 2 processes: Average 3.44 6.45 11.24 3.12 3.39 3.60 St dev 0.04 0.04 0.13 0.05 0.17 0.19 8 processes: Average 6.45 8.79 14.00 5.57 6.23 7.57 St dev 1.27 1.04 0.88 1.42 1.46 1.42 32 processes: Average 5.56 8.43 13.78 5.28 5.55 7.15 St dev 0.41 0.47 0.53 0.54 0.57 0.81 The numbers are the mean and standard deviation of 20 runs of lat_ctx. The "none" columns are lat_ctx run directly without any counters. The "2 HW" columns are with lat_ctx run under perfstat, counting cycles and instructions. The "4H+4S" columns are lat_ctx run under perfstat with 4 hardware counters and 4 software counters (cycles, instructions, cache references, cache misses, task clock, context switch, cpu migrations, and page faults). [ Impact: performance optimization of counter context-switches ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/perf_counter.h | 12 ++++- kernel/perf_counter.c | 109 ++++++++++++++++++++++++++++++++++++----- kernel/sched.c | 2 +- 3 files changed, 107 insertions(+), 16 deletions(-) diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h index 0713090..4cae01a 100644 --- a/include/linux/perf_counter.h +++ b/include/linux/perf_counter.h @@ -513,6 +513,7 @@ struct perf_counter_context { struct list_head event_list; int nr_counters; int nr_active; + int nr_enabled; int is_active; atomic_t refcount; struct task_struct *task; @@ -522,6 +523,14 @@ struct perf_counter_context { */ u64 time; u64 timestamp; + + /* + * These fields let us detect when two contexts have both + * been cloned (inherited) from a common ancestor. 
+ */ + struct perf_counter_context *parent_ctx; + u32 parent_gen; + u32 generation; }; /** @@ -552,7 +561,8 @@ extern int perf_max_counters; extern const struct pmu *hw_perf_counter_init(struct perf_counter *counter); extern void perf_counter_task_sched_in(struct task_struct *task, int cpu); -extern void perf_counter_task_sched_out(struct task_struct *task, int cpu); +extern void perf_counter_task_sched_out(struct task_struct *task, + struct task_struct *next, int cpu); extern void perf_counter_task_tick(struct task_struct *task, int cpu); extern void perf_counter_init_task(struct task_struct *child); extern void perf_counter_exit_task(struct task_struct *child); diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c index 06ea3ea..c100554 100644 --- a/kernel/perf_counter.c +++ b/kernel/perf_counter.c @@ -104,8 +104,11 @@ static void get_ctx(struct perf_counter_context *ctx) static void put_ctx(struct perf_counter_context *ctx) { - if (atomic_dec_and_test(&ctx->refcount)) + if (atomic_dec_and_test(&ctx->refcount)) { + if (ctx->parent_ctx) + put_ctx(ctx->parent_ctx); kfree(ctx); + } } static void @@ -127,6 +130,8 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx) list_add_rcu(&counter->event_entry, &ctx->event_list); ctx->nr_counters++; + if (counter->state >= PERF_COUNTER_STATE_INACTIVE) + ctx->nr_enabled++; } /* @@ -141,6 +146,8 @@ list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx) if (list_empty(&counter->list_entry)) return; ctx->nr_counters--; + if (counter->state >= PERF_COUNTER_STATE_INACTIVE) + ctx->nr_enabled--; list_del_init(&counter->list_entry); list_del_rcu(&counter->event_entry); @@ -204,6 +211,22 @@ group_sched_out(struct perf_counter *group_counter, } /* + * Mark this context as not being a clone of another. + * Called when counters are added to or removed from this context. + * We also increment our generation number so that anything that + * was cloned from this context before this will not match anything + * cloned from this context after this. + */ +static void unclone_ctx(struct perf_counter_context *ctx) +{ + ++ctx->generation; + if (!ctx->parent_ctx) + return; + put_ctx(ctx->parent_ctx); + ctx->parent_ctx = NULL; +} + +/* * Cross CPU call to remove a performance counter * * We disable the counter on the hardware level first. 
After that we @@ -263,6 +286,7 @@ static void perf_counter_remove_from_context(struct perf_counter *counter) struct perf_counter_context *ctx = counter->ctx; struct task_struct *task = ctx->task; + unclone_ctx(ctx); if (!task) { /* * Per cpu counters are removed via an smp call and @@ -378,6 +402,7 @@ static void __perf_counter_disable(void *info) else counter_sched_out(counter, cpuctx, ctx); counter->state = PERF_COUNTER_STATE_OFF; + ctx->nr_enabled--; } spin_unlock_irqrestore(&ctx->lock, flags); @@ -419,6 +444,7 @@ static void perf_counter_disable(struct perf_counter *counter) if (counter->state == PERF_COUNTER_STATE_INACTIVE) { update_counter_times(counter); counter->state = PERF_COUNTER_STATE_OFF; + ctx->nr_enabled--; } spin_unlock_irq(&ctx->lock); @@ -727,6 +753,7 @@ static void __perf_counter_enable(void *info) goto unlock; counter->state = PERF_COUNTER_STATE_INACTIVE; counter->tstamp_enabled = ctx->time - counter->total_time_enabled; + ctx->nr_enabled++; /* * If the counter is in a group and isn't the group leader, @@ -817,6 +844,7 @@ static void perf_counter_enable(struct perf_counter *counter) counter->state = PERF_COUNTER_STATE_INACTIVE; counter->tstamp_enabled = ctx->time - counter->total_time_enabled; + ctx->nr_enabled++; } out: spin_unlock_irq(&ctx->lock); @@ -862,6 +890,25 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx, } /* + * Test whether two contexts are equivalent, i.e. whether they + * have both been cloned from the same version of the same context + * and they both have the same number of enabled counters. + * If the number of enabled counters is the same, then the set + * of enabled counters should be the same, because these are both + * inherited contexts, therefore we can't access individual counters + * in them directly with an fd; we can only enable/disable all + * counters via prctl, or enable/disable all counters in a family + * via ioctl, which will have the same effect on both contexts. + */ +static int context_equiv(struct perf_counter_context *ctx1, + struct perf_counter_context *ctx2) +{ + return ctx1->parent_ctx && ctx1->parent_ctx == ctx2->parent_ctx + && ctx1->parent_gen == ctx2->parent_gen + && ctx1->nr_enabled == ctx2->nr_enabled; +} + +/* * Called from scheduler to remove the counters of the current task, * with interrupts disabled. * @@ -872,10 +919,12 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx, * accessing the counter control register. If a NMI hits, then it will * not restart the counter. 
*/ -void perf_counter_task_sched_out(struct task_struct *task, int cpu) +void perf_counter_task_sched_out(struct task_struct *task, + struct task_struct *next, int cpu) { struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu); struct perf_counter_context *ctx = task->perf_counter_ctxp; + struct perf_counter_context *next_ctx; struct pt_regs *regs; if (likely(!ctx || !cpuctx->task_ctx)) @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu) regs = task_pt_regs(task); perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); + + next_ctx = next->perf_counter_ctxp; + if (next_ctx && context_equiv(ctx, next_ctx)) { + task->perf_counter_ctxp = next_ctx; + next->perf_counter_ctxp = ctx; + ctx->task = next; + next_ctx->task = task; + return; + } + __perf_counter_sched_out(ctx, cpuctx); cpuctx->task_ctx = NULL; @@ -998,6 +1057,8 @@ void perf_counter_task_sched_in(struct task_struct *task, int cpu) if (likely(!ctx)) return; + if (cpuctx->task_ctx == ctx) + return; __perf_counter_sched_in(ctx, cpuctx, cpu); cpuctx->task_ctx = ctx; } @@ -3253,6 +3314,16 @@ inherit_counter(struct perf_counter *parent_counter, return child_counter; /* + * Make the child state follow the state of the parent counter, + * not its hw_event.disabled bit. We hold the parent's mutex, + * so we won't race with perf_counter_{en,dis}able_family. + */ + if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) + child_counter->state = PERF_COUNTER_STATE_INACTIVE; + else + child_counter->state = PERF_COUNTER_STATE_OFF; + + /* * Link it up in the child's context: */ add_counter_to_ctx(child_counter, child_ctx); @@ -3277,16 +3348,6 @@ inherit_counter(struct perf_counter *parent_counter, mutex_lock(&parent_counter->mutex); list_add_tail(&child_counter->child_list, &parent_counter->child_list); - /* - * Make the child state follow the state of the parent counter, - * not its hw_event.disabled bit. We hold the parent's mutex, - * so we won't race with perf_counter_{en,dis}able_family. - */ - if (parent_counter->state >= PERF_COUNTER_STATE_INACTIVE) - child_counter->state = PERF_COUNTER_STATE_INACTIVE; - else - child_counter->state = PERF_COUNTER_STATE_OFF; - mutex_unlock(&parent_counter->mutex); return child_counter; @@ -3429,6 +3490,7 @@ void perf_counter_init_task(struct task_struct *child) struct perf_counter_context *child_ctx, *parent_ctx; struct perf_counter *counter; struct task_struct *parent = current; + int inherited_all = 1; child->perf_counter_ctxp = NULL; @@ -3463,12 +3525,31 @@ void perf_counter_init_task(struct task_struct *child) if (counter != counter->group_leader) continue; - if (!counter->hw_event.inherit) + if (!counter->hw_event.inherit) { + inherited_all = 0; continue; + } if (inherit_group(counter, parent, - parent_ctx, child, child_ctx)) + parent_ctx, child, child_ctx)) { + inherited_all = 0; break; + } + } + + if (inherited_all) { + /* + * Mark the child context as a clone of the parent + * context, or of whatever the parent is a clone of. 
+ */ + if (parent_ctx->parent_ctx) { + child_ctx->parent_ctx = parent_ctx->parent_ctx; + child_ctx->parent_gen = parent_ctx->parent_gen; + } else { + child_ctx->parent_ctx = parent_ctx; + child_ctx->parent_gen = parent_ctx->generation; + } + get_ctx(child_ctx->parent_ctx); } mutex_unlock(&parent_ctx->mutex); diff --git a/kernel/sched.c b/kernel/sched.c index 419a39d..4c0d58b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -5091,7 +5091,7 @@ need_resched_nonpreemptible: if (likely(prev != next)) { sched_info_switch(prev, next); - perf_counter_task_sched_out(prev, cpu); + perf_counter_task_sched_out(prev, next, cpu); rq->nr_switches++; rq->curr = next; ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts 2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras @ 2009-05-24 11:33 ` Ingo Molnar 2009-05-25 6:18 ` Paul Mackerras 0 siblings, 1 reply; 32+ messages in thread From: Ingo Molnar @ 2009-05-24 11:33 UTC (permalink / raw) To: mingo, hpa, paulus, acme, linux-kernel, a.p.zijlstra, mtosatti, tglx, cjashfor Cc: linux-tip-commits * tip-bot for Paul Mackerras <paulus@samba.org> wrote: > @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu) > > regs = task_pt_regs(task); > perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); > + > + next_ctx = next->perf_counter_ctxp; > + if (next_ctx && context_equiv(ctx, next_ctx)) { > + task->perf_counter_ctxp = next_ctx; > + next->perf_counter_ctxp = ctx; > + ctx->task = next; > + next_ctx->task = task; > + return; > + } there's one complication that this trick is causing - the migration counter relies on ctx->task to get per task migration stats: static inline u64 get_cpu_migrations(struct perf_counter *counter) { struct task_struct *curr = counter->ctx->task; if (curr) return curr->se.nr_migrations; return cpu_nr_migrations(smp_processor_id()); } as ctx->task is now jumping (while we keep the context), the migration stats are out of whack. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts 2009-05-24 11:33 ` Ingo Molnar @ 2009-05-25 6:18 ` Paul Mackerras 2009-05-25 6:54 ` Ingo Molnar 0 siblings, 1 reply; 32+ messages in thread From: Paul Mackerras @ 2009-05-25 6:18 UTC (permalink / raw) To: Ingo Molnar Cc: mingo, hpa, acme, linux-kernel, a.p.zijlstra, mtosatti, tglx, cjashfor, linux-tip-commits Ingo Molnar writes: > * tip-bot for Paul Mackerras <paulus@samba.org> wrote: > > > @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu) > > > > regs = task_pt_regs(task); > > perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); > > + > > + next_ctx = next->perf_counter_ctxp; > > + if (next_ctx && context_equiv(ctx, next_ctx)) { > > + task->perf_counter_ctxp = next_ctx; > > + next->perf_counter_ctxp = ctx; > > + ctx->task = next; > > + next_ctx->task = task; > > + return; > > + } > > there's one complication that this trick is causing - the migration > counter relies on ctx->task to get per task migration stats: > > static inline u64 get_cpu_migrations(struct perf_counter *counter) > { > struct task_struct *curr = counter->ctx->task; > > if (curr) > return curr->se.nr_migrations; > return cpu_nr_migrations(smp_processor_id()); > } > > as ctx->task is now jumping (while we keep the context), the > migration stats are out of whack. How did you notice this? The overall sum over all children should still be correct, though some individual children's counters could go negative, so the result of a read on the counter when some children have exited and others haven't could look a bit strange. Reading the counter after all children have exited should be fine, though. One of the effects of optimizing the context switch is that in general, reading the value of an inheritable counter when some children have exited but some are still running might produce results that include some of the activity of the still-running children and might not include all of the activity of the children that have exited. If that's a concern then we need to implement the "sync child counters" ioctl that has been suggested. As for the migration counter, it is the only software counter that is still using the "old" approach, i.e. it doesn't generate interrupts and it uses the counter->prev_state field (which I hope to eliminate one day). It's also the only software counter which counts events that happen while the task is not scheduled in. The cleanest thing would be to rewrite the migration counter code to have a callin from the scheduler when migrations happen. Paul. ^ permalink raw reply [flat|nested] 32+ messages in thread
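A worked example of the negative deltas Ingo saw, with purely illustrative numbers: the counter's previous count was sampled while counter->ctx->task pointed at one task, and the next read happens after the equivalent-context swap has pointed it at the other.

	u64 prev_count = 10;	/* task A's se.nr_migrations at the last read */
	/* equivalent-context switch: counter->ctx->task now points at task B */
	u64 new_count  = 3;	/* task B's se.nr_migrations */
	s64 delta      = new_count - prev_count;	/* -7: bogus per-child delta */

As Paul notes, these per-child errors cancel when the swapped pair's counts are summed, so the aggregate read after all children exit is still correct.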
* Re: [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts 2009-05-25 6:18 ` Paul Mackerras @ 2009-05-25 6:54 ` Ingo Molnar 0 siblings, 0 replies; 32+ messages in thread From: Ingo Molnar @ 2009-05-25 6:54 UTC (permalink / raw) To: Paul Mackerras Cc: mingo, hpa, acme, linux-kernel, a.p.zijlstra, mtosatti, tglx, cjashfor, linux-tip-commits * Paul Mackerras <paulus@samba.org> wrote: > Ingo Molnar writes: > > > * tip-bot for Paul Mackerras <paulus@samba.org> wrote: > > > > > @@ -885,6 +934,16 @@ void perf_counter_task_sched_out(struct task_struct *task, int cpu) > > > > > > regs = task_pt_regs(task); > > > perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); > > > + > > > + next_ctx = next->perf_counter_ctxp; > > > + if (next_ctx && context_equiv(ctx, next_ctx)) { > > > + task->perf_counter_ctxp = next_ctx; > > > + next->perf_counter_ctxp = ctx; > > > + ctx->task = next; > > > + next_ctx->task = task; > > > + return; > > > + } > > > > there's one complication that this trick is causing - the migration > > counter relies on ctx->task to get per task migration stats: > > > > static inline u64 get_cpu_migrations(struct perf_counter *counter) > > { > > struct task_struct *curr = counter->ctx->task; > > > > if (curr) > > return curr->se.nr_migrations; > > return cpu_nr_migrations(smp_processor_id()); > > } > > > > as ctx->task is now jumping (while we keep the context), the > > migration stats are out of whack. > > How did you notice this? The overall sum over all children should > still be correct, though some individual children's counters could > go negative, so the result of a read on the counter when some > children have exited and others haven't could look a bit strange. > Reading the counter after all children have exited should be fine, > though. i've noticed a few weirdnesses and then added a debug check and noticed the negative delta values. > One of the effects of optimizing the context switch is that in > general, reading the value of an inheritable counter when some > children have exited but some are still running might produce > results that include some of the activity of the still-running > children and might not include all of the activity of the children > that have exited. If that's a concern then we need to implement > the "sync child counters" ioctl that has been suggested. > > As for the migration counter, it is the only software counter that > is still using the "old" approach, i.e. it doesn't generate > interrupts and it uses the counter->prev_state field (which I hope > to eliminate one day). It's also the only software counter which > counts events that happen while the task is not scheduled in. The > cleanest thing would be to rewrite the migration counter code to > have a callin from the scheduler when migrations happen. I'll check with the debug check removed again. If the end result is OK then i dont think we need to worry much about this, at this stage. Ingo ^ permalink raw reply [flat|nested] 32+ messages in thread
* [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras ` (4 preceding siblings ...) 2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Optimize " tip-bot for Paul Mackerras @ 2009-05-22 10:36 ` tip-bot for Ingo Molnar 2009-05-22 13:46 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra 6 siblings, 0 replies; 32+ messages in thread From: tip-bot for Ingo Molnar @ 2009-05-22 10:36 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti, tglx, cjashfor, mingo Commit-ID: 910431c7f2e963017d767b29c80ae706421e569f Gitweb: http://git.kernel.org/tip/910431c7f2e963017d767b29c80ae706421e569f Author: Ingo Molnar <mingo@elte.hu> AuthorDate: Fri, 22 May 2009 12:32:15 +0200 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Fri, 22 May 2009 12:33:14 +0200 perf_counter: fix !PERF_COUNTERS build failure Update the !CONFIG_PERF_COUNTERS prototype too, for perf_counter_task_sched_out(). [ Impact: build fix ] Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/perf_counter.h | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h index 4cae01a..2eedae8 100644 --- a/include/linux/perf_counter.h +++ b/include/linux/perf_counter.h @@ -625,7 +625,8 @@ extern void perf_counter_init(void); static inline void perf_counter_task_sched_in(struct task_struct *task, int cpu) { } static inline void -perf_counter_task_sched_out(struct task_struct *task, int cpu) { } +perf_counter_task_sched_out(struct task_struct *task, + struct task_struct *next, int cpu) { } static inline void perf_counter_task_tick(struct task_struct *task, int cpu) { } static inline void perf_counter_init_task(struct task_struct *child) { } ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras ` (5 preceding siblings ...) 2009-05-22 10:36 ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar @ 2009-05-22 13:46 ` Peter Zijlstra 2009-05-25 0:15 ` Paul Mackerras 6 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2009-05-22 13:46 UTC (permalink / raw) To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner On Fri, 2009-05-22 at 14:27 +1000, Paul Mackerras wrote: > + next_ctx = next->perf_counter_ctxp; > + if (next_ctx && context_equiv(ctx, next_ctx)) { > + task->perf_counter_ctxp = next_ctx; > + next->perf_counter_ctxp = ctx; > + ctx->task = next; > + next_ctx->task = task; > + return; > + } Ingo just pointed out that there is nothing there to close the race with attaching a counter. That is, you could end up attaching your counter to the wrong task. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-22 13:46 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra @ 2009-05-25 0:15 ` Paul Mackerras 2009-05-25 10:38 ` Peter Zijlstra 0 siblings, 1 reply; 32+ messages in thread From: Paul Mackerras @ 2009-05-25 0:15 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner Peter Zijlstra writes: > Ingo just pointed out that there is nothing there to close the race with > attaching a counter. > > That is, you could end up attaching your counter to the wrong task. Good point. Doing the unclone in find_get_context would be a start, but the locking could get tricky (in fact we don't have any way to remove an inherited counter from a context, so we only have to worry about counters being attached). I'll work out a solution after I have digested your recent batch of patches. Paul. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-25 0:15 ` Paul Mackerras @ 2009-05-25 10:38 ` Peter Zijlstra 2009-05-25 10:50 ` Peter Zijlstra 2009-05-25 11:06 ` Paul Mackerras 0 siblings, 2 replies; 32+ messages in thread From: Peter Zijlstra @ 2009-05-25 10:38 UTC (permalink / raw) To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner On Mon, 2009-05-25 at 10:15 +1000, Paul Mackerras wrote: > Peter Zijlstra writes: > > > Ingo just pointed out that there is nothing there to close the race with > > attaching a counter. > > > > That is, you could end up attaching your counter to the wrong task. > > Good point. Doing the unclone in find_get_context would be a start, > but the locking could get tricky (in fact we don't have any way to > remove an inherited counter from a context, so we only have to worry > about counters being attached). I'll work out a solution after I have > digested your recent batch of patches. I'm currently staring at something like the below, trying to find races etc.. ;-) attach vs destroy vs flip --- Index: linux-2.6/kernel/perf_counter.c =================================================================== --- linux-2.6.orig/kernel/perf_counter.c +++ linux-2.6/kernel/perf_counter.c @@ -102,13 +102,29 @@ static void get_ctx(struct perf_counter_ atomic_inc(&ctx->refcount); } -static void put_ctx(struct perf_counter_context *ctx) +static void free_ctx_rcu(struct rcu_head *head) +{ + struct perf_counter_context *ctx; + + ctx = container_of(head, struct perf_counter_context, rcu_head); + kfree(ctx); +} + +static bool __put_ctx(struct perf_counter_context *ctx) { if (atomic_dec_and_test(&ctx->refcount)) { if (ctx->parent_ctx) put_ctx(ctx->parent_ctx); - kfree(ctx); + return true; } + + return false; +} + +static void put_ctx(struct perf_counter_context *ctx) +{ + if (__put_ctx(ctx)) + call_rcu(&ctx->rcu_head, free_ctx_rcu); } /* @@ -934,8 +950,16 @@ void perf_counter_task_sched_out(struct next_ctx = next->perf_counter_ctxp; if (next_ctx && context_equiv(ctx, next_ctx)) { + ctx->task = NULL; + next_ctx->task = NULL; + + smp_wmb(); + task->perf_counter_ctxp = next_ctx; next->perf_counter_ctxp = ctx; + + smp_wmb(); + ctx->task = next; next_ctx->task = task; return; @@ -1284,19 +1308,31 @@ static struct perf_counter_context *find return ERR_PTR(-EACCES); } + rcu_read_lock(); +again: ctx = task->perf_counter_ctxp; + /* + * matched against the xchg() in perf_counter_exit_task() setting + * ctx to NULL and the cmpxchg() below. + */ + smp_read_barrier_depends(); if (!ctx) { + rcu_read_unlock(); + /* + * cannot attach counters to a dying task. + */ + if (task->flags & PF_EXITING) { + put_task_struct(task); + return ERR_PTR(-ESRCH); + } + ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL); if (!ctx) { put_task_struct(task); return ERR_PTR(-ENOMEM); } __perf_counter_init_context(ctx, task); - /* - * Make sure other cpus see correct values for *ctx - * once task->perf_counter_ctxp is visible to them. 
- */ - smp_wmb(); + rcu_read_lock(); tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx); if (tctx) { /* @@ -1308,6 +1344,16 @@ static struct perf_counter_context *find } } + if (!atomic_inc_not_zero(&ctx->reference)) + goto again; + + if (rcu_dereference(ctx->task) != task) { + put_ctx(ctx); + goto again; + } + + rcu_read_unlock(); + return ctx; } @@ -1316,7 +1362,6 @@ static void free_counter_rcu(struct rcu_ struct perf_counter *counter; counter = container_of(head, struct perf_counter, rcu_head); - put_ctx(counter->ctx); kfree(counter); } @@ -1337,6 +1382,7 @@ static void free_counter(struct perf_cou if (counter->destroy) counter->destroy(counter); + put_ctx(counter->ctx); call_rcu(&counter->rcu_head, free_counter_rcu); } @@ -3231,6 +3277,7 @@ SYSCALL_DEFINE5(perf_counter_open, out_fput: fput_light(group_file, fput_needed); + put_ctx(ctx); return ret; @@ -3390,25 +3437,25 @@ __perf_counter_exit_task(struct task_str * * Note: we may be running in child context, but the PID is not hashed * anymore so new counters will not be added. - * (XXX not sure that is true when we get called from flush_old_exec. - * -- paulus) */ void perf_counter_exit_task(struct task_struct *child) { struct perf_counter *child_counter, *tmp; struct perf_counter_context *child_ctx; unsigned long flags; + bool free_ctx; WARN_ON_ONCE(child != current); - child_ctx = child->perf_counter_ctxp; + child_ctx = xchg(&child->perf_counter_ctxp, NULL); if (likely(!child_ctx)) return; + free_ctx = __put_ctx(child_ctx); + local_irq_save(flags); __perf_counter_task_sched_out(child_ctx); - child->perf_counter_ctxp = NULL; local_irq_restore(flags); mutex_lock(&child_ctx->mutex); @@ -3428,7 +3475,8 @@ again: mutex_unlock(&child_ctx->mutex); - put_ctx(child_ctx); + if (free_ctx) + call_rcu(&ctx->rcu_head, free_ctx_rcu); } /* ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-25 10:38 ` Peter Zijlstra @ 2009-05-25 10:50 ` Peter Zijlstra 2009-05-25 11:06 ` Paul Mackerras 1 sibling, 0 replies; 32+ messages in thread From: Peter Zijlstra @ 2009-05-25 10:50 UTC (permalink / raw) To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner On Mon, 2009-05-25 at 12:38 +0200, Peter Zijlstra wrote: > On Mon, 2009-05-25 at 10:15 +1000, Paul Mackerras wrote: > > Peter Zijlstra writes: > > > > > Ingo just pointed out that there is nothing there to close the race with > > > attaching a counter. > > > > > > That is, you could end up attaching your counter to the wrong task. > > > > Good point. Doing the unclone in find_get_context would be a start, > > but the locking could get tricky (in fact we don't have any way to > > remove an inherited counter from a context, so we only have to worry > > about counters being attached). I'll work out a solution after I have > > digested your recent batch of patches. > > I'm currently staring at something like the below, trying to find races > etc.. ;-) > > attach vs destroy vs flip Hmm, it appears we only unclone ctx on remove, not attach.. how odd.. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-25 10:38 ` Peter Zijlstra 2009-05-25 10:50 ` Peter Zijlstra @ 2009-05-25 11:06 ` Paul Mackerras 2009-05-25 11:27 ` Peter Zijlstra 1 sibling, 1 reply; 32+ messages in thread From: Paul Mackerras @ 2009-05-25 11:06 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner Peter Zijlstra writes: > I'm currently staring at something like the below, trying to find races > etc.. ;-) [snip] > next_ctx = next->perf_counter_ctxp; > if (next_ctx && context_equiv(ctx, next_ctx)) { > + ctx->task = NULL; > + next_ctx->task = NULL; Trouble is, ctx->task == NULL is used as an indication that this is a per-cpu context in various places. Also, in find_get_context, we have a lifetime problem because *ctx could get swapped and then freed underneath us immediately after we read task->perf_counter_ctxp. So we need a lock in the task_struct that stops sched_out from swapping the context underneath us. That led me to the patch below, which I'm about to test... :) It does the unclone in find_get_context; we don't actually need it on remove because we have no way to remove an inherited counter from a task without the task exiting. Paul. diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 353c0ac..96b0a73 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -110,6 +110,8 @@ extern struct cred init_cred; #ifdef CONFIG_PERF_COUNTERS # define INIT_PERF_COUNTERS(tsk) \ + .perf_counter_ctx_lock = \ + __SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx_lock), \ .perf_counter_mutex = \ __MUTEX_INITIALIZER(tsk.perf_counter_mutex), \ .perf_counter_list = LIST_HEAD_INIT(tsk.perf_counter_list), diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h index 2ddf5e3..77cbbe7 100644 --- a/include/linux/perf_counter.h +++ b/include/linux/perf_counter.h @@ -531,8 +531,8 @@ struct perf_counter_context { * been cloned (inherited) from a common ancestor. */ struct perf_counter_context *parent_ctx; - u32 parent_gen; - u32 generation; + u64 parent_gen; + u64 generation; }; /** diff --git a/include/linux/sched.h b/include/linux/sched.h index bc9326d..717830e 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1389,6 +1389,7 @@ struct task_struct { #endif #ifdef CONFIG_PERF_COUNTERS struct perf_counter_context *perf_counter_ctxp; + spinlock_t perf_counter_ctx_lock; struct mutex perf_counter_mutex; struct list_head perf_counter_list; #endif diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c index cb40625..4c19404 100644 --- a/kernel/perf_counter.c +++ b/kernel/perf_counter.c @@ -211,22 +211,6 @@ group_sched_out(struct perf_counter *group_counter, } /* - * Mark this context as not being a clone of another. - * Called when counters are added to or removed from this context. - * We also increment our generation number so that anything that - * was cloned from this context before this will not match anything - * cloned from this context after this. - */ -static void unclone_ctx(struct perf_counter_context *ctx) -{ - ++ctx->generation; - if (!ctx->parent_ctx) - return; - put_ctx(ctx->parent_ctx); - ctx->parent_ctx = NULL; -} - -/* * Cross CPU call to remove a performance counter * * We disable the counter on the hardware level first. After that we @@ -280,13 +264,16 @@ static void __perf_counter_remove_from_context(void *info) * * CPU counters are removed with a smp call. 
For task counters we only * call when the task is on a CPU. + * + * If counter->ctx is a cloned context, callers must make sure that + * every task struct that counter->ctx->task could possibly point to + * remains valid. */ static void perf_counter_remove_from_context(struct perf_counter *counter) { struct perf_counter_context *ctx = counter->ctx; struct task_struct *task = ctx->task; - unclone_ctx(ctx); if (!task) { /* * Per cpu counters are removed via an smp call and @@ -409,6 +396,10 @@ static void __perf_counter_disable(void *info) /* * Disable a counter. + * + * If counter->ctx is a cloned context, callers must make sure that + * every task struct that counter->ctx->task could possibly point to + * remains valid. */ static void perf_counter_disable(struct perf_counter *counter) { @@ -793,6 +784,10 @@ static void __perf_counter_enable(void *info) /* * Enable a counter. + * + * If counter->ctx is a cloned context, callers must make sure that + * every task struct that counter->ctx->task could possibly point to + * remains valid. */ static void perf_counter_enable(struct perf_counter *counter) { @@ -932,12 +927,32 @@ void perf_counter_task_sched_out(struct task_struct *task, regs = task_pt_regs(task); perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); + /* + * Note: task->perf_counter_ctxp and next->perf_counter_ctxp + * can't change underneath us here if we see them non-NULL, + * because this is the only place where we change which + * context a task points to, and the scheduler will ensure + * that this code isn't being called for either of these tasks + * on any other cpu at the same time. + */ next_ctx = next->perf_counter_ctxp; if (next_ctx && context_equiv(ctx, next_ctx)) { + /* + * Lock the contexts of both the old and new tasks so that we + * don't change things underneath find_get_context etc. + * We don't need to be careful about the order in which we + * lock the tasks because no other cpu could be trying to lock + * both of these tasks -- this is the only place where we lock + * two tasks. + */ + spin_lock(&task->perf_counter_ctx_lock); + spin_lock(&next->perf_counter_ctx_lock); task->perf_counter_ctxp = next_ctx; next->perf_counter_ctxp = ctx; ctx->task = next; next_ctx->task = task; + spin_unlock(&next->perf_counter_ctx_lock); + spin_unlock(&task->perf_counter_ctx_lock); return; } @@ -1067,6 +1082,7 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu) __perf_counter_sched_in(ctx, cpuctx, cpu); } +// XXX doesn't do inherited counters too? int perf_counter_task_enable(void) { struct perf_counter *counter; @@ -1079,6 +1095,7 @@ int perf_counter_task_enable(void) return 0; } +// XXX doesn't do inherited counters too? int perf_counter_task_disable(void) { struct perf_counter *counter; @@ -1108,6 +1125,7 @@ static void perf_adjust_freq(struct perf_counter_context *ctx) if (!counter->hw_event.freq || !counter->hw_event.irq_freq) continue; + // XXX does this assume we got a whole tick worth of counts? 
events = HZ * counter->hw.interrupts * counter->hw.irq_period; period = div64_u64(events, counter->hw_event.irq_freq); @@ -1238,7 +1256,6 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu) { struct perf_cpu_context *cpuctx; struct perf_counter_context *ctx; - struct perf_counter_context *tctx; struct task_struct *task; /* @@ -1284,30 +1301,42 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu) return ERR_PTR(-EACCES); } + /* + * Taking this lock stops the context-switch code from switching + * our context to another task. We also use the lock to solve + * the race where two tasks allocate a context at the same time. + */ + spin_lock(&task->perf_counter_ctx_lock); ctx = task->perf_counter_ctxp; if (!ctx) { + spin_unlock(&task->perf_counter_ctx_lock); ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL); if (!ctx) { put_task_struct(task); return ERR_PTR(-ENOMEM); } __perf_counter_init_context(ctx, task); - /* - * Make sure other cpus see correct values for *ctx - * once task->perf_counter_ctxp is visible to them. - */ - smp_wmb(); - tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx); - if (tctx) { + spin_lock(&task->perf_counter_ctx_lock); + if (task->perf_counter_ctxp) { /* * We raced with some other task; use * the context they set. */ kfree(ctx); - ctx = tctx; + ctx = task->perf_counter_ctxp; } } + /* + * Since we're about to add another counter to this context, + * it won't be a clone of another context any longer. + */ + ++ctx->generation; + if (ctx->parent_ctx) { + put_ctx(ctx->parent_ctx); + ctx->parent_ctx = NULL; + } + return ctx; } @@ -1360,7 +1389,7 @@ static int perf_release(struct inode *inode, struct file *file) put_task_struct(counter->owner); free_counter(counter); - put_context(ctx); + put_context(ctx); // XXX use after free? return 0; } @@ -1382,7 +1411,7 @@ perf_read_hw(struct perf_counter *counter, char __user *buf, size_t count) if (counter->state == PERF_COUNTER_STATE_ERROR) return 0; - mutex_lock(&counter->child_mutex); + mutex_lock(&counter->child_mutex); // XXX doesn't exclude sync_child_counter, so why? values[0] = perf_counter_read(counter); n = 1; if (counter->hw_event.read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) @@ -1450,6 +1479,12 @@ static void perf_counter_for_each_sibling(struct perf_counter *counter, mutex_unlock(&ctx->mutex); } +/* + * Holding the top-level counter's child_mutex means that any + * descendant process that has inherited this counter will block + * in sync_child_counter, thus satisfying the task existence requirements + * of perf_counter_enable/disable. + */ static void perf_counter_for_each_child(struct perf_counter *counter, void (*func)(struct perf_counter *)) { @@ -3395,15 +3430,14 @@ void perf_counter_exit_task(struct task_struct *child) WARN_ON_ONCE(child != current); - child_ctx = child->perf_counter_ctxp; - - if (likely(!child_ctx)) + if (likely(!child->perf_counter_ctxp)) return; - local_irq_save(flags); + spin_lock_irqsave(&child->perf_counter_ctx_lock, flags); + child_ctx = child->perf_counter_ctxp; __perf_counter_task_sched_out(child_ctx); child->perf_counter_ctxp = NULL; - local_irq_restore(flags); + spin_unlock_irqrestore(&child->perf_counter_ctx_lock, flags); mutex_lock(&child_ctx->mutex); ^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts 2009-05-25 11:06 ` Paul Mackerras @ 2009-05-25 11:27 ` Peter Zijlstra 0 siblings, 0 replies; 32+ messages in thread From: Peter Zijlstra @ 2009-05-25 11:27 UTC (permalink / raw) To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner On Mon, 2009-05-25 at 21:06 +1000, Paul Mackerras wrote: > Peter Zijlstra writes: > > > I'm currently staring at something like the below, trying to find races > > etc.. ;-) > [snip] > > next_ctx = next->perf_counter_ctxp; > > if (next_ctx && context_equiv(ctx, next_ctx)) { > > + ctx->task = NULL; > > + next_ctx->task = NULL; > > Trouble is, ctx->task == NULL is used as an indication that this is a > per-cpu context in various places. Yeah, already realized that, its enough to simply put them to the new task, before flipping the context pointers. > Also, in find_get_context, we have a lifetime problem because *ctx > could get swapped and then freed underneath us immediately after we > read task->perf_counter_ctxp. So we need a lock in the task_struct > that stops sched_out from swapping the context underneath us. That > led me to the patch below, which I'm about to test... :) It does the > unclone in find_get_context; we don't actually need it on remove > because we have no way to remove an inherited counter from a task > without the task exiting. Right, I was trying to solve the lifetime issues with RCU and refcount tricks and the races with ordering, instead of adding another lock. > @@ -932,12 +927,32 @@ void perf_counter_task_sched_out(struct task_struct *task, > regs = task_pt_regs(task); > perf_swcounter_event(PERF_COUNT_CONTEXT_SWITCHES, 1, 1, regs, 0); > > + /* > + * Note: task->perf_counter_ctxp and next->perf_counter_ctxp > + * can't change underneath us here if we see them non-NULL, > + * because this is the only place where we change which > + * context a task points to, and the scheduler will ensure > + * that this code isn't being called for either of these tasks > + * on any other cpu at the same time. > + */ > next_ctx = next->perf_counter_ctxp; > if (next_ctx && context_equiv(ctx, next_ctx)) { > + /* > + * Lock the contexts of both the old and new tasks so that we > + * don't change things underneath find_get_context etc. > + * We don't need to be careful about the order in which we > + * lock the tasks because no other cpu could be trying to lock > + * both of these tasks -- this is the only place where we lock > + * two tasks. > + */ > + spin_lock(&task->perf_counter_ctx_lock); > + spin_lock(&next->perf_counter_ctx_lock); > task->perf_counter_ctxp = next_ctx; > next->perf_counter_ctxp = ctx; > ctx->task = next; > next_ctx->task = task; > + spin_unlock(&next->perf_counter_ctx_lock); > + spin_unlock(&task->perf_counter_ctx_lock); > return; > } FWIW that nested lock will make lockdep complain -- it can't deadlock since we're under rq->lock and the tasks can't be stolen from the rq in that case. So you can silence lockdep by using spin_lock_nested_lock() > @@ -1067,6 +1082,7 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu) > __perf_counter_sched_in(ctx, cpuctx, cpu); > } > > +// XXX doesn't do inherited counters too? > int perf_counter_task_enable(void) > { > struct perf_counter *counter; Good point,.. perf_counter_for_each_child(counter, perf_counter_disable) should fix that I think. 
> @@ -1360,7 +1389,7 @@ static int perf_release(struct inode *inode, struct file *file) > put_task_struct(counter->owner); > > free_counter(counter); > - put_context(ctx); > + put_context(ctx); // XXX use after free? > > return 0; > } don't htink so, but will have a look. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] 2009-05-22 4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras @ 2009-05-22 8:06 ` Peter Zijlstra 2009-05-22 9:30 ` Paul Mackerras 2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct tip-bot for Paul Mackerras 2 siblings, 1 reply; 32+ messages in thread From: Peter Zijlstra @ 2009-05-22 8:06 UTC (permalink / raw) To: Paul Mackerras; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner On Fri, 2009-05-22 at 14:17 +1000, Paul Mackerras wrote: > + /* > + * Make sure other cpus see correct values for *ctx > + * once task->perf_counter_ctxp is visible to them. > + */ > + smp_wmb(); > + tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx); Documentation/memory-barriers.txt: > Any atomic operation that modifies some state in memory and returns information > about the state (old or new) implies an SMP-conditional general memory barrier > (smp_mb()) on each side of the actual operation (with the exception of > explicit lock operations, described later). These include: > cmpxchg(); ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] 2009-05-22 8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra @ 2009-05-22 9:30 ` Paul Mackerras 0 siblings, 0 replies; 32+ messages in thread From: Paul Mackerras @ 2009-05-22 9:30 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Corey Ashford, Thomas Gleixner Peter Zijlstra writes: > On Fri, 2009-05-22 at 14:17 +1000, Paul Mackerras wrote: > > + /* > > + * Make sure other cpus see correct values for *ctx > > + * once task->perf_counter_ctxp is visible to them. > > + */ > > + smp_wmb(); > > + tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx); > > > Documentation/memory-barriers.txt: > > > Any atomic operation that modifies some state in memory and returns information > > about the state (old or new) implies an SMP-conditional general memory barrier > > (smp_mb()) on each side of the actual operation (with the exception of > > explicit lock operations, described later). These include: > > > cmpxchg(); OK, fair enough, I'll take it out. Paul. ^ permalink raw reply [flat|nested] 32+ messages in thread
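With the barrier dropped, the allocation path in find_get_context reduces to roughly the following (a sketch of the agreed change based on the code quoted above, not the committed diff); cmpxchg() already orders the context initialization before the pointer publish:

	ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
	if (!ctx) {
		put_task_struct(task);
		return ERR_PTR(-ENOMEM);
	}
	__perf_counter_init_context(ctx, task);
	/*
	 * No explicit smp_wmb() needed: cmpxchg() implies a full memory
	 * barrier on both sides, so the initialized *ctx is visible to
	 * other cpus before the pointer is.
	 */
	tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
	if (tctx) {
		/* we raced with another task installing a context; use theirs */
		kfree(ctx);
		ctx = tctx;
	}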
* [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct 2009-05-22 4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras 2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras 2009-05-22 8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra @ 2009-05-22 10:27 ` tip-bot for Paul Mackerras 2 siblings, 0 replies; 32+ messages in thread From: tip-bot for Paul Mackerras @ 2009-05-22 10:27 UTC (permalink / raw) To: linux-tip-commits Cc: linux-kernel, acme, paulus, hpa, mingo, a.p.zijlstra, mtosatti, tglx, cjashfor, mingo Commit-ID: a63eaf34ae60bdb067a354cc8def2e8f4a01f5f4 Gitweb: http://git.kernel.org/tip/a63eaf34ae60bdb067a354cc8def2e8f4a01f5f4 Author: Paul Mackerras <paulus@samba.org> AuthorDate: Fri, 22 May 2009 14:17:31 +1000 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Fri, 22 May 2009 12:18:19 +0200 perf_counter: Dynamically allocate tasks' perf_counter_context struct This replaces the struct perf_counter_context in the task_struct with a pointer to a dynamically allocated perf_counter_context struct. The main reason for doing is this is to allow us to transfer a perf_counter_context from one task to another when we do lazy PMU switching in a later patch. This has a few side-benefits: the task_struct becomes a little smaller, we save some memory because only tasks that have perf_counters attached get a perf_counter_context allocated for them, and we can remove the inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't end up recompiling nearly everything whenever perf_counter.h changes. The perf_counter_context structures are reference-counted and freed when the last reference is dropped. A context can have references from its task and the counters on its task. Counters can outlive the task so it is possible that a context will be freed well after its task has exited. Contexts are allocated on fork if the parent had a context, or otherwise the first time that a per-task counter is created on a task. In the latter case, we set the context pointer in the task struct locklessly using an atomic compare-and-exchange operation in case we raced with some other task in creating a context for the subject task. This also removes the task pointer from the perf_counter struct. The task pointer was not used anywhere and would make it harder to move a context from one task to another. Anything that needed to know which task a counter was attached to was already using counter->ctx->task. The __perf_counter_init_context function moves up in perf_counter.c so that it can be called from find_get_context, and now initializes the refcount, but is otherwise unchanged. We were potentially calling list_del_counter twice: once from __perf_counter_exit_task when the task exits and once from __perf_counter_remove_from_context when the counter's fd gets closed. This adds a check in list_del_counter so it doesn't do anything if the counter has already been removed from the lists. Since perf_counter_task_sched_in doesn't do anything if the task doesn't have a context, and leaves cpuctx->task_ctx = NULL, this adds code to __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in the case where the current task adds the first counter to itself and thus creates a context for itself. 

This also adds similar code to __perf_counter_enable to handle a
similar situation which can arise when the counters have been disabled
using prctl; that also leaves cpuctx->task_ctx = NULL.

[ Impact: refactor counter context management to prepare for new feature ]

Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/kernel/apic/apic.c  |    1 +
 include/linux/init_task.h    |   13 ---
 include/linux/perf_counter.h |    4 +-
 include/linux/sched.h        |    6 +-
 kernel/exit.c                |    3 +-
 kernel/fork.c                |    1 +
 kernel/perf_counter.c        |  218 ++++++++++++++++++++++++++----------------
 7 files changed, 145 insertions(+), 101 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index e9021a9..b4f6440 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -14,6 +14,7 @@
  *	Mikael Pettersson	:	PM converted to driver model.
  */
 
+#include <linux/perf_counter.h>
 #include <linux/kernel_stat.h>
 #include <linux/mc146818rtc.h>
 #include <linux/acpi_pmtmr.h>
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 503afaa..d87247d 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -108,18 +108,6 @@ extern struct group_info init_groups;
 
 extern struct cred init_cred;
 
-#ifdef CONFIG_PERF_COUNTERS
-# define INIT_PERF_COUNTERS(tsk)					\
-	.perf_counter_ctx.counter_list =				\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.counter_list),	\
-	.perf_counter_ctx.event_list =					\
-		LIST_HEAD_INIT(tsk.perf_counter_ctx.event_list),	\
-	.perf_counter_ctx.lock =					\
-		__SPIN_LOCK_UNLOCKED(tsk.perf_counter_ctx.lock),
-#else
-# define INIT_PERF_COUNTERS(tsk)
-#endif
-
 /*
  * INIT_TASK is used to set up the first task table, touch at
  * your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -183,7 +171,6 @@ extern struct cred init_cred;
 	},							\
 	.dirties = INIT_PROP_LOCAL_SINGLE(dirties),		\
 	INIT_IDS						\
-	INIT_PERF_COUNTERS(tsk)					\
 	INIT_TRACE_IRQFLAGS					\
 	INIT_LOCKDEP						\
 	INIT_FTRACE_GRAPH					\
diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index f612941..0713090 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -449,7 +449,6 @@ struct perf_counter {
 	struct hw_perf_counter		hw;
 
 	struct perf_counter_context	*ctx;
-	struct task_struct		*task;
 	struct file			*filp;
 
 	struct perf_counter		*parent;
@@ -498,7 +497,6 @@ struct perf_counter {
  * Used as a container for task counters and CPU counters as well:
  */
 struct perf_counter_context {
-#ifdef CONFIG_PERF_COUNTERS
 	/*
 	 * Protect the states of the counters in the list,
 	 * nr_active, and the list:
@@ -516,6 +514,7 @@ struct perf_counter_context {
 	int			nr_counters;
 	int			nr_active;
 	int			is_active;
+	atomic_t		refcount;
 	struct task_struct	*task;
 
 	/*
@@ -523,7 +522,6 @@ struct perf_counter_context {
 	 */
 	u64			time;
 	u64			timestamp;
-#endif
 };
 
 /**
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ff59d12..9714d45 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -71,7 +71,6 @@ struct sched_param {
 #include <linux/path.h>
 #include <linux/compiler.h>
 #include <linux/completion.h>
-#include <linux/perf_counter.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
 #include <linux/topology.h>
@@ -99,6 +98,7 @@ struct robust_list_head;
 struct bio;
 struct bts_tracer;
 struct fs_struct;
+struct perf_counter_context;
 
 /*
  * List of flags we want to share for kernel threads,
@@ -1387,7 +1387,9 @@ struct task_struct {
 	struct list_head pi_state_list;
 	struct futex_pi_state *pi_state_cache;
 #endif
-	struct perf_counter_context perf_counter_ctx;
+#ifdef CONFIG_PERF_COUNTERS
+	struct perf_counter_context *perf_counter_ctxp;
+#endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *mempolicy;
 	short il_next;
diff --git a/kernel/exit.c b/kernel/exit.c
index f9dfedd..99ad406 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -48,6 +48,7 @@
 #include <linux/tracehook.h>
 #include <linux/fs_struct.h>
 #include <linux/init_task.h>
+#include <linux/perf_counter.h>
 #include <trace/sched.h>
 
 #include <asm/uaccess.h>
@@ -159,7 +160,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	struct task_struct *tsk = container_of(rhp, struct task_struct, rcu);
 
 #ifdef CONFIG_PERF_COUNTERS
-	WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));
+	WARN_ON_ONCE(tsk->perf_counter_ctxp);
 #endif
 	trace_sched_process_free(tsk);
 	put_task_struct(tsk);
diff --git a/kernel/fork.c b/kernel/fork.c
index d32fef4..e72a09f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -63,6 +63,7 @@
 #include <linux/fs_struct.h>
 #include <trace/sched.h>
 #include <linux/magic.h>
+#include <linux/perf_counter.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index 08584c1..06ea3ea 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -97,6 +97,17 @@ void perf_enable(void)
 	hw_perf_enable();
 }
 
+static void get_ctx(struct perf_counter_context *ctx)
+{
+	atomic_inc(&ctx->refcount);
+}
+
+static void put_ctx(struct perf_counter_context *ctx)
+{
+	if (atomic_dec_and_test(&ctx->refcount))
+		kfree(ctx);
+}
+
 static void
 list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
@@ -118,11 +129,17 @@ list_add_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 	ctx->nr_counters++;
 }
 
+/*
+ * Remove a counter from the lists for its context.
+ * Must be called with counter->mutex and ctx->mutex held.
+ */
 static void
 list_del_counter(struct perf_counter *counter, struct perf_counter_context *ctx)
 {
 	struct perf_counter *sibling, *tmp;
 
+	if (list_empty(&counter->list_entry))
+		return;
 	ctx->nr_counters--;
 
 	list_del_init(&counter->list_entry);
@@ -216,8 +233,6 @@ static void __perf_counter_remove_from_context(void *info)
 
 	counter_sched_out(counter, cpuctx, ctx);
 
-	counter->task = NULL;
-
 	list_del_counter(counter, ctx);
 
 	if (!ctx->task) {
@@ -279,7 +294,6 @@ retry:
 	 */
 	if (!list_empty(&counter->list_entry)) {
 		list_del_counter(counter, ctx);
-		counter->task = NULL;
 	}
 	spin_unlock_irq(&ctx->lock);
 }
@@ -568,11 +582,17 @@ static void __perf_install_in_context(void *info)
 	 * If this is a task context, we need to check whether it is
 	 * the current task context of this cpu. If not it has been
 	 * scheduled out before the smp call arrived.
+	 * Or possibly this is the right context but it isn't
+	 * on this cpu because it had no counters.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	/*
@@ -653,7 +673,6 @@ perf_install_in_context(struct perf_counter_context *ctx,
 		return;
 	}
 
-	counter->task = task;
 retry:
 	task_oncpu_function_call(task, __perf_install_in_context,
 				 counter);
@@ -693,10 +712,14 @@ static void __perf_counter_enable(void *info)
 	 * If this is a per-task counter, need to check whether this
 	 * counter's task is the current task on this cpu.
 	 */
-	if (ctx->task && cpuctx->task_ctx != ctx)
-		return;
+	if (ctx->task && cpuctx->task_ctx != ctx) {
+		if (cpuctx->task_ctx || ctx->task != current)
+			return;
+		cpuctx->task_ctx = ctx;
+	}
 
 	spin_lock_irqsave(&ctx->lock, flags);
+	ctx->is_active = 1;
 	update_context_time(ctx);
 
 	counter->prev_state = counter->state;
@@ -852,10 +875,10 @@ void __perf_counter_sched_out(struct perf_counter_context *ctx,
 void perf_counter_task_sched_out(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 	struct pt_regs *regs;
 
-	if (likely(!cpuctx->task_ctx))
+	if (likely(!ctx || !cpuctx->task_ctx))
 		return;
 
 	update_context_time(ctx);
@@ -871,6 +894,8 @@ static void __perf_counter_task_sched_out(struct perf_counter_context *ctx)
 {
 	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 
+	if (!cpuctx->task_ctx)
+		return;
 	__perf_counter_sched_out(ctx, cpuctx);
 	cpuctx->task_ctx = NULL;
 }
@@ -969,8 +994,10 @@ __perf_counter_sched_in(struct perf_counter_context *ctx,
 void perf_counter_task_sched_in(struct task_struct *task, int cpu)
 {
 	struct perf_cpu_context *cpuctx = &per_cpu(perf_cpu_context, cpu);
-	struct perf_counter_context *ctx = &task->perf_counter_ctx;
+	struct perf_counter_context *ctx = task->perf_counter_ctxp;
 
+	if (likely(!ctx))
+		return;
 	__perf_counter_sched_in(ctx, cpuctx, cpu);
 	cpuctx->task_ctx = ctx;
 }
@@ -985,11 +1012,11 @@ static void perf_counter_cpu_sched_in(struct perf_cpu_context *cpuctx, int cpu)
 int perf_counter_task_disable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1020,12 +1047,12 @@ int perf_counter_task_disable(void)
 int perf_counter_task_enable(void)
 {
 	struct task_struct *curr = current;
-	struct perf_counter_context *ctx = &curr->perf_counter_ctx;
+	struct perf_counter_context *ctx = curr->perf_counter_ctxp;
 	struct perf_counter *counter;
 	unsigned long flags;
 	int cpu;
 
-	if (likely(!ctx->nr_counters))
+	if (!ctx || !ctx->nr_counters)
 		return 0;
 
 	local_irq_save(flags);
@@ -1128,19 +1155,23 @@ void perf_counter_task_tick(struct task_struct *curr, int cpu)
 		return;
 
 	cpuctx = &per_cpu(perf_cpu_context, cpu);
-	ctx = &curr->perf_counter_ctx;
+	ctx = curr->perf_counter_ctxp;
 
 	perf_adjust_freq(&cpuctx->ctx);
-	perf_adjust_freq(ctx);
+	if (ctx)
+		perf_adjust_freq(ctx);
 
 	perf_counter_cpu_sched_out(cpuctx);
-	__perf_counter_task_sched_out(ctx);
+	if (ctx)
+		__perf_counter_task_sched_out(ctx);
 
 	rotate_ctx(&cpuctx->ctx);
-	rotate_ctx(ctx);
+	if (ctx)
+		rotate_ctx(ctx);
 
 	perf_counter_cpu_sched_in(cpuctx, cpu);
-	perf_counter_task_sched_in(curr, cpu);
+	if (ctx)
+		perf_counter_task_sched_in(curr, cpu);
 }
 
 /*
@@ -1176,6 +1207,22 @@ static u64 perf_counter_read(struct perf_counter *counter)
 	return atomic64_read(&counter->count);
 }
 
+/*
+ * Initialize the perf_counter context in a task_struct:
+ */
+static void
+__perf_counter_init_context(struct perf_counter_context *ctx,
+			    struct task_struct *task)
+{
+	memset(ctx, 0, sizeof(*ctx));
+	spin_lock_init(&ctx->lock);
+	mutex_init(&ctx->mutex);
+	INIT_LIST_HEAD(&ctx->counter_list);
+	INIT_LIST_HEAD(&ctx->event_list);
+	atomic_set(&ctx->refcount, 1);
+	ctx->task = task;
+}
+
 static void put_context(struct perf_counter_context *ctx)
 {
 	if (ctx->task)
@@ -1186,6 +1233,7 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_counter_context *ctx;
+	struct perf_counter_context *tctx;
 	struct task_struct *task;
 
 	/*
@@ -1225,15 +1273,36 @@ static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
 	if (!task)
 		return ERR_PTR(-ESRCH);
 
-	ctx = &task->perf_counter_ctx;
-	ctx->task = task;
-
 	/* Reuse ptrace permission checks for now. */
 	if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
-		put_context(ctx);
+		put_task_struct(task);
 		return ERR_PTR(-EACCES);
 	}
 
+	ctx = task->perf_counter_ctxp;
+	if (!ctx) {
+		ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+		if (!ctx) {
+			put_task_struct(task);
+			return ERR_PTR(-ENOMEM);
+		}
+		__perf_counter_init_context(ctx, task);
+		/*
+		 * Make sure other cpus see correct values for *ctx
+		 * once task->perf_counter_ctxp is visible to them.
+		 */
+		smp_wmb();
+		tctx = cmpxchg(&task->perf_counter_ctxp, NULL, ctx);
+		if (tctx) {
+			/*
+			 * We raced with some other task; use
+			 * the context they set.
+			 */
+			kfree(ctx);
+			ctx = tctx;
+		}
+	}
+
 	return ctx;
 }
 
@@ -1242,6 +1311,7 @@ static void free_counter_rcu(struct rcu_head *head)
 	struct perf_counter *counter;
 
 	counter = container_of(head, struct perf_counter, rcu_head);
+	put_ctx(counter->ctx);
 	kfree(counter);
 }
 
@@ -2247,7 +2317,7 @@ static void perf_counter_comm_event(struct perf_comm_event *comm_event)
 	perf_counter_comm_ctx(&cpuctx->ctx, comm_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_comm_ctx(&current->perf_counter_ctx, comm_event);
+	perf_counter_comm_ctx(current->perf_counter_ctxp, comm_event);
 }
 
 void perf_counter_comm(struct task_struct *task)
@@ -2256,7 +2326,9 @@ void perf_counter_comm(struct task_struct *task)
 
 	if (!atomic_read(&nr_comm_tracking))
 		return;
-
+	if (!current->perf_counter_ctxp)
+		return;
+
 	comm_event = (struct perf_comm_event){
 		.task	= task,
 		.event  = {
@@ -2372,7 +2444,7 @@ got_name:
 	perf_counter_mmap_ctx(&cpuctx->ctx, mmap_event);
 	put_cpu_var(perf_cpu_context);
 
-	perf_counter_mmap_ctx(&current->perf_counter_ctx, mmap_event);
+	perf_counter_mmap_ctx(current->perf_counter_ctxp, mmap_event);
 
 	kfree(buf);
 }
@@ -2384,6 +2456,8 @@ void perf_counter_mmap(unsigned long addr, unsigned long len,
 
 	if (!atomic_read(&nr_mmap_tracking))
 		return;
+	if (!current->perf_counter_ctxp)
+		return;
 
 	mmap_event = (struct perf_mmap_event){
 		.file   = file,
@@ -2985,6 +3059,7 @@ perf_counter_alloc(struct perf_counter_hw_event *hw_event,
 	counter->group_leader		= group_leader;
 	counter->pmu			= NULL;
 	counter->ctx			= ctx;
+	get_ctx(ctx);
 
 	counter->state = PERF_COUNTER_STATE_INACTIVE;
 	if (hw_event->disabled)
@@ -3150,21 +3225,6 @@ err_put_context:
 }
 
 /*
- * Initialize the perf_counter context in a task_struct:
- */
-static void
-__perf_counter_init_context(struct perf_counter_context *ctx,
-			    struct task_struct *task)
-{
-	memset(ctx, 0, sizeof(*ctx));
-	spin_lock_init(&ctx->lock);
-	mutex_init(&ctx->mutex);
-	INIT_LIST_HEAD(&ctx->counter_list);
-	INIT_LIST_HEAD(&ctx->event_list);
-	ctx->task = task;
-}
-
-/*
  * inherit a counter from parent task to child task:
  */
 static struct perf_counter *
@@ -3195,7 +3255,6 @@ inherit_counter(struct perf_counter *parent_counter,
 	/*
 	 * Link it up in the child's context:
 	 */
-	child_counter->task = child;
 	add_counter_to_ctx(child_counter, child_ctx);
 
 	child_counter->parent = parent_counter;
@@ -3294,40 +3353,15 @@ __perf_counter_exit_task(struct task_struct *child,
 	struct perf_counter *parent_counter;
 
 	/*
-	 * If we do not self-reap then we have to wait for the
-	 * child task to unschedule (it will happen for sure),
-	 * so that its counter is at its final count. (This
-	 * condition triggers rarely - child tasks usually get
-	 * off their CPU before the parent has a chance to
-	 * get this far into the reaping action)
+	 * Protect against concurrent operations on child_counter
+	 * due its fd getting closed, etc.
 	 */
-	if (child != current) {
-		wait_task_inactive(child, 0);
-		update_counter_times(child_counter);
-		list_del_counter(child_counter, child_ctx);
-	} else {
-		struct perf_cpu_context *cpuctx;
-		unsigned long flags;
-
-		/*
-		 * Disable and unlink this counter.
-		 *
-		 * Be careful about zapping the list - IRQ/NMI context
-		 * could still be processing it:
-		 */
-		local_irq_save(flags);
-		perf_disable();
-
-		cpuctx = &__get_cpu_var(perf_cpu_context);
+	mutex_lock(&child_counter->mutex);
 
-		group_sched_out(child_counter, cpuctx, child_ctx);
-		update_counter_times(child_counter);
+	update_counter_times(child_counter);
+	list_del_counter(child_counter, child_ctx);
 
-		list_del_counter(child_counter, child_ctx);
-
-		perf_enable();
-		local_irq_restore(flags);
-	}
+	mutex_unlock(&child_counter->mutex);
 
 	parent_counter = child_counter->parent;
 	/*
@@ -3346,19 +3380,29 @@ __perf_counter_exit_task(struct task_struct *child,
  *
  * Note: we may be running in child context, but the PID is not hashed
  * anymore so new counters will not be added.
+ * (XXX not sure that is true when we get called from flush_old_exec.
+ *  -- paulus)
  */
 void perf_counter_exit_task(struct task_struct *child)
 {
 	struct perf_counter *child_counter, *tmp;
 	struct perf_counter_context *child_ctx;
+	unsigned long flags;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = &child->perf_counter_ctx;
+	child_ctx = child->perf_counter_ctxp;
 
-	if (likely(!child_ctx->nr_counters))
+	if (likely(!child_ctx))
 		return;
 
+	local_irq_save(flags);
+	__perf_counter_task_sched_out(child_ctx);
+	child->perf_counter_ctxp = NULL;
+	local_irq_restore(flags);
+
+	mutex_lock(&child_ctx->mutex);
+
 again:
 	list_for_each_entry_safe(child_counter, tmp, &child_ctx->counter_list,
 				 list_entry)
@@ -3371,6 +3415,10 @@ again:
 	 */
 	if (!list_empty(&child_ctx->counter_list))
 		goto again;
+
+	mutex_unlock(&child_ctx->mutex);
+
+	put_ctx(child_ctx);
 }
 
 /*
@@ -3382,19 +3430,25 @@ void perf_counter_init_task(struct task_struct *child)
 	struct perf_counter *counter;
 	struct task_struct *parent = current;
 
-	child_ctx  =  &child->perf_counter_ctx;
-	parent_ctx = &parent->perf_counter_ctx;
-
-	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = NULL;
 
 	/*
 	 * This is executed from the parent task context, so inherit
-	 * counters that have been marked for cloning:
+	 * counters that have been marked for cloning.
+	 * First allocate and initialize a context for the child.
 	 */
-	if (likely(!parent_ctx->nr_counters))
+
+	child_ctx = kmalloc(sizeof(struct perf_counter_context), GFP_KERNEL);
+	if (!child_ctx)
+		return;
+
+	parent_ctx = parent->perf_counter_ctxp;
+	if (likely(!parent_ctx || !parent_ctx->nr_counters))
 		return;
 
+	__perf_counter_init_context(child_ctx, child);
+	child->perf_counter_ctxp = child_ctx;
+
 	/*
 	 * Lock the parent list. No need to lock the child - not PID
 	 * hashed yet and not running, so nobody can access it.
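
For readers who want the lifetime rule from the changelog in isolation, here is a
minimal userspace-style sketch of the reference-counting scheme: a context starts
with one reference held on behalf of its task, each attached counter takes another,
and whoever drops the last reference frees it, which is why a context can outlive
its task.  This is not the kernel code; it uses C11 atomics as a stand-in for the
kernel's atomic_t, and the struct and function names (ctx, ctx_alloc, ctx_get,
ctx_put) are illustrative only.

#include <stdatomic.h>
#include <stdlib.h>

struct ctx {
	atomic_int refcount;
	/* counter lists, lock and timing fields would live here */
};

static struct ctx *ctx_alloc(void)
{
	struct ctx *c = calloc(1, sizeof(*c));

	if (c)
		atomic_store(&c->refcount, 1);	/* reference held for the task */
	return c;
}

static void ctx_get(struct ctx *c)
{
	atomic_fetch_add(&c->refcount, 1);	/* e.g. a counter being attached */
}

static void ctx_put(struct ctx *c)
{
	/* free only when the last reference (task or counter) is dropped */
	if (atomic_fetch_sub(&c->refcount, 1) == 1)
		free(c);
}

The kernel's get_ctx()/put_ctx() in the patch above have the same shape; the point
of the sketch is simply that perf_counter_alloc() takes a reference for each counter
and free_counter_rcu()/perf_counter_exit_task() drop them.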
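
The lockless installation of the context pointer can be sketched in the same
style, building on the ctx helpers above.  It uses a C11 compare-exchange with
release ordering in place of the kernel's cmpxchg() plus smp_wmb(); struct task
and find_get_ctx() are again illustrative names, not kernel APIs.  The thread
that loses the race frees its freshly allocated context and adopts the winner's,
which is the shape of the find_get_context() change in the patch.

#include <stdatomic.h>
#include <stddef.h>

struct task {
	_Atomic(struct ctx *) ctxp;	/* plays the role of task_struct.perf_counter_ctxp */
};

static struct ctx *find_get_ctx(struct task *t)
{
	struct ctx *c = atomic_load(&t->ctxp);
	struct ctx *expected = NULL;

	if (c)
		return c;

	c = ctx_alloc();		/* fully initialize before publishing */
	if (!c)
		return NULL;

	/*
	 * Release ordering plays the part of smp_wmb(): the initialized
	 * contents of *c must be visible to other threads no later than
	 * the pointer itself.
	 */
	if (!atomic_compare_exchange_strong_explicit(&t->ctxp, &expected, c,
						     memory_order_release,
						     memory_order_acquire)) {
		ctx_put(c);		/* lost the race: drop ours, adopt theirs */
		c = expected;
	}
	return c;
}

The design choice this illustrates is that no lock is ever taken on the task just
to give it a context: either our pointer is published atomically, or someone
else's already was and we simply use it.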
Thread overview: 32+ messages

2009-05-22 4:17 [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Paul Mackerras
2009-05-22 4:27 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Paul Mackerras
2009-05-22 8:16 ` Peter Zijlstra
2009-05-22 9:56 ` Paul Mackerras
2009-05-22 10:08 ` Peter Zijlstra
2009-05-23 12:38 ` Ingo Molnar
2009-05-23 13:06 ` Peter Zijlstra
2009-05-24 23:55 ` Paul Mackerras
2009-05-22 8:32 ` Peter Zijlstra
2009-05-22 8:57 ` Ingo Molnar
2009-05-22 9:02 ` Peter Zijlstra
2009-05-22 10:14 ` Ingo Molnar
2009-05-22 9:29 ` Paul Mackerras
2009-05-22 9:22 ` Peter Zijlstra
2009-05-22 9:42 ` Peter Zijlstra
2009-05-22 10:07 ` Paul Mackerras
2009-05-22 10:05 ` Paul Mackerras
2009-05-22 10:11 ` Ingo Molnar
2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Optimize context switch between identical inherited contexts tip-bot for Paul Mackerras
2009-05-24 11:33 ` Ingo Molnar
2009-05-25 6:18 ` Paul Mackerras
2009-05-25 6:54 ` Ingo Molnar
2009-05-22 10:36 ` [tip:perfcounters/core] perf_counter: fix !PERF_COUNTERS build failure tip-bot for Ingo Molnar
2009-05-22 13:46 ` [PATCH 2/2] perf_counter: optimize context switch between identical inherited contexts Peter Zijlstra
2009-05-25 0:15 ` Paul Mackerras
2009-05-25 10:38 ` Peter Zijlstra
2009-05-25 10:50 ` Peter Zijlstra
2009-05-25 11:06 ` Paul Mackerras
2009-05-25 11:27 ` Peter Zijlstra
2009-05-22 8:06 ` [PATCH 1/2] perf_counter: dynamically allocate tasks' perf_counter_context struct [v2] Peter Zijlstra
2009-05-22 9:30 ` Paul Mackerras
2009-05-22 10:27 ` [tip:perfcounters/core] perf_counter: Dynamically allocate tasks' perf_counter_context struct tip-bot for Paul Mackerras