* [PATCH] perf_events: fix and improve x86 event scheduling
@ 2011-11-07 11:01 Stephane Eranian
2011-11-07 11:55 ` Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 30+ messages in thread
From: Stephane Eranian @ 2011-11-07 11:01 UTC (permalink / raw)
To: linux-kernel; +Cc: robert.richter, peterz, mingo, ming.m.lin, ak
Major rewrite of the x86 event scheduling routine.
The routine is shared between AMD and Intel.
The existing code had an issue with certain combinations
of constraints. As Robert Richter @ AMD demonstrated on
AMD Bulldozer:
e0 [3,0]
e1 [2:0]
e2 [2:0]
e3 [2:0]
With this list of events, the existing scheduling algorithm
would never schedule e3. That's because it would always choose
counter 0 for e0:
e0 -> 0
e1 -> 1
e2 -> 2
e3 -> X
Robert Richter proposed a fix to the algorithm which was based
on tagging constraints that overlap. He was using rollbacks to
simulate recursion.
We propose an alternate solution which is simpler and
faster. It is a small modification to the existing
algorithm. It adds some smarts to how a counter is chosen
for a given event. The first available counter is not
systematically selected. Instead, the algorithm looks at
how the constraints between the events overlap. Each
counter is assigned a weight: the number of events in the
set which can be programmed on it. For each event, the
algorithm then picks the eligible counter with the
smallest weight.
In the example above, the new algorithm assigns:
e0 -> 3
e1 -> 0
e2 -> 1
e3 -> 2
Because:
counter 0, weight = 4
counter 1, weight = 3
counter 2, weight = 3
counter 3, weight = 1
We maximize PMU usage and ensure all events are measured.
The patch also restructures the code to separate scheduling of
constrained vs. unconstrained events. An unconstrained event is
one that can be programmed into any of the generic counters. For
those, we now use the simplest algorithm possible: use the next
free counter. This provides a fast path which is beneficial on
processors such as AMD64 Fam10h.
Signed-off-by: Stephane Eranian <eranian@google.com>
---
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..95cca3c 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -488,60 +488,37 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
}
-int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+static int
+x86_sched_constrained(struct perf_event **c_events, int num_c,
+ struct event_constraint **constraints,
+ int n,
+ unsigned long *used_mask,
+ int *assign)
{
- struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
- unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int i, j, w, wmax, num = 0;
- struct hw_perf_event *hwc;
-
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
-
- for (i = 0; i < n; i++) {
- c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
- constraints[i] = c;
- }
+ struct event_constraint *c;
+ struct perf_event **e;
+ int wcnt[X86_PMC_IDX_MAX];
+ int i, j, w, wmax, num;
+ int map_idx;
+ int min_wcnt, wcnt_idx;
+ memset(wcnt, 0, sizeof(wcnt));
/*
- * fastpath, try to reuse previous register
+ * compute counter weight: given a set of event constraints, the weight
+ * represents the number of events which can be programmed on a counter
*/
for (i = 0; i < n; i++) {
- hwc = &cpuc->event_list[i]->hw;
c = constraints[i];
-
- /* never assigned */
- if (hwc->idx == -1)
- break;
-
- /* constraint still honored */
- if (!test_bit(hwc->idx, c->idxmsk))
- break;
-
- /* not already used */
- if (test_bit(hwc->idx, used_mask))
- break;
-
- __set_bit(hwc->idx, used_mask);
- if (assign)
- assign[i] = hwc->idx;
+ for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
+ wcnt[j]++;
+ }
}
- if (i == n)
- goto done;
/*
- * begin slow path
- */
-
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
-
- /*
- * weight = number of possible counters
+ * constraint weight = number of possible counters for an event
*
* 1 = most constrained, only works on one counter
* wmax = least constrained, works on any counter
- *
- * assign events to counters starting with most
- * constrained events.
*/
wmax = x86_pmu.num_counters;
@@ -553,42 +530,179 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (x86_pmu.num_counters_fixed)
wmax++;
- for (w = 1, num = n; num && w <= wmax; w++) {
- /* for each event */
- for (i = 0; num && i < n; i++) {
- c = constraints[i];
- hwc = &cpuc->event_list[i]->hw;
+ /*
+ * assign from most constrained to least constrained
+ */
+ for (w = 1, num = num_c; num && w <= wmax; w++) {
+ /* for each constrained event */
+ for (i = 0, e = c_events; i < num_c; i++, e++) {
+
+ map_idx = (*e)->hw.map_idx;
+ c = constraints[map_idx];
if (c->weight != w)
continue;
+ min_wcnt = INT_MAX;
+ wcnt_idx = -1;
+ /*
+ * scan all possible counters for this event
+ * but use the one with the smallest counter weight,
+ * i.e., give a chance to other less constrained events
+ */
for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
- if (!test_bit(j, used_mask))
- break;
- }
- if (j == X86_PMC_IDX_MAX)
+ if (test_bit(j, used_mask))
+ continue;
+
+ if (wcnt[j] < min_wcnt) {
+ min_wcnt = wcnt[j];
+ wcnt_idx = j;
+ }
+ }
+ if (wcnt_idx == -1)
break;
- __set_bit(j, used_mask);
+ __set_bit(wcnt_idx, used_mask);
if (assign)
- assign[i] = j;
- num--;
+ assign[map_idx] = wcnt_idx;
+
+ /* marked used */
+ wcnt[wcnt_idx] = INT_MAX;
+
+ if (--num == 0)
+ break;
+ }
+ }
+ return num;
+}
+
+static int x86_sched_unconstrained(struct perf_event **events,
+ int num_u,
+ int n,
+ unsigned long *used_mask,
+ int *assign)
+{
+ unsigned long free_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+ struct perf_event **e = events + n - num_u;
+ int i;
+
+ /*
+ * it is faster to stick with X86_PMC_IDX_MAX which is a
+ * power of 2 rather than trying to limit complement to
+ * actual number of counters
+ */
+ bitmap_complement(free_mask, used_mask, X86_PMC_IDX_MAX);
+
+ /*
+ * faster to use X86_PMC_IDX_MAX rather than the actual
+ * number of counters. We bail out once all events have
+ * been assigned anyway.
+ */
+ for_each_set_bit(i, free_mask, X86_PMC_IDX_MAX) {
+ if (i >= x86_pmu.num_counters)
+ break;
+ if (assign)
+ assign[(*e)->hw.map_idx] = i;
+ if (--num_u == 0)
+ break;
+ e++;
+ }
+ return num_u;
+}
+
+
+int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+ struct event_constraint **c;
+ struct event_constraint *constraints[X86_PMC_IDX_MAX];
+ struct perf_event *events[X86_PMC_IDX_MAX];
+ unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+ int i;
+ int num_c = 0, num_u = 0;
+ struct hw_perf_event *hwc;
+
+ /*
+ * collect event constraints
+ * separate constrained vs. unconstrained events
+ *
+ * map_idx = actual index in event_list and assign arrays
+ *
+ * we add constrained events from index 0 to n
+ * we add unconstrained events from index n - 1 to 0
+ * that way we save stack memory by not defining two arrays
+ */
+ for (i = 0; i < n; i++) {
+ constraints[i] =
+ x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
+ if (constraints[i] != &unconstrained) {
+ events[num_c] = cpuc->event_list[i];
+ events[num_c]->hw.map_idx = i;
+ num_c++;
+ } else {
+ events[n - num_u - 1] = cpuc->event_list[i];
+ events[n - num_u - 1]->hw.map_idx = i;
+ num_u++;
}
}
+ /*
+ * fastpath: try to reuse previous assignments
+ */
+ for (i = 0, c = constraints; i < n; i++, c++) {
+ hwc = &cpuc->event_list[i]->hw;
+
+ /* never assigned, must go to slow path */
+ if (hwc->idx == -1)
+ break;
+
+ /* constraint still honored */
+ if (!test_bit(hwc->idx, (*c)->idxmsk))
+ break;
+
+ /* counter not already used */
+ if (test_bit(hwc->idx, used_mask))
+ break;
+
+ __set_bit(hwc->idx, used_mask);
+ if (assign)
+ assign[i] = hwc->idx;
+ }
+ /* was able to reuse every counter, we're done */
+ if (i == n) {
+ num_u = num_c = 0;
+ goto done;
+ }
+
+ /*
+ * begin slow path
+ */
+ bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+
+ /* schedule constrained events */
+ if (num_c)
+ num_c = x86_sched_constrained(events,
+ num_c,
+ constraints,
+ n, used_mask, assign);
+
+ /* schedule unconstrained events */
+ if (num_u)
+ num_u = x86_sched_unconstrained(events,
+ num_u,
+ n, used_mask, assign);
done:
/*
* scheduling failed or is just a simulation,
* free resources if necessary
*/
- if (!assign || num) {
+ if (!assign || num_u || num_c) {
for (i = 0; i < n; i++) {
if (x86_pmu.put_event_constraints)
x86_pmu.put_event_constraints(cpuc, cpuc->event_list[i]);
}
}
- return num ? -ENOSPC : 0;
+ return num_c || num_u ? -ENOSPC : 0;
}
/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e9ebe5..2605244 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -564,6 +564,7 @@ struct hw_perf_event {
int idx;
int last_cpu;
struct hw_perf_event_extra extra_reg;
+ int map_idx;
};
struct { /* software */
struct hrtimer hrtimer;
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 11:01 [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
@ 2011-11-07 11:55 ` Peter Zijlstra
2011-11-07 12:10 ` Peter Zijlstra
` (2 subsequent siblings)
3 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-07 11:55 UTC (permalink / raw)
To: Stephane Eranian; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Mon, 2011-11-07 at 12:01 +0100, Stephane Eranian wrote:
> Major rewrite of the x86 event scheduling routine.
> The routine is shared between AMD and Intel.
>
> The existing code had an issue with certain combinations
> of constraints. As Robert Richter @ AMD demonstrated on
> AMD Bulldozer:
>
> e0 [3,0]
> e1 [2:0]
> e2 [2:0]
> e3 [2:0]
>
> With this list of events, the existing scheduling algorithm
> would never schedule e3. That's because it would always choose
> counter 0 for e0:
> e0 -> 0
> e1 -> 1
> e2 -> 2
> e3 -> X
>
> Robert Richter proposed a fix to the algorithm which was based
> on tagging constraints that overlap. He was using rollbacks to
> simulate recursion.
>
> We propose an alternate solution which is simpler, faster. This
> is a small modification to the existing algorithm. It adds some
> smart in how a counter is chosen for a given event. The first
> available counter is not systemtically selected. Instead, the
systematically
> algorithm checks how the constraints between the events overlap.
> For each counter, it assigns a weight which is the number of events
> which it can count for the event set. For each event, the algorithm
> assigns the counter with the smallest weight first.
>
> In the example above, the new algoritm assigns:
algorithm
> e0 -> 3
> e1 -> 0
> e2 -> 1
> e3 -> 2
>
> Because:
> counter 0, weight = 4
> counter 1, weight = 3
> counter 2, weight = 3
> counter 3, weight = 1
>
> We maximize PMU usage and ensure all events are measured.
Might be good to mention this is O(n^3) in time; also is there still a
case where this algorithm will fail and the O(n!) one will succeed?
The complexities seem to suggest there is, and I suspect it would be
where there is a tie in constraint weight. Please think about this and
tell us why we don't care ;-)
Few nits on the code:
> +int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> +{
> + struct event_constraint **c;
> + struct event_constraint *constraints[X86_PMC_IDX_MAX];
> + struct perf_event *events[X86_PMC_IDX_MAX];
> + unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> + int i;
> + int num_c = 0, num_u = 0;
> + struct hw_perf_event *hwc;
> +
> + /*
> + * collect event constraints
> + * separate constrained vs. unconstrained events
> + *
> + * map_idx = actual index in event_list and assign arrays
> + *
> + * we add constrained events from index 0 to n
> + * we add unconstrained events from index n - 1 to 0
> + * that way we save stack memory by not defining two arrays
> + */
> + for (i = 0; i < n; i++) {
> + constraints[i] =
> + x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
> + if (constraints[i] != &unconstrained) {
> + events[num_c] = cpuc->event_list[i];
> + events[num_c]->hw.map_idx = i;
> + num_c++;
> + } else {
> + events[n - num_u - 1] = cpuc->event_list[i];
> + events[n - num_u - 1]->hw.map_idx = i;
> + num_u++;
> }
> }
> + /*
> + * fastpath: try to reuse previous assignments
> + */
> + for (i = 0, c = constraints; i < n; i++, c++) {
> + hwc = &cpuc->event_list[i]->hw;
> +
> + /* never assigned, must go to slow path */
> + if (hwc->idx == -1)
> + break;
> +
> + /* constraint still honored */
> + if (!test_bit(hwc->idx, (*c)->idxmsk))
> + break;
> +
> + /* counter not already used */
> + if (test_bit(hwc->idx, used_mask))
> + break;
> +
> + __set_bit(hwc->idx, used_mask);
> + if (assign)
> + assign[i] = hwc->idx;
> + }
> + /* was able to reuse every counter, we're done */
> + if (i == n) {
> + num_u = num_c = 0;
> + goto done;
> + }
> +
> + /*
> + * begin slow path
> + */
> + bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> +
> + /* schedule constrained events */
> + if (num_c)
> + num_c = x86_sched_constrained(events,
> + num_c,
> + constraints,
> + n, used_mask, assign);
> +
> + /* schedule unconstrained events */
> + if (num_u)
> + num_u = x86_sched_unconstrained(events,
> + num_u,
> + n, used_mask, assign);
For both branches, please wrap in (superfluous) curly braces.
> done:
> /*
> * scheduling failed or is just a simulation,
> * free resources if necessary
> */
> - if (!assign || num) {
> + if (!assign || num_u || num_c) {
> for (i = 0; i < n; i++) {
> if (x86_pmu.put_event_constraints)
> x86_pmu.put_event_constraints(cpuc, cpuc->event_list[i]);
> }
> }
> - return num ? -ENOSPC : 0;
> + return num_c || num_u ? -ENOSPC : 0;
> }
We really should change that -ENOSPC... :-)
>
> /*
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1e9ebe5..2605244 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -564,6 +564,7 @@ struct hw_perf_event {
> int idx;
> int last_cpu;
> struct hw_perf_event_extra extra_reg;
> + int map_idx;
> };
> struct { /* software */
> struct hrtimer hrtimer;
Somewhat sad we need to add a field there. I think it's ok since the
software side of things is still the largest so it doesn't matter, but do
check.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 11:01 [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2011-11-07 11:55 ` Peter Zijlstra
@ 2011-11-07 12:10 ` Peter Zijlstra
2011-11-07 13:52 ` Stephane Eranian
2011-11-10 14:37 ` Peter Zijlstra
2011-11-10 18:03 ` Robert Richter
3 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-07 12:10 UTC (permalink / raw)
To: Stephane Eranian; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Mon, 2011-11-07 at 12:01 +0100, Stephane Eranian wrote:
> + /*
> + * scan all possible counters for this event
> + * but use the one with the smallest counter weight,
> + * i.e., give a chance to other less constrained events
> + */
> for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
>
> + if (test_bit(j, used_mask))
> + continue;
> +
> + if (wcnt[j] < min_wcnt) {
> + min_wcnt = wcnt[j];
> + wcnt_idx = j;
> + }
> + }
The problem with this is that it will typically hit the worst case for
Intel fixed-purpose events since the fixed purpose counters have the
highest counter index and their constraint masks are the heaviest in the
system ensuring we hit the max loop count on the top loop.
Then again, with Robert's approach we have to mark all fixed purpose
thingies as redo and we might hit some weird cases there as well; can't
seem to get my brain straight on that case though.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 12:10 ` Peter Zijlstra
@ 2011-11-07 13:52 ` Stephane Eranian
2011-11-07 14:56 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2011-11-07 13:52 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Mon, Nov 7, 2011 at 12:10 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-07 at 12:01 +0100, Stephane Eranian wrote:
>> + /*
>> + * scan all possible counters for this event
>> + * but use the one with the smallest counter weight,
>> + * i.e., give a chance to other less constrained events
>> + */
>> for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
>>
>> + if (test_bit(j, used_mask))
>> + continue;
>> +
>> + if (wcnt[j] < min_wcnt) {
>> + min_wcnt = wcnt[j];
>> + wcnt_idx = j;
>> + }
>> + }
>
> The problem with this is that it will typically hit the worst case for
> Intel fixed-purpose events since the fixed purpose counters have the
> highest counter index and their constraint masks are the heaviest in the
> system ensuring we hit the max loop count on the top loop.
>
Yes, but to avoid this you'd need some form of sorting.
We could also cut down on top loop iterations by caching the weight list
of the events and examining only the weights we do have.
As for the complexity, I think it is mostly bound by the number of
counters rather than the number of events.
The top loop is about event weights, that's bound by the number of counters.
The middle loop is about the number of events.
The inner loop is about the number of counters.
So I'd expect the complexity to be O(c^2*n)
where c is the number of counters, n the number of events.
But given we limit the number of events to that of counters,
we do have O(c^3).
As for the map_idx, it's there to track the position of each event in the
initial event list. We shuffle events between constrained and unconstrained.
By stashing the map_idx in the hw_perf_event struct we avoid having to
pass around yet another array.
> Then again, with Robert's approach we have to mark all fixed purpose
> thingies as redo and we might hit some weird cases there as well, can't
> seem to get me brain straight on that case though.
>
>
>
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 13:52 ` Stephane Eranian
@ 2011-11-07 14:56 ` Peter Zijlstra
0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-07 14:56 UTC (permalink / raw)
To: Stephane Eranian; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Mon, 2011-11-07 at 13:52 +0000, Stephane Eranian wrote:
> But given we limit the number of events to that of counters,
> we do have O(c^3).
Right, but SNB without HT gives you 8 GP counters, yielding a rather big
number. Suppose you're trying to fill it with 9 cycle events (1 for the
fixed purpose thingy), that'll end up being: 9^3 = 729 = big number.
(arguably adding 9 cycle counters is a tad retarded, but hey ;-)
It would be good to try and get it down to somewhere near 81 again,
although my brain isn't currently providing any sane ideas on how.
> As for the map_idx, it's there to track the position of each event in the
> initial event list. We shuffle events between constrained and unconstrained.
> By stashing the map_idx in the hw_perf_event struct we avoid having to
> pass around yet another array.
Yeah, I saw why you needed it..
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 11:01 [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2011-11-07 11:55 ` Peter Zijlstra
2011-11-07 12:10 ` Peter Zijlstra
@ 2011-11-10 14:37 ` Peter Zijlstra
2011-11-10 15:09 ` Stephane Eranian
2011-11-10 18:03 ` Robert Richter
3 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-10 14:37 UTC (permalink / raw)
To: Stephane Eranian; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
Just throwing this out there (hasn't even been compiled etc..).
The idea is to try the fixed counters first so that we don't
'accidentally' fill a GP counter with something that could have lived on
the fixed purpose one and then end up under utilizing the PMU that way.
It ought to solve the most common PMU programming fail on Intel
thingies.
---
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -558,14 +558,22 @@ int x86_schedule_events(struct cpu_hw_ev
if (c->weight != w)
continue;
- for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
+ if (x86_pmu.num_counters_fixed) {
+ j = X86_PMC_IDX_FIXED - 1;
+ for_each_set_bit_cont(j, c->idxmsk, X86_PMC_IDX_MAX) {
+ if (!test_bit(j, used_mask))
+ goto assign;
+ }
+ }
+
+ for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_FIXED) {
if (!test_bit(j, used_mask))
- break;
+ goto assign;
}
- if (j == X86_PMC_IDX_MAX)
- break;
+ break;
+assign:
__set_bit(j, used_mask);
if (assign)
Index: linux-2.6/include/linux/bitops.h
===================================================================
--- linux-2.6.orig/include/linux/bitops.h
+++ linux-2.6/include/linux/bitops.h
@@ -26,6 +26,12 @@ extern unsigned long __sw_hweight64(__u6
(bit) < (size); \
(bit) = find_next_bit((addr), (size), (bit) + 1))
+#define for_each_set_bit_cont(bit, addr, size) \
+ for ((bit) = find_next_bit((addr), (size), (bit) + 1); \
+ (bit) < (size); \
+ (bit) = find_next_bit((addr), (size), (bit) + 1))
+
+
static __inline__ int get_bitmask_order(unsigned int count)
{
int order;
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 14:37 ` Peter Zijlstra
@ 2011-11-10 15:09 ` Stephane Eranian
2011-11-10 16:41 ` Gleb Natapov
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Stephane Eranian @ 2011-11-10 15:09 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Thu, Nov 10, 2011 at 3:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> Just throwing this out there (hasn't event been compiled etc..).
>
> The idea is to try the fixed counters first so that we don't
> 'accidentally' fill a GP counter with something that could have lived on
> the fixed purpose one and then end up under utilizing the PMU that way.
>
> It ought to solve the most common PMU programming fail on Intel
> thingies.
>
What are the configs for which you have failures on Intel?
I think I can improve my algorithm for fixed counters by treating
them separately and trying fixed counters first for any supported
event.
> ---
> Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
> +++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
> @@ -558,14 +558,22 @@ int x86_schedule_events(struct cpu_hw_ev
>                         if (c->weight != w)
>                                 continue;
>
> -                       for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> +                       if (x86_pmu.num_counters_fixed) {
> +                               j = X86_PMC_IDX_FIXED - 1;
> +                               for_each_set_bit_cont(j, c->idxmsk, X86_PMC_IDX_MAX) {
> +                                       if (!test_bit(k, used_mask))
> +                                               goto assign;
> +                               }
> +                       }
> +
> +                       for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_FIXED) {
>                                 if (!test_bit(j, used_mask))
> -                                       break;
> +                                       goto assign;
>                         }
>
> -                       if (j == X86_PMC_IDX_MAX)
> -                               break;
> +                       break;
>
> +assign:
>                         __set_bit(j, used_mask);
>
>                         if (assign)
> Index: linux-2.6/include/linux/bitops.h
> ===================================================================
> --- linux-2.6.orig/include/linux/bitops.h
> +++ linux-2.6/include/linux/bitops.h
> @@ -26,6 +26,12 @@ extern unsigned long __sw_hweight64(__u6
>             (bit) < (size); \
>             (bit) = find_next_bit((addr), (size), (bit) + 1))
>
> +#define for_each_set_bit_cont(bit, addr, size) \
> +       for ((bit) = find_next_bit((addr), (size), (bit) + 1); \
> +             (bit) < (size); \
> +             (bit) = find_next_bit((addr), (size), (bit) + 1))
> +
> +
>  static __inline__ int get_bitmask_order(unsigned int count)
>  {
>         int order;
>
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 15:09 ` Stephane Eranian
@ 2011-11-10 16:41 ` Gleb Natapov
2011-11-10 16:59 ` Stephane Eranian
2011-11-10 16:59 ` Robert Richter
2011-11-10 18:31 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Gleb Natapov @ 2011-11-10 16:41 UTC (permalink / raw)
To: Stephane Eranian
Cc: Peter Zijlstra, linux-kernel, robert.richter, mingo, ming.m.lin,
ak
On Thu, Nov 10, 2011 at 04:09:32PM +0100, Stephane Eranian wrote:
> On Thu, Nov 10, 2011 at 3:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Just throwing this out there (hasn't event been compiled etc..).
> >
> > The idea is to try the fixed counters first so that we don't
> > 'accidentally' fill a GP counter with something that could have lived on
> > the fixed purpose one and then end up under utilizing the PMU that way.
> >
> > It ought to solve the most common PMU programming fail on Intel
> > thingies.
> >
Heh, just looked into doing exactly that here.
> What are the configs for which you have failures on Intel?
>
Suppose you have 3 fixed event counters, 2 GP counters, and 3 events.
One can go to one of the fixed counters or any GP; the 2 others can only
go on a GP. If the event that can go to a fixed counter is placed into a
GP counter, then one of the remaining events will fail to be scheduled.
> I think I can improve my algorithm for fixed counters by treating
> them separately and trying fixed counters first any supported
> event.
>
--
Gleb.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 15:09 ` Stephane Eranian
2011-11-10 16:41 ` Gleb Natapov
@ 2011-11-10 16:59 ` Robert Richter
2011-11-10 18:31 ` Peter Zijlstra
2 siblings, 0 replies; 30+ messages in thread
From: Robert Richter @ 2011-11-10 16:59 UTC (permalink / raw)
To: Stephane Eranian; +Cc: Peter Zijlstra, linux-kernel, mingo, ming.m.lin, ak
On 10.11.11 16:09:32, Stephane Eranian wrote:
> On Thu, Nov 10, 2011 at 3:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > Just throwing this out there (hasn't event been compiled etc..).
> >
> > The idea is to try the fixed counters first so that we don't
> > 'accidentally' fill a GP counter with something that could have lived on
> > the fixed purpose one and then end up under utilizing the PMU that way.
> >
> > It ought to solve the most common PMU programming fail on Intel
> > thingies.
Peter, what exactly are the Intel constraints you are trying to solve?
Could you give a short example?
Thanks,
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 16:41 ` Gleb Natapov
@ 2011-11-10 16:59 ` Stephane Eranian
2011-11-10 17:13 ` Gleb Natapov
0 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2011-11-10 16:59 UTC (permalink / raw)
To: Gleb Natapov
Cc: Peter Zijlstra, linux-kernel, robert.richter, mingo, ming.m.lin,
ak
On Thu, Nov 10, 2011 at 5:41 PM, Gleb Natapov <gleb@redhat.com> wrote:
> On Thu, Nov 10, 2011 at 04:09:32PM +0100, Stephane Eranian wrote:
>> On Thu, Nov 10, 2011 at 3:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > Just throwing this out there (hasn't event been compiled etc..).
>> >
>> > The idea is to try the fixed counters first so that we don't
>> > 'accidentally' fill a GP counter with something that could have lived on
>> > the fixed purpose one and then end up under utilizing the PMU that way.
>> >
>> > It ought to solve the most common PMU programming fail on Intel
>> > thingies.
>> >
> Heh, just looked into doing exactly that here.
>
>> What are the configs for which you have failures on Intel?
>>
> Suppose you have 3 fixed event counters and 2 GP counters and 3 event.
> One can go to one of the fixed counters or any GP, 2 others can be only on
> GP. If the event that can go to fixed counter will be placed into GP
> counter then one of the remaining events will fail to be scheduled.
>
Yes and the current algorithm does the right thing.
e1 (1 fixed +2 GP) -> weight = 3
e2 (2 GP) -> weight = 2
e3 (2 GP) -> weight = 2
The current algorithm schedules the events from light to heavy
weight. Thus, it schedules e2 and e3 first on the GPs, and
e1 necessarily ends up on the fixed counter.
Do you have a test case where this does not work on Intel?
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 16:59 ` Stephane Eranian
@ 2011-11-10 17:13 ` Gleb Natapov
0 siblings, 0 replies; 30+ messages in thread
From: Gleb Natapov @ 2011-11-10 17:13 UTC (permalink / raw)
To: Stephane Eranian
Cc: Peter Zijlstra, linux-kernel, robert.richter, mingo, ming.m.lin,
ak
On Thu, Nov 10, 2011 at 05:59:22PM +0100, Stephane Eranian wrote:
> On Thu, Nov 10, 2011 at 5:41 PM, Gleb Natapov <gleb@redhat.com> wrote:
> > On Thu, Nov 10, 2011 at 04:09:32PM +0100, Stephane Eranian wrote:
> >> On Thu, Nov 10, 2011 at 3:37 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > Just throwing this out there (hasn't event been compiled etc..).
> >> >
> >> > The idea is to try the fixed counters first so that we don't
> >> > 'accidentally' fill a GP counter with something that could have lived on
> >> > the fixed purpose one and then end up under utilizing the PMU that way.
> >> >
> >> > It ought to solve the most common PMU programming fail on Intel
> >> > thingies.
> >> >
> > Heh, just looked into doing exactly that here.
> >
> >> What are the configs for which you have failures on Intel?
> >>
> > Suppose you have 3 fixed event counters and 2 GP counters and 3 event.
> > One can go to one of the fixed counters or any GP, 2 others can be only on
> > GP. If the event that can go to fixed counter will be placed into GP
> > counter then one of the remaining events will fail to be scheduled.
> >
> Yes and the current algorithm does the right thing.
> e1 (1 fixed +2 GP) -> weight = 3
> e2 (2 GP) -> weight = 2
> e3 (2 GP) -> weight = 2
>
> The current algorithm schedules the event from light to heavy
> weight. Thus, it schedules e2, e3 first on the GPs and then
> e1 necessarily ends up on the fixed counter.
>
> Do you have a test case where this does not work on Intel?
Yeah, you are right indeed. Actually my incentive was to force an event
to a fixed counter since it didn't count correctly when programmed into
a GP.
What if an event that can go to a fixed counter can go only to the first
GP? Then the weight of each event will be 2. But I am not sure that such
a constraint actually exists on any real CPU.
--
Gleb.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-07 11:01 [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
` (2 preceding siblings ...)
2011-11-10 14:37 ` Peter Zijlstra
@ 2011-11-10 18:03 ` Robert Richter
2011-11-10 18:41 ` Peter Zijlstra
` (2 more replies)
3 siblings, 3 replies; 30+ messages in thread
From: Robert Richter @ 2011-11-10 18:03 UTC (permalink / raw)
To: Stephane Eranian
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On 07.11.11 06:01:49, Stephane Eranian wrote:
>
> Major rewrite of the x86 event scheduling routine.
> The routine is shared between AMD and Intel.
>
> The existing code had an issue with certain combinations
> of constraints. As Robert Richter @ AMD demonstrated on
> AMD Bulldozer:
>
> e0 [3,0]
> e1 [2:0]
> e2 [2:0]
> e3 [2:0]
>
> With this list of events, the existing scheduling algorithm
> would never schedule e3. That's because it would always choose
> counter 0 for e0:
> e0 -> 0
> e1 -> 1
> e2 -> 2
> e3 -> X
>
> Robert Richter proposed a fix to the algorithm which was based
> on tagging constraints that overlap. He was using rollbacks to
> simulate recursion.
>
> We propose an alternate solution which is simpler and faster. It is
> a small modification of the existing algorithm, adding some smarts
> to how a counter is chosen for a given event. The first
Posting a link to my patch set for reference:
https://lkml.org/lkml/2011/9/10/51
What are the reasons for your alternate solution? Is it recursion, code
complexity, or a use case it does not cover? I see recursion as the
main concern with my patch set, but its impact is known and limited.
Anyway, an algorithm without recursion would generally be better.
> available counter is not systematically selected. Instead, the
> algorithm checks how the constraints between the events overlap.
> For each counter, it assigns a weight which is the number of events
> which it can count for the event set. For each event, the algorithm
> assigns the counter with the smallest weight first.
But this algorithm does not work for all cases and does not solve the
problem in general. Your idea to have weights for counters might be
the right approach.
E.g. the algorithm fails with (all weights are the same):
      c0  c1  c2
 e0    x       x
 e1        x   x
 e2    x   x

... leading to:

 e0 -> c0
 e1 -> c1
 e2 -> X
You basically have to recalculate the weights after you had assigned a
counter.
But even if we do this, I am still not sure if that would cover all
cases.
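Robert's counterexample can be reproduced with a quick sketch (Python, purely illustrative): the static smallest-counter-weight heuristic fails on it, even though an exhaustive search shows a complete assignment exists.

```python
from itertools import permutations

# Allowed counters per event, from the counterexample above.
constraints = {
    "e0": ["c0", "c2"],
    "e1": ["c1", "c2"],
    "e2": ["c0", "c1"],
}

def counter_weight(c):
    # Static weight: number of events that can use counter c.
    return sum(c in allowed for allowed in constraints.values())

def smallest_weight_first():
    """One-pass heuristic: each event takes its free counter with the
    smallest static counter weight. Returns None on failure."""
    assign, used = {}, set()
    for ev, allowed in constraints.items():
        free = [c for c in allowed if c not in used]
        if not free:
            return None
        best = min(free, key=counter_weight)
        assign[ev] = best
        used.add(best)
    return assign

heuristic = smallest_weight_first()   # all weights tie, e2 fails: None

def exhaustive():
    # Brute force over all counter permutations, for comparison.
    events = list(constraints)
    for perm in permutations(["c0", "c1", "c2"]):
        if all(c in constraints[e] for e, c in zip(events, perm)):
            return dict(zip(events, perm))
    return None

complete = exhaustive()               # a full assignment does exist
```

This is exactly the point made above: the weights would have to be recalculated (or assignments revisited) after each counter is taken.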
>
> In the example above, the new algorithm assigns:
> e0 -> 3
> e1 -> 0
> e2 -> 1
> e3 -> 2
>
> Because:
> counter 0, weight = 4
> counter 1, weight = 3
> counter 2, weight = 3
> counter 3, weight = 1
>
> We maximize PMU usage and ensure all events are measured.
>
> The patch also restructures the code to separate scheduling of
> constrained vs. unconstrained events. An unconstrained event is
> one that can be programmed into any of the generic counters. For
> those, we now use the simplest algorithm possible: use the next free
> counter. There is now a fast path which is beneficial on
> processors such as AMD64 Fam10h.
I don't see the need for a differentiation between constrained and
unconstrained events. If loops are optimized in the constrained path
there is not much overhead anymore. This could be done by specifying
min and max limits for ranges. Special cases (unconstrained events,
fixed counters, etc.) make the code more complex. I don't think a good
algorithm will need this.
> @@ -553,42 +530,179 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> if (x86_pmu.num_counters_fixed)
> wmax++;
>
> - for (w = 1, num = n; num && w <= wmax; w++) {
> - /* for each event */
> - for (i = 0; num && i < n; i++) {
> - c = constraints[i];
> - hwc = &cpuc->event_list[i]->hw;
> + /*
> + * assign from most constrained to least constrained
> + */
> + for (w = 1, num = num_c; num && w <= wmax; w++) {
> + /* for each constrained event */
> + for (i = 0, e = c_events; i < num_c; i++, e++) {
> +
> + map_idx = (*e)->hw.map_idx;
> + c = constraints[map_idx];
>
> if (c->weight != w)
> continue;
>
> + min_wcnt = INT_MAX;
Should be X86_PMC_IDX_MAX.
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1e9ebe5..2605244 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -564,6 +564,7 @@ struct hw_perf_event {
> int idx;
> int last_cpu;
> struct hw_perf_event_extra extra_reg;
> + int map_idx;
This is not the right place. It is visible to all architectures but
actually only used locally. A local array in x86_schedule_events()
would work too.
-Robert
> };
> struct { /* software */
> struct hrtimer hrtimer;
>
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 15:09 ` Stephane Eranian
2011-11-10 16:41 ` Gleb Natapov
2011-11-10 16:59 ` Robert Richter
@ 2011-11-10 18:31 ` Peter Zijlstra
2 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-10 18:31 UTC (permalink / raw)
To: Stephane Eranian; +Cc: linux-kernel, robert.richter, mingo, ming.m.lin, ak
On Thu, 2011-11-10 at 16:09 +0100, Stephane Eranian wrote:
> What are the configs for which you have failures on Intel?
>
Continuing to use Core2 as an example platform:
cycles,cycles,instructions,instructions
They all have weight 3, yet the masks are not the same, it'll assign
them like: pmc0, pmc1, fp-instructions, fail
Counter rotation will make sure the next attempt looks like:
instructions,cycles,cycles,instructions
Which will work correctly since it'll do: pmc0, pmc1, fp-cycles,
fp-instructions.
It's usually not a really big deal, but I ran into it a few months ago and
noticed it because the scaled values weren't what I was expecting them
to be; it took me a while to figure out what happened.
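The failure mode can be sketched quickly (Python, with made-up Core2-like counter names; the real masks live in the Intel constraint tables):

```python
def first_free(order, masks):
    """Assign each event the first free counter in its mask (the
    existing equal-weight behaviour); all four events have weight 3,
    so the tie-break is just submission order."""
    assign, used = [], set()
    for ev in order:
        free = [c for c in masks[ev] if c not in used]
        if not free:
            return None  # scheduling failed
        assign.append(free[0])
        used.add(free[0])
    return assign

# Hypothetical Core2-like masks: generic counters first, fixed last.
masks = {
    "cycles":       ["pmc0", "pmc1", "fixed-cycles"],
    "instructions": ["pmc0", "pmc1", "fixed-instructions"],
}

# Original order fails: pmc0, pmc1, fixed-instructions, then nothing left.
fails = first_free(["cycles", "cycles", "instructions", "instructions"],
                   masks)

# Rotated order succeeds: pmc0, pmc1, fixed-cycles, fixed-instructions.
works = first_free(["instructions", "cycles", "cycles", "instructions"],
                   masks)
```

Trying the fixed counters first, as proposed earlier in the thread, would make the first ordering succeed too.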
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 18:03 ` Robert Richter
@ 2011-11-10 18:41 ` Peter Zijlstra
2011-11-10 18:52 ` Peter Zijlstra
2011-11-14 12:55 ` [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-10 18:41 UTC (permalink / raw)
To: Robert Richter
Cc: Stephane Eranian, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Thu, 2011-11-10 at 19:03 +0100, Robert Richter wrote:
> But this algorithm does not work for all cases and does not solve the
> problem in general.
Yeah, the problem in general is O(n!); no O(n^3) algorithm can compute
the optimal solution for n>3 or so.
I think the goal is to keep the 'normal' case O(n^2) but try and suck
less for the corner cases without degenerating into a full blown O(n!).
So I think we want an amortized O(n^2) with an upper bound well below
O(n!).
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 18:03 ` Robert Richter
2011-11-10 18:41 ` Peter Zijlstra
@ 2011-11-10 18:52 ` Peter Zijlstra
2011-11-11 14:29 ` Robert Richter
2011-11-14 12:55 ` [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-10 18:52 UTC (permalink / raw)
To: Robert Richter
Cc: Stephane Eranian, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Thu, 2011-11-10 at 19:41 +0100, Peter Zijlstra wrote:
> On Thu, 2011-11-10 at 19:03 +0100, Robert Richter wrote:
> > But this algorithm does not work for all cases and does not solve the
> > problem in general.
>
> Yeah, the problem in general is O(n!); no O(n^3) algorithm can compute
> the optimal solution for n>3 or so.
>
> I think the goal is to keep the 'normal' case O(n^2) but try and suck
> less for the corner cases without degenerating into a full blown O(n!).
>
> So I think we want an amortized O(n^2) with an upper bound well below
> O(n!).
I don't think it's reasonable to require a perfect solver for the generic
problem, since O(n!) is well outside sanity.
Practically though, the most challenging problem is the AMD F15 thing,
since those constraints are quite specific it might just be possible to
construct an algorithm that finds optimal solutions well below O(n!) for
that particular constraint set.
IIRC, Robert's proposal limits the rewind stack to 1, which, if my mind
didn't completely stop working, should end up being something like
O(n^3). Now I don't know if Robert's thing is perfect for AMD F15 or if
there are still some odd fail cases, but since he's from AMD I suspect
it's good enough in practice.
In fact, I almost merged his code; my only complaint was a lack of
comments. Having had to spend several hours to fully understand the
thing a few months ago, I felt it really could use some, since I didn't
want to have to spend that amount of effort every time I'd have to look
at the thing.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 18:52 ` Peter Zijlstra
@ 2011-11-11 14:29 ` Robert Richter
2011-11-14 17:51 ` [PATCH v3 0/2] perf, x86: handle overlapping counters Robert Richter
0 siblings, 1 reply; 30+ messages in thread
From: Robert Richter @ 2011-11-11 14:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Stephane Eranian, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On 10.11.11 19:52:47, Peter Zijlstra wrote:
> IIRC, Robert's proposal limits the rewind stack to 1, which, if my mind
> didn't completely stop working, should end up being something like
> O(n^3). Now I don't know if Robert's thing is perfect for AMD F15 or if
> there's still some odd fail cases, but since he's from AMD I suspect its
> good enough in practice.
Yes, patches are ok for AMD F15.
> In fact, I almost merged his code, my only complaint was a lack of
> comments and having had to spend several hours to fully understand the
> thing a few months ago, I felt it really could use some since I didn't
> want to have to spend that amount of effort every time I'd have to look
> at the thing.
Will repost an updated version.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-10 18:03 ` Robert Richter
2011-11-10 18:41 ` Peter Zijlstra
2011-11-10 18:52 ` Peter Zijlstra
@ 2011-11-14 12:55 ` Stephane Eranian
2011-11-14 14:12 ` Peter Zijlstra
2 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2011-11-14 12:55 UTC (permalink / raw)
To: Robert Richter
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
Robert,
On Thu, Nov 10, 2011 at 7:03 PM, Robert Richter <robert.richter@amd.com> wrote:
> On 07.11.11 06:01:49, Stephane Eranian wrote:
>>
>> Major rewrite of the x86 event scheduling routine.
>> The routine is shared between AMD and Intel.
>>
>> The existing code had an issue with certain combinations
>> of constraints. As Robert Richter @ AMD demonstrated on
>> AMD Bulldozer:
>>
>> e0 [3,0]
>> e1 [2:0]
>> e2 [2:0]
>> e3 [2:0]
>>
>> With this list of events, the existing scheduling algorithm
>> would never schedule e3. That's because it would always choose
>> counter 0 for e0:
>> e0 -> 0
>> e1 -> 1
>> e2 -> 2
>> e3 -> X
>>
>> Robert Richter proposed a fix to the algorithm which was based
>> on tagging constraints that overlap. He was using rollbacks to
>> simulate recursion.
>>
>> We propose an alternate solution which is simpler and faster. It is
>> a small modification of the existing algorithm, adding some smarts
>> to how a counter is chosen for a given event. The first
>
> Posting a link to my patch set for reference:
>
> https://lkml.org/lkml/2011/9/10/51
>
> What are the reasons for your alternate solution? Is it recursion, code
> complexity, or a use case it does not cover? I see recursion as the
> main concern with my patch set, but its impact is known and limited.
> Anyway, an algorithm without recursion would generally be better.
>
>> available counter is not systematically selected. Instead, the
>> algorithm checks how the constraints between the events overlap.
>> For each counter, it assigns a weight which is the number of events
>> which it can count for the event set. For each event, the algorithm
>> assigns the counter with the smallest weight first.
>
> But this algorithm does not work for all cases and does not solve the
> problem in general. Your idea to have weights for counters might be
> the right approach.
>
> E.g. the algorithm fails with (all weights are the same):
>
>      c0  c1  c2
> e0    x       x
> e1        x   x
> e2    x   x
>
> ... leading to:
>
> e0 -> c0
> e1 -> c1
> e2 -> X
>
> You basically have to recalculate the weights after you had assigned a
> counter.
>
Yes, it does not yield an assignment which maximizes the PMU usage.
I have been talking with co-workers who are experts in operations
research about our problem. They all pointed me to the max-flow
algorithm from Ford-Fulkerson (search for it on Wikipedia). I think it
solves the complexity and recursion problems. My understanding is that
the complexity is also more under control.
I started experimenting with this algorithm. I will report in a few days.
One thing is for sure: it does provide the 'maximized' answer for your
example above and also for your initial BD example.
> But even if we do this, I am still not sure if that would cover all
> cases.
>
>>
>> In the example above, the new algorithm assigns:
>> e0 -> 3
>> e1 -> 0
>> e2 -> 1
>> e3 -> 2
>>
>> Because:
>> counter 0, weight = 4
>> counter 1, weight = 3
>> counter 2, weight = 3
>> counter 3, weight = 1
>>
>> We maximize PMU usage and ensure all events are measured.
>>
>> The patch also restructures the code to separate scheduling of
>> constrained vs. unconstrained events. An unconstrained event is
>> one that can be programmed into any of the generic counters. For
>> those, we now use the simplest algorithm possible: use the next free
>> counter. There is now a fast path which is beneficial on
>> processors such as AMD64 Fam10h.
>
> I don't see the need for a differentiation between constrained and
> unconstrained events. If loops are optimized in the constrained path
> there is not much overhead anymore. This could be done by specifying
> min and max limits for ranges. Special cases (unconstrained events,
> fixed counters, etc.) make the code more complex. I don't think a good
> algorithm will need this.
>
>> @@ -553,42 +530,179 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>> if (x86_pmu.num_counters_fixed)
>> wmax++;
>>
>> - for (w = 1, num = n; num && w <= wmax; w++) {
>> - /* for each event */
>> - for (i = 0; num && i < n; i++) {
>> - c = constraints[i];
>> - hwc = &cpuc->event_list[i]->hw;
>> + /*
>> + * assign from most constrained to least constrained
>> + */
>> + for (w = 1, num = num_c; num && w <= wmax; w++) {
>> + /* for each constrained event */
>> + for (i = 0, e = c_events; i < num_c; i++, e++) {
>> +
>> + map_idx = (*e)->hw.map_idx;
>> + c = constraints[map_idx];
>>
>> if (c->weight != w)
>> continue;
>>
>> + min_wcnt = INT_MAX;
>
> Should be X86_PMC_IDX_MAX.
>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 1e9ebe5..2605244 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -564,6 +564,7 @@ struct hw_perf_event {
>> int idx;
>> int last_cpu;
>> struct hw_perf_event_extra extra_reg;
>> + int map_idx;
>
> This is not the right place. It is visible to all architectures but
> actually only used locally. A local array in x86_schedule_events()
> would work too.
>
> -Robert
>
>> };
>> struct { /* software */
>> struct hrtimer hrtimer;
>>
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 12:55 ` [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
@ 2011-11-14 14:12 ` Peter Zijlstra
2011-11-14 14:26 ` Stephane Eranian
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-14 14:12 UTC (permalink / raw)
To: Stephane Eranian
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, 2011-11-14 at 13:55 +0100, Stephane Eranian wrote:
> I have been talking with co-workers experts in operational research
> about our problem. They all pointed to me to the max flow algorithm from
> Ford-Fulkerson (search for it on Wikipedia). I think it solves the complexity
> and recursion problems. My understanding is that the complexity is also
> more under control.
How would you apply this algorithm to the problem at hand? I'm probably
missing the obvious thing here, but if we want the flow to be the number
of assigned counter then we end up with nodes being the various
permutations of assignments or so, which isn't helpful since that'd be
n! nodes.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 14:12 ` Peter Zijlstra
@ 2011-11-14 14:26 ` Stephane Eranian
2011-11-14 16:00 ` Peter Zijlstra
0 siblings, 1 reply; 30+ messages in thread
From: Stephane Eranian @ 2011-11-14 14:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
Let's take Robert's example:
c0 c1 c2
e0 x x
e1 x x
e2 x x
The nodes in the graph:
- S (starting point)
- T (the ending point)
- e0, e1, e2 (the events)
- c0, c1, c2 (the counters)
There is an edge from the source to all the events.
There is an edge from all the counters to the sink.
There is an edge between an event and a counter if
the counter can count the event.
The capacity of any edge is 1.
The max flow algorithm returns the maximum number of
events that can be measured together. If max_flow ==
number of events, we maxed out.
To find the assignment, we simply need to inspect the
outgoing edges from events and use the only edge which
has a flow == 1.
If we list all the edges for the example above, we have:
S -> e0
S -> e1
S -> e2
e0 -> c0
e0 -> c2
e1 -> c1
e1 -> c2
e2 -> c0
e2 -> c1
c0 -> T
c1 -> T
c2 -> T
then we run max_flow(S, T); here we find 3.
Given the capacity is either 0 or 1, I think we could use
bitmasks to describe the edges.
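A minimal sketch of the construction (Python, illustrative only; the kernel side would be C and bitmask-based). With unit capacities, Ford-Fulkerson on this graph degenerates into bipartite augmenting paths, so S and T never need to be materialized:

```python
# Event -> allowed counters, i.e. the middle edges of the S/T graph.
edges = {
    "e0": ["c0", "c2"],
    "e1": ["c1", "c2"],
    "e2": ["c0", "c1"],
}

def max_flow(edges):
    match = {}  # counter -> event currently routed through it

    def augment(ev, seen):
        # Push one unit of flow from S through ev to T, re-routing
        # earlier events along the residual graph if necessary.
        for c in edges[ev]:
            if c in seen:
                continue
            seen.add(c)
            if c not in match or augment(match[c], seen):
                match[c] = ev
                return True
        return False

    flow = sum(augment(ev, set()) for ev in edges)
    assign = {ev: c for c, ev in match.items()}
    return flow, assign

flow, assign = max_flow(edges)  # flow == 3: all events scheduled
```

On Robert's example this finds flow 3 (e.g. e0 -> c2, e1 -> c1, e2 -> c0), so all three events are measured, which the first-free heuristic could not achieve.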
On Mon, Nov 14, 2011 at 3:12 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-14 at 13:55 +0100, Stephane Eranian wrote:
>> I have been talking with co-workers experts in operational research
>> about our problem. They all pointed to me to the max flow algorithm from
>> Ford-Fulkerson (search for it on Wikipedia). I think it solves the complexity
>> and recursion problems. My understanding is that the complexity is also
>> more under control.
>
> How would you apply this algorithm to the problem at hand? I'm probably
> missing the obvious thing here, but if we want the flow to be the number
> of assigned counter then we end up with nodes being the various
> permutations of assignments or so, which isn't helpful since that'd be
> n! nodes.
>
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 14:26 ` Stephane Eranian
@ 2011-11-14 16:00 ` Peter Zijlstra
2011-11-14 17:39 ` Stephane Eranian
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-14 16:00 UTC (permalink / raw)
To: Stephane Eranian
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, 2011-11-14 at 15:26 +0100, Stephane Eranian wrote:
> There is an edge from the source to all the events.
> There is an edge from all the counters to the sink.
> There is an edge between an event and a counter if
> the counter can count the event.
>
> The capacity of any edge is 1.
Ah indeed.
So that gives:
E = e+e*c+c ~= O(c^2); since e<=c
V = 2+e+c ~= O(c)
Then going by:
http://en.wikipedia.org/wiki/Maximum_flow_problem
we have to stay away from Edmonds-Karp.
Ford-Fulkerson would end up being O(E * c) = O(c^3), since max |f| is c.
Which is pretty much identical to all these O(V^2 E) = O(c^3) as well.
Dinitz blocking flow with dynamic trees looks more interesting at O(c^2
log(c)). Push relabel with dynamic trees looks to be best at O(c^2),
since V^2/E ends up being c^2/c^2 = 1.
Creating the graph itself will be O(c^2) as well, due to E.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 16:00 ` Peter Zijlstra
@ 2011-11-14 17:39 ` Stephane Eranian
2011-11-14 21:43 ` Peter Zijlstra
2011-11-14 22:16 ` Peter Zijlstra
0 siblings, 2 replies; 30+ messages in thread
From: Stephane Eranian @ 2011-11-14 17:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, Nov 14, 2011 at 5:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-14 at 15:26 +0100, Stephane Eranian wrote:
>> There is an edge from the source to all the events.
>> There is an edge from all the counters to the sink.
>> There is an edge between an event and a counter if
>> the counter can count the event.
>>
>> The capacity of any edge is 1.
>
> Ah indeed.
>
> So that gives:
>
> E = e+e*c+c ~= O(c^2); since e<=c
> V = 2+e+c ~= O(c)
>
> Then going by:
>
> http://en.wikipedia.org/wiki/Maximum_flow_problem
>
> we have to stay away from Edmonds-Karp.
>
> Ford-Fulkerson would end up being O(E * c) = O(c^3), since max |f| is c.
> Which is pretty much identical to all these O(V^2 E) = O(c^3) as well.
>
> Dinitz blocking flow with dynamic trees looks more interesting at O(c^2
> log(c)). Push relabel with dynamic trees looks to be best at O(c^2),
> since V^2/E ends up being c^2/c^2 = 1.
>
> Creating the graph itself will be O(c^2) as well, due to E.
>
I think we are in the special case of a bipartite graph with unit capacities,
thus the complexity can be reduced even more.
See Special Cases in http://en.wikipedia.org/wiki/Dinic%27s_algorithm
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v3 0/2] perf, x86: handle overlapping counters
2011-11-11 14:29 ` Robert Richter
@ 2011-11-14 17:51 ` Robert Richter
2011-11-14 17:51 ` [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions Robert Richter
2011-11-14 17:51 ` [PATCH v3 2/2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter
0 siblings, 2 replies; 30+ messages in thread
From: Robert Richter @ 2011-11-14 17:51 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Stephane Eranian, Ingo Molnar, LKML, Robert Richter
I know there is an ongoing discussion about the best algorithm to be
used for this problem. Posting this version with the latest updates
anyway...
This patch set implements support for overlapping counters (patch #2).
For this the existing x86 event scheduler is reworked by introducing
helper functions (patch #1).
V2:
changes in patch #2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.
V3:
changes in patch #1:
* Added macro for_each_set_bit_cont().
* Changed function interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make the code easier to understand.
changes in patch #2:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
as return value.
-Robert
Robert Richter (2):
perf, x86: Implement event scheduler helper functions
perf, x86: Fix event scheduler for constraints with overlapping
counters
arch/x86/kernel/cpu/perf_event.c | 217 +++++++++++++++++++++++++---------
arch/x86/kernel/cpu/perf_event.h | 30 +++++-
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
include/linux/bitops.h | 10 ++-
4 files changed, 200 insertions(+), 59 deletions(-)
--
1.7.7
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions
2011-11-14 17:51 ` [PATCH v3 0/2] perf, x86: handle overlapping counters Robert Richter
@ 2011-11-14 17:51 ` Robert Richter
2011-11-16 16:02 ` Peter Zijlstra
2011-11-14 17:51 ` [PATCH v3 2/2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter
1 sibling, 1 reply; 30+ messages in thread
From: Robert Richter @ 2011-11-14 17:51 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Stephane Eranian, Ingo Molnar, LKML, Robert Richter
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).
The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.
The scheduler is now controlled through helper functions. There are
functions to initialize it, traverse the event list, find unused
counters, etc. The scheduler keeps its own state.
V3:
* Added macro for_each_set_bit_cont().
* Changed function interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make the code easier to understand.
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
---
arch/x86/kernel/cpu/perf_event.c | 174 ++++++++++++++++++++++++++------------
include/linux/bitops.h | 10 ++-
2 files changed, 129 insertions(+), 55 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 6408910..ad73725 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -488,18 +488,134 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
}
+/*
+ * Event scheduler state:
+ *
+ * Assign events iterating over all events and counters, beginning
+ * with events with least weights first. Keep the current iterator
+ * state in struct sched_state.
+ */
+struct sched_state {
+ int weight;
+ int event; /* event index */
+ int counter; /* counter index */
+ int unassigned; /* number of events to be assigned left */
+ unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+};
+
+struct perf_sched {
+ int max_weight;
+ int max_events;
+ struct event_constraint **constraints;
+ struct sched_state state;
+};
+
+/*
+ * Initialize iterator that runs through all events and counters.
+ */
+static void perf_sched_init(struct perf_sched *sched, struct event_constraint **c,
+ int num, int wmin, int wmax)
+{
+ memset(sched, 0, sizeof(*sched));
+ sched->max_events = num;
+ sched->max_weight = wmax;
+ sched->constraints = c;
+
+ sched->state.weight = wmin;
+ sched->state.unassigned = num;
+}
+
+/*
+ * Select a counter for the current event to schedule. Return true on
+ * success.
+ */
+static bool perf_sched_find_counter(struct perf_sched *sched)
+{
+ struct event_constraint *c;
+ int idx;
+
+ if (!sched->state.unassigned)
+ return false;
+
+ c = sched->constraints[sched->state.event];
+
+ /* Grab the first unused counter starting with idx */
+ idx = sched->state.counter;
+ for_each_set_bit_cont(idx, c->idxmsk, X86_PMC_IDX_MAX) {
+ if (!__test_and_set_bit(idx, sched->state.used))
+ break;
+ }
+ sched->state.counter = idx;
+
+ if (idx >= X86_PMC_IDX_MAX)
+ return false;
+
+ return true;
+}
+
+/*
+ * Go through all unassigned events and find the next one to schedule.
+ * Take events with the least weight first. Return true on success.
+ */
+static bool perf_sched_next_event(struct perf_sched *sched)
+{
+ struct event_constraint *c;
+
+ if (!sched->state.unassigned || !--sched->state.unassigned)
+ return false;
+
+ do {
+ /* next event */
+ sched->state.event++;
+ if (sched->state.event >= sched->max_events) {
+ /* next weight */
+ sched->state.event = 0;
+ sched->state.weight++;
+ if (sched->state.weight > sched->max_weight)
+ return false;
+ }
+ c = sched->constraints[sched->state.event];
+ } while (c->weight != sched->state.weight);
+
+ sched->state.counter = 0; /* start with first counter */
+
+ return true;
+}
+
+/*
+ * Assign a counter for each event.
+ */
+static int perf_assign_events(struct event_constraint **constraints, int n,
+ int wmin, int wmax, int *assign)
+{
+ struct perf_sched sched;
+
+ perf_sched_init(&sched, constraints, n, wmin, wmax);
+
+ do {
+ if (!perf_sched_find_counter(&sched))
+ break; /* failed */
+ if (assign)
+ assign[sched.state.event] = sched.state.counter;
+ } while (perf_sched_next_event(&sched));
+
+ return sched.state.unassigned;
+}
+
int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
{
struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int i, j, w, wmax, num = 0;
+ int i, wmin, wmax, num = 0;
struct hw_perf_event *hwc;
bitmap_zero(used_mask, X86_PMC_IDX_MAX);
- for (i = 0; i < n; i++) {
+ for (i = 0, wmin = X86_PMC_IDX_MAX, wmax = 0; i < n; i++) {
c = x86_pmu.get_event_constraints(cpuc, cpuc->event_list[i]);
constraints[i] = c;
+ wmin = min(wmin, c->weight);
+ wmax = max(wmax, c->weight);
}
/*
@@ -525,59 +641,11 @@ int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (assign)
assign[i] = hwc->idx;
}
- if (i == n)
- goto done;
- /*
- * begin slow path
- */
+ /* slow path */
+ if (i != n)
+ num = perf_assign_events(constraints, n, wmin, wmax, assign);
- bitmap_zero(used_mask, X86_PMC_IDX_MAX);
-
- /*
- * weight = number of possible counters
- *
- * 1 = most constrained, only works on one counter
- * wmax = least constrained, works on any counter
- *
- * assign events to counters starting with most
- * constrained events.
- */
- wmax = x86_pmu.num_counters;
-
- /*
- * when fixed event counters are present,
- * wmax is incremented by 1 to account
- * for one more choice
- */
- if (x86_pmu.num_counters_fixed)
- wmax++;
-
- for (w = 1, num = n; num && w <= wmax; w++) {
- /* for each event */
- for (i = 0; num && i < n; i++) {
- c = constraints[i];
- hwc = &cpuc->event_list[i]->hw;
-
- if (c->weight != w)
- continue;
-
- for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
- if (!test_bit(j, used_mask))
- break;
- }
-
- if (j == X86_PMC_IDX_MAX)
- break;
-
- __set_bit(j, used_mask);
-
- if (assign)
- assign[i] = j;
- num--;
- }
- }
-done:
/*
* scheduling failed or is just a simulation,
* free resources if necessary
diff --git a/include/linux/bitops.h b/include/linux/bitops.h
index a3ef66a..3c1063a 100644
--- a/include/linux/bitops.h
+++ b/include/linux/bitops.h
@@ -22,8 +22,14 @@ extern unsigned long __sw_hweight64(__u64 w);
#include <asm/bitops.h>
#define for_each_set_bit(bit, addr, size) \
- for ((bit) = find_first_bit((addr), (size)); \
- (bit) < (size); \
+ for ((bit) = find_first_bit((addr), (size)); \
+ (bit) < (size); \
+ (bit) = find_next_bit((addr), (size), (bit) + 1))
+
+/* same as for_each_set_bit() but use bit as value to start with */
+#define for_each_set_bit_cont(bit, addr, size) \
+ for ((bit) = find_next_bit((addr), (size), (bit)); \
+ (bit) < (size); \
(bit) = find_next_bit((addr), (size), (bit) + 1))
static __inline__ int get_bitmask_order(unsigned int count)
--
1.7.7
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH v3 2/2] perf, x86: Fix event scheduler for constraints with overlapping counters
2011-11-14 17:51 ` [PATCH v3 0/2] perf, x86: handle overlapping counters Robert Richter
2011-11-14 17:51 ` [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions Robert Richter
@ 2011-11-14 17:51 ` Robert Richter
1 sibling, 0 replies; 30+ messages in thread
From: Robert Richter @ 2011-11-14 17:51 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Stephane Eranian, Ingo Molnar, LKML, Robert Richter
The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with equal or higher weight, e.g. the constraints
of the AMD family 15h PMU:
counter mask weight
amd_f15_PMC30 0x09 2 <--- overlapping counters
amd_f15_PMC20 0x07 3
amd_f15_PMC53 0x38 3
The scheduler then fails to find an existing solution. Here is an
example:
event code counter failure possible solution
0x02E PMC[3,0] 0 3
0x043 PMC[2:0] 1 0
0x045 PMC[2:0] 2 1
0x046 PMC[2:0] FAIL 2
The event scheduler may not select the correct counter in the first
cycle because it would need to know which events will be scheduled
subsequently. In that case it fails to schedule the events.
To solve this, we now save the scheduler state of events with
overlapping counter constraints. If we fail to schedule the events,
we roll back to those states and try to use another free counter.
Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.
Care must be taken, as the rescheduling algorithm is O(n!), which
dramatically increases scheduling cycles on an over-committed system.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and their counter
masks must therefore be kept to a minimum. Thus, the current stack is
limited to 2 states, to limit the number of loops the algorithm takes
in the worst case.
On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.
V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.
V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
as return value.
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
---
arch/x86/kernel/cpu/perf_event.c | 45 ++++++++++++++++++++++++++++++++-
arch/x86/kernel/cpu/perf_event.h | 30 +++++++++++++++++++++-
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
3 files changed, 72 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index ad73725..68baed0e 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -503,11 +503,16 @@ struct sched_state {
unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
};
+/* Total max is X86_PMC_IDX_MAX, but we are O(n!) limited */
+#define SCHED_STATES_MAX 2
+
struct perf_sched {
int max_weight;
int max_events;
struct event_constraint **constraints;
struct sched_state state;
+ int saved_states;
+ struct sched_state saved[SCHED_STATES_MAX];
};
/*
@@ -525,11 +530,34 @@ static void perf_sched_init(struct perf_sched *sched, struct event_constraint **
sched->state.unassigned = num;
}
+static void perf_sched_save_state(struct perf_sched *sched)
+{
+ if (WARN_ON_ONCE(sched->saved_states >= SCHED_STATES_MAX))
+ return;
+
+ sched->saved[sched->saved_states] = sched->state;
+ sched->saved_states++;
+}
+
+static bool perf_sched_restore_state(struct perf_sched *sched)
+{
+ if (!sched->saved_states)
+ return false;
+
+ sched->saved_states--;
+ sched->state = sched->saved[sched->saved_states];
+
+ /* continue with next counter: */
+ clear_bit(sched->state.counter++, sched->state.used);
+
+ return true;
+}
+
/*
* Select a counter for the current event to schedule. Return true on
* success.
*/
-static bool perf_sched_find_counter(struct perf_sched *sched)
+static bool __perf_sched_find_counter(struct perf_sched *sched)
{
struct event_constraint *c;
int idx;
@@ -550,6 +578,19 @@ static bool perf_sched_find_counter(struct perf_sched *sched)
if (idx >= X86_PMC_IDX_MAX)
return false;
+ if (c->overlap)
+ perf_sched_save_state(sched);
+
+ return true;
+}
+
+static bool perf_sched_find_counter(struct perf_sched *sched)
+{
+ while (!__perf_sched_find_counter(sched)) {
+ if (!perf_sched_restore_state(sched))
+ return false;
+ }
+
return true;
}
@@ -1243,7 +1284,7 @@ static int __init init_hw_perf_events(void)
unconstrained = (struct event_constraint)
__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
- 0, x86_pmu.num_counters);
+ 0, x86_pmu.num_counters, 0);
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index b9698d4..51a985c 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -45,6 +45,7 @@ struct event_constraint {
u64 code;
u64 cmask;
int weight;
+ int overlap;
};
struct amd_nb {
@@ -151,15 +152,40 @@ struct cpu_hw_events {
void *kfree_on_online;
};
-#define __EVENT_CONSTRAINT(c, n, m, w) {\
+#define __EVENT_CONSTRAINT(c, n, m, w, o) {\
{ .idxmsk64 = (n) }, \
.code = (c), \
.cmask = (m), \
.weight = (w), \
+ .overlap = (o), \
}
#define EVENT_CONSTRAINT(c, n, m) \
- __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n))
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0)
+
+/*
+ * The overlap flag marks event constraints with overlapping counter
+ * masks. This is the case if the counter mask of such an event is not
+ * a subset of any other counter mask of a constraint with an equal or
+ * higher weight, e.g.:
+ *
+ * c_overlaps = EVENT_CONSTRAINT_OVERLAP(0, 0x09, 0);
+ * c_another1 = EVENT_CONSTRAINT(0, 0x07, 0);
+ * c_another2 = EVENT_CONSTRAINT(0, 0x38, 0);
+ *
+ * The event scheduler may not select the correct counter in the first
+ * cycle because it needs to know which subsequent events will be
+ * scheduled. It may fail to schedule the events then. So we set the
+ * overlap flag for such constraints to give the scheduler a hint which
+ * events to select for counter rescheduling.
+ *
+ * Care must be taken as the rescheduling algorithm is O(n!) which
+ * will increase scheduling cycles for an over-commited system
+ * dramatically. The number of such EVENT_CONSTRAINT_OVERLAP() macros
+ * and its counter masks must be kept at a minimum.
+ */
+#define EVENT_CONSTRAINT_OVERLAP(c, n, m) \
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1)
/*
* Constraint on the Event code.
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index aeefd45..0397b23 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -492,7 +492,7 @@ static __initconst const struct x86_pmu amd_pmu = {
static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0);
static struct event_constraint amd_f15_PMC20 = EVENT_CONSTRAINT(0, 0x07, 0);
static struct event_constraint amd_f15_PMC3 = EVENT_CONSTRAINT(0, 0x08, 0);
-static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT(0, 0x09, 0);
+static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT_OVERLAP(0, 0x09, 0);
static struct event_constraint amd_f15_PMC50 = EVENT_CONSTRAINT(0, 0x3F, 0);
static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);
--
1.7.7
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 17:39 ` Stephane Eranian
@ 2011-11-14 21:43 ` Peter Zijlstra
2011-11-16 10:28 ` Stephane Eranian
2011-11-14 22:16 ` Peter Zijlstra
1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-14 21:43 UTC (permalink / raw)
To: Stephane Eranian
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, 2011-11-14 at 18:39 +0100, Stephane Eranian wrote:
> On Mon, Nov 14, 2011 at 5:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Mon, 2011-11-14 at 15:26 +0100, Stephane Eranian wrote:
> >> There is an edge from the source to all the events.
> >> There is an edge from all the counters to the sync.
> >> There is an edge between an event and a counter, if
> >> it can count the event.
> >>
> >> The capacity of any edge is 1.
> >
> > Ah indeed.
> >
> > So that gives:
> >
> > E = e+e*c+c ~= O(c^2); since e<=c
> > V = 2+e+c ~= O(c)
> >
> > Then going by:
> >
> > http://en.wikipedia.org/wiki/Maximum_flow_problem
> >
> > we have to stay away from Edmonds-Karp.
> >
> > Ford-Fulkerson would end up being O(E * c) = O(c^3), since max |f| is c.
> > Which is pretty much identical to all these O(V^2 E) = O(c^3) as well.
> >
> > Dinitz blocking flow with dynamic trees looks more interesting at O(c^2
> > log(c)). Push relabel with dynamic trees looks to be best at O(c^2),
> > since V^2/E ends up being c^2/c^2 = 1.
> >
> > Creating the graph itself will be O(c^2) as well, due to E.
> >
> I think we are in the special case of a bi-partite graph with unit capacities,
> thus the complexity can be reduced even more.
>
> See Special Cases in http://en.wikipedia.org/wiki/Dinic%27s_algorithm
Yeah, I found that, but that still reduces to O(c^2.5) which is over the
O(c^2) of Push relabel with dynamic trees.
I haven't managed to wrap my head around this stuff well enough to even
start to have an idea if this constraint (bi-partite and unit
capacities) will have any considerable effect on the other algorithms.
Also, we don't need an exhaustive max flow solution, any flow that's
high enough to fit the required capacity will do, this too could
possibly be used to lower the (average) complexity bound.
I would really really like it to not average to O(n^3), that's just
silly expensive.
Also, do you have any objections to me merging Roberts stuff (provided
it passes review etc.) while you poke at alternative solutions? We can
always replace the stuff again if we find anything that works better.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 17:39 ` Stephane Eranian
2011-11-14 21:43 ` Peter Zijlstra
@ 2011-11-14 22:16 ` Peter Zijlstra
2011-11-16 10:06 ` Stephane Eranian
1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-14 22:16 UTC (permalink / raw)
To: Stephane Eranian
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, 2011-11-14 at 22:43 +0100, Peter Zijlstra wrote:
>
> Also, we don't need an exhaustive max flow solution, any flow that's
> high enough to fit the required capacity will do, this too could
> possibly be used to lower the (average) complexity bound.
I think that for the typically very dense graphs this in particular
could be very helpful in keeping the actual runtime low. For a fully
connected e*c it really doesn't matter how you program things, all flows
will have the same max and iterating them all is just wasting time.
We just need to figure out which of the many different algorithms are
best on average for our particular constraint sets.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 22:16 ` Peter Zijlstra
@ 2011-11-16 10:06 ` Stephane Eranian
0 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2011-11-16 10:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, Nov 14, 2011 at 11:16 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-14 at 22:43 +0100, Peter Zijlstra wrote:
>>
>> Also, we don't need an exhaustive max flow solution, any flow that's
>> high enough to fit the required capacity will do, this too could
>> possibly be used to lower the (average) complexity bound.
>
> I think that for the typically very dense graphs this in particular
> could be very helpful in keeping the actual runtime low. For a fully
> connected e*c it really doesn't matter how you program things, all flows
> will have the same max and iterating them all is just wasting time.
>
> We just need to figure out which of the many different algorithms are
> best on average for our particular constraint sets.
>
I agree, I think there are optimization opportunities.
* Re: [PATCH] perf_events: fix and improve x86 event scheduling
2011-11-14 21:43 ` Peter Zijlstra
@ 2011-11-16 10:28 ` Stephane Eranian
0 siblings, 0 replies; 30+ messages in thread
From: Stephane Eranian @ 2011-11-16 10:28 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Robert Richter, linux-kernel@vger.kernel.org, mingo@elte.hu,
ming.m.lin@intel.com, ak@linux.intel.com
On Mon, Nov 14, 2011 at 10:43 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2011-11-14 at 18:39 +0100, Stephane Eranian wrote:
>> On Mon, Nov 14, 2011 at 5:00 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Mon, 2011-11-14 at 15:26 +0100, Stephane Eranian wrote:
>> >> There is an edge from the source to all the events.
>> >> There is an edge from all the counters to the sync.
>> >> There is an edge between an event and a counter, if
>> >> it can count the event.
>> >>
>> >> The capacity of any edge is 1.
>> >
>> > Ah indeed.
>> >
>> > So that gives:
>> >
>> > E = e+e*c+c ~= O(c^2); since e<=c
>> > V = 2+e+c ~= O(c)
>> >
>> > Then going by:
>> >
>> > http://en.wikipedia.org/wiki/Maximum_flow_problem
>> >
>> > we have to stay away from Edmonds-Karp.
>> >
>> > Ford-Fulkerson would end up being O(E * c) = O(c^3), since max |f| is c.
>> > Which is pretty much identical to all these O(V^2 E) = O(c^3) as well.
>> >
>> > Dinitz blocking flow with dynamic trees looks more interesting at O(c^2
>> > log(c)). Push relabel with dynamic trees looks to be best at O(c^2),
>> > since V^2/E ends up being c^2/c^2 = 1.
>> >
>> > Creating the graph itself will be O(c^2) as well, due to E.
>> >
>> I think we are in the special case of a bi-partite graph with unit capacities,
>> thus the complexity can be reduced even more.
>>
>> See Special Cases in http://en.wikipedia.org/wiki/Dinic%27s_algorithm
>
> Yeah, I found that, but that still reduces to O(c^2.5) which is over the
> O(c^2) of Push relabel with dynamic trees.
>
> I haven't managed to wrap my head around this stuff well enough to even
> start to have an idea if this constraint (bi-partite and unit
> capacities) will have any considerable effect on the other algorithms.
>
> Also, we don't need an exhaustive max flow solution, any flow that's
> high enough to fit the required capacity will do, this too could
> possibly be used to lower the (average) complexity bound.
>
> I would really really like it to not average to O(n^3), that's just
> silly expensive.
>
> Also, do you have any objections to me merging Roberts stuff (provided
> it passes review etc.) while you poke at alternative solutions? We can
> always replace the stuff again if we find anything that works better.
>
I don't have objections to Robert's patch. It will take a bit of time to come
up with a good implementation of those max_flow algorithms.
* Re: [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions
2011-11-14 17:51 ` [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions Robert Richter
@ 2011-11-16 16:02 ` Peter Zijlstra
2011-11-16 19:23 ` Robert Richter
0 siblings, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2011-11-16 16:02 UTC (permalink / raw)
To: Robert Richter; +Cc: Stephane Eranian, Ingo Molnar, LKML
On Mon, 2011-11-14 at 18:51 +0100, Robert Richter wrote:
> @@ -22,8 +22,14 @@ extern unsigned long __sw_hweight64(__u64 w);
> #include <asm/bitops.h>
>
> #define for_each_set_bit(bit, addr, size) \
> - for ((bit) = find_first_bit((addr), (size)); \
> - (bit) < (size); \
> + for ((bit) = find_first_bit((addr), (size)); \
> + (bit) < (size); \
> + (bit) = find_next_bit((addr), (size), (bit) + 1))
> +
> +/* same as for_each_set_bit() but use bit as value to start with */
> +#define for_each_set_bit_cont(bit, addr, size) \
> + for ((bit) = find_next_bit((addr), (size), (bit)); \
> + (bit) < (size); \
> (bit) = find_next_bit((addr), (size), (bit) + 1))
So my version has the +1 for the first iteration as well; this comes
from the assumption that the bit passed in has already been dealt with
and should not be visited again, i.e. continue _after_ @bit instead of
continuing _at_ @bit.
This seems consistent with the list_*_continue primitives as well, which
will start with the element after (or before for _reverse) the given
position.
Thoughts?
* Re: [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions
2011-11-16 16:02 ` Peter Zijlstra
@ 2011-11-16 19:23 ` Robert Richter
0 siblings, 0 replies; 30+ messages in thread
From: Robert Richter @ 2011-11-16 19:23 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Stephane Eranian, Ingo Molnar, LKML
On 16.11.11 17:02:16, Peter Zijlstra wrote:
> On Mon, 2011-11-14 at 18:51 +0100, Robert Richter wrote:
> > @@ -22,8 +22,14 @@ extern unsigned long __sw_hweight64(__u64 w);
> > #include <asm/bitops.h>
> >
> > #define for_each_set_bit(bit, addr, size) \
> > - for ((bit) = find_first_bit((addr), (size)); \
> > - (bit) < (size); \
> > + for ((bit) = find_first_bit((addr), (size)); \
> > + (bit) < (size); \
> > + (bit) = find_next_bit((addr), (size), (bit) + 1))
> > +
> > +/* same as for_each_set_bit() but use bit as value to start with */
> > +#define for_each_set_bit_cont(bit, addr, size) \
> > + for ((bit) = find_next_bit((addr), (size), (bit)); \
> > + (bit) < (size); \
> > (bit) = find_next_bit((addr), (size), (bit) + 1))
>
> So my version has the +1 for the first as well, this is from the
> assumption that the bit passed in has been dealt with and should not be
> the first. ie. cont _after_ @bit instead of cont _at_ @bit.
>
> This seems consistent with the list_*_continue primitives as well, which
> will start with the element after (or before for _reverse) the given
> position.
>
> Thoughts?
The problem is that you then can't start with bit 0 unless you pass
-1, which seems insane.
You actually hit this in your code too, with
for_each_set_bit_cont(j, c->idxmsk, X86_PMC_IDX_MAX)
...
Your intention was to start with X86_PMC_IDX_MAX, the first fixed
counter, but you always started with X86_PMC_IDX_MAX+1, never reaching
the first fixed counter.
With list_*_continue it is slightly different, because there is the
list header that *points* to the first element. Thus it can also start
with the first element of the list by passing the list header.
In the end, passing the first bit to start with seems more logical.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
end of thread, other threads:[~2011-11-16 19:23 UTC | newest]
Thread overview: 30+ messages
2011-11-07 11:01 [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2011-11-07 11:55 ` Peter Zijlstra
2011-11-07 12:10 ` Peter Zijlstra
2011-11-07 13:52 ` Stephane Eranian
2011-11-07 14:56 ` Peter Zijlstra
2011-11-10 14:37 ` Peter Zijlstra
2011-11-10 15:09 ` Stephane Eranian
2011-11-10 16:41 ` Gleb Natapov
2011-11-10 16:59 ` Stephane Eranian
2011-11-10 17:13 ` Gleb Natapov
2011-11-10 16:59 ` Robert Richter
2011-11-10 18:31 ` Peter Zijlstra
2011-11-10 18:03 ` Robert Richter
2011-11-10 18:41 ` Peter Zijlstra
2011-11-10 18:52 ` Peter Zijlstra
2011-11-11 14:29 ` Robert Richter
2011-11-14 17:51 ` [PATCH v3 0/2] perf, x86: handle overlapping counters Robert Richter
2011-11-14 17:51 ` [PATCH v3 1/2] perf, x86: Implement event scheduler helper functions Robert Richter
2011-11-16 16:02 ` Peter Zijlstra
2011-11-16 19:23 ` Robert Richter
2011-11-14 17:51 ` [PATCH v3 2/2] perf, x86: Fix event scheduler for constraints with overlapping counters Robert Richter
2011-11-14 12:55 ` [PATCH] perf_events: fix and improve x86 event scheduling Stephane Eranian
2011-11-14 14:12 ` Peter Zijlstra
2011-11-14 14:26 ` Stephane Eranian
2011-11-14 16:00 ` Peter Zijlstra
2011-11-14 17:39 ` Stephane Eranian
2011-11-14 21:43 ` Peter Zijlstra
2011-11-16 10:28 ` Stephane Eranian
2011-11-14 22:16 ` Peter Zijlstra
2011-11-16 10:06 ` Stephane Eranian