public inbox for linux-kernel@vger.kernel.org
* [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
@ 2010-05-06 14:03 Stephane Eranian
  2010-05-06 14:20 ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2010-05-06 14:03 UTC (permalink / raw)
  To: LKML
  Cc: Peter Zijlstra, mingo, Paul Mackerras,
	Frédéric Weisbecker, David S. Miller, perfmon2-devel

Hi,

Looking at ctx_flexible_sched_in(), the logic is that if group_sched_in()
fails for a HW group, then no other HW group in the list is even tried.
I don't understand this restriction. Groups are independent of each other.
The failure of one group should not block others from being scheduled,
otherwise you under-utilize the PMU.

What is the reason for this restriction? Can we lift it somehow?

static void
ctx_flexible_sched_in(struct perf_event_context *ctx,
                      struct perf_cpu_context *cpuctx)
{
        struct perf_event *event;
        int can_add_hw = 1;

        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                /* Ignore events in OFF or ERROR state */
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                /*
                 * Listen to the 'cpu' scheduling filter constraint
                 * of events:
                 */
                if (event->cpu != -1 && event->cpu != smp_processor_id())
                        continue;

                /*
                 * group_sched_in() returns non-zero on failure; a single
                 * failed HW group clears can_add_hw and blocks all
                 * subsequent HW groups.
                 */
                if (group_can_go_on(event, cpuctx, can_add_hw))
                        if (group_sched_in(event, cpuctx, ctx))
                                can_add_hw = 0;
        }
}


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 14:03 [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization Stephane Eranian
@ 2010-05-06 14:20 ` Peter Zijlstra
  2010-05-06 14:41   ` Stephane Eranian
  2010-05-06 17:11   ` Frederic Weisbecker
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-06 14:20 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, mingo, Paul Mackerras, Frédéric Weisbecker,
	David S. Miller

On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
> Hi,
> 
> Looking at ctx_flexible_sched_in(), the logic is that if  group_sched_in()
> fails for a HW group, then no other HW group in the list is even tried.
> I don't understand this restriction. Groups are independent of each other.
> The failure of one group should not block others from being scheduled,
> otherwise you under-utilize the PMU.
> 
> What is the reason for this restriction? Can we lift it somehow?

Sure, but it will make scheduling much more expensive. The current
scheme will only ever check the first N events because it stops at the
first that fails, and since you can fit at most N events on the PMU it's
constant time.

To fix this issue you'd have to basically always iterate all events and
only stop once the PMU is fully booked, which reduces to an O(n) worst
case algorithm.
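
Roughly, the greedy variant would look something like this (untested
sketch; pmu_is_full() is a made-up stand-in for the "fully booked"
test):

        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                if (event->cpu != -1 && event->cpu != smp_processor_id())
                        continue;

                /* keep trying later groups instead of stopping at a failure */
                group_sched_in(event, cpuctx, ctx);

                /* only stop once the PMU is fully booked */
                if (pmu_is_full(cpuctx))
                        break;
        }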

But yeah, I did think of making the thing an RB-tree and basically
schedule on service received, that should fix the lop-sided RR we get
with constrained events.


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 14:20 ` Peter Zijlstra
@ 2010-05-06 14:41   ` Stephane Eranian
  2010-05-06 15:08     ` Peter Zijlstra
  2010-05-06 17:11   ` Frederic Weisbecker
  1 sibling, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2010-05-06 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, mingo, Paul Mackerras, Frédéric Weisbecker,
	David S. Miller

On Thu, May 6, 2010 at 4:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
>> Hi,
>>
>> Looking at ctx_flexible_sched_in(), the logic is that if  group_sched_in()
>> fails for a HW group, then no other HW group in the list is even tried.
>> I don't understand this restriction. Groups are independent of each other.
>> The failure of one group should not block others from being scheduled,
>> otherwise you under-utilize the PMU.
>>
>> What is the reason for this restriction? Can we lift it somehow?
>
> Sure, but it will make scheduling much more expensive. The current
> scheme will only ever check the first N events because it stops at the
> first that fails, and since you can fit at most N events on the PMU it's
> constant time.
>
You may fail not because the PMU is full but because an event is incompatible
with the others, i.e., there may still be room for more events. By relying on
the RR to get coverage for all events, you also lengthen the blind spots for
events which have been skipped. Longer blind spots imply less accuracy when
you scale.

> To fix this issue you'd have to basically always iterate all events and
> only stop once the PMU is fully booked, which reduces to an O(n) worst
> case algorithm.
>

Yes, but if you have X events and you don't know if you have at least N
that are compatible with each other, then you have to scan the whole list.

> But yeah, I did think of making the thing an RB-tree and basically
> schedule on service received, that should fix the lop-sided RR we get
> with constrained events.
>


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 14:41   ` Stephane Eranian
@ 2010-05-06 15:08     ` Peter Zijlstra
  2010-05-06 16:26       ` Stephane Eranian
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-06 15:08 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: LKML, mingo, Paul Mackerras, Frédéric Weisbecker,
	David S. Miller

On Thu, 2010-05-06 at 16:41 +0200, Stephane Eranian wrote:
> On Thu, May 6, 2010 at 4:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
> >> Hi,
> >>
> >> Looking at ctx_flexible_sched_in(), the logic is that if  group_sched_in()
> >> fails for a HW group, then no other HW group in the list is even tried.
> >> I don't understand this restriction. Groups are independent of each other.
> >> The failure of one group should not block others from being scheduled,
> >> otherwise you under-utilize the PMU.
> >>
> >> What is the reason for this restriction? Can we lift it somehow?
> >
> > Sure, but it will make scheduling much more expensive. The current
> > scheme will only ever check the first N events because it stops at the
> > first that fails, and since you can fit at most N events on the PMU it's
> > constant time.
> >
> You may fail not because the PMU is full but because an event is incompatible
> with the others, i.e., there may still be room for more events. By relying on
> the RR to get coverage for all events, you also lengthen the blind spots for
> events which have been skipped. Longer blind spots imply less accuracy when
> you scale.
> 
> > To fix this issue you'd have to basically always iterate all events and
> > only stop once the PMU is fully booked, which reduces to an O(n) worst
> > case algorithm.
> >
> 
> Yes, but if you have X events and you don't know if you have at least N
> that are compatible with each other, then you have to scan the whole list.

I'm not sure why you're arguing, you asked why it did as it did, I gave
an answer ;-)

I agree it's not optimal, but fixing it isn't trivial. I would very much
like to avoid a full O(n) loop over all events, especially since creating
them is a non-privileged operation.

So what we can look at is trying to do better, and making it a
service-based scheduler instead of a strict RR should at least get a
more equal distribution.

Another thing we can do is quit at the second or third fail.
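
E.g. something like this, reusing the loop from your original mail
(untested sketch):

static void
ctx_flexible_sched_in(struct perf_event_context *ctx,
                      struct perf_cpu_context *cpuctx)
{
        struct perf_event *event;
        int fails = 0;  /* tolerate a few failed groups, not just one */

        list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
                if (event->state <= PERF_EVENT_STATE_OFF)
                        continue;
                if (event->cpu != -1 && event->cpu != smp_processor_id())
                        continue;

                /* only stop trying HW groups after the third failure */
                if (group_can_go_on(event, cpuctx, fails < 3))
                        if (group_sched_in(event, cpuctx, ctx))
                                fails++;
        }
}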


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 15:08     ` Peter Zijlstra
@ 2010-05-06 16:26       ` Stephane Eranian
  0 siblings, 0 replies; 16+ messages in thread
From: Stephane Eranian @ 2010-05-06 16:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, mingo, Paul Mackerras, Frédéric Weisbecker,
	David S. Miller

On Thu, May 6, 2010 at 5:08 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-06 at 16:41 +0200, Stephane Eranian wrote:
>> On Thu, May 6, 2010 at 4:20 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
>> >> Hi,
>> >>
>> >> Looking at ctx_flexible_sched_in(), the logic is that if  group_sched_in()
>> >> fails for a HW group, then no other HW group in the list is even tried.
>> >> I don't understand this restriction. Groups are independent of each other.
>> >> The failure of one group should not block others from being scheduled,
>> >> otherwise you under-utilize the PMU.
>> >>
>> >> What is the reason for this restriction? Can we lift it somehow?
>> >
>> > Sure, but it will make scheduling much more expensive. The current
>> > scheme will only ever check the first N events because it stops at the
>> > first that fails, and since you can fit at most N events on the PMU it's
>> > constant time.
>> >
>> You may fail not because the PMU is full but because an event is incompatible
>> with the others, i.e., there may still be room for more events. By relying on
>> the RR to get coverage for all events, you also lengthen the blind spots for
>> events which have been skipped. Longer blind spots imply less accuracy when
>> you scale.
>>
>> > To fix this issue you'd have to basically always iterate all events and
>> > only stop once the PMU is fully booked, which reduces to an O(n) worst
>> > case algorithm.
>> >
>>
>> Yes, but if you have X events and you don't know if you have at least N
>> that are compatible with each other, then you have to scan the whole list.
>
> I'm not sure why you're arguing, you asked why it did as it did, I gave
> an answer ;-)
>
yes, you did.

> I agree it's not optimal, but fixing it isn't trivial. I would very much
> like to avoid a full O(n) loop over all events, especially since creating
> them is a non-privileged operation.
>
I understand that.

> So what we can look at is trying to do better, and making it a service
> based scheduler instead of a strict RR should at least get a more equal
> distribution.
>
How would you define "service" here?

> Another thing we can do is quit at the second or third fail.
>
Not sure if this would be any better.


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 14:20 ` Peter Zijlstra
  2010-05-06 14:41   ` Stephane Eranian
@ 2010-05-06 17:11   ` Frederic Weisbecker
  2010-05-06 17:30     ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Frederic Weisbecker @ 2010-05-06 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, David S. Miller

On Thu, May 06, 2010 at 04:20:40PM +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-06 at 16:03 +0200, Stephane Eranian wrote:
> > Hi,
> > 
> > Looking at ctx_flexible_sched_in(), the logic is that if  group_sched_in()
> > fails for a HW group, then no other HW group in the list is even tried.
> > I don't understand this restriction. Groups are independent of each other.
> > The failure of one group should not block others from being scheduled,
> > otherwise you under-utilize the PMU.
> > 
> > What is the reason for this restriction? Can we lift it somehow?
> 
> Sure, but it will make scheduling much more expensive. The current
> scheme will only ever check the first N events because it stops at the
> first that fails, and since you can fit at most N events on the PMU it's
> constant time.
> 
> To fix this issue you'd have to basically always iterate all events and
> only stop once the PMU is fully booked, which reduces to an O(n) worst
> case algorithm.
> 
> But yeah, I did think of making the thing an RB-tree and basically
> schedule on service received, that should fix the lop-sided RR we get
> with constrained events.


I don't understand what you mean by schedule on service received, and why
an rbtree would solve that.

Unless you're thinking about giving a weight to a group that has hardware
events. This weight would be the number of "slots" a group uses in the
hardware PMU, and you could compare this weight against the remaining
available slots in the PMU?

So yeah, if the hw groups of events are sorted by weight, once one fails
we know the following ones will fail too. But that doesn't seem like the
right solution, as it is always going to give more chances to low-weight
groups and far fewer opportunities for the heavy ones to be scheduled.



* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 17:11   ` Frederic Weisbecker
@ 2010-05-06 17:30     ` Peter Zijlstra
  2010-05-07  8:25       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-06 17:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, David S. Miller

On Thu, 2010-05-06 at 19:11 +0200, Frederic Weisbecker wrote:

> > But yeah, I did think of making the thing an RB-tree and basically
> > schedule on service received, that should fix the lop-sided RR we get
> > with constrained events.

> I don't understand what you mean by schedule on service received, and why
> an rbtree would solve that.

Schedule those events that got scheduled least. If, because of
constraints, we didn't fully utilize the PMU, it is very likely that
strict RR (like we do now) will not end up giving equal service to each
counter/group.

Therefore, if you sort them in a tree, based on the amount of time they
got on the PMU, and always schedule the leftmost, you do get fairness.

Sure you can go add weights to them, but that's not the immediate goal.
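
As a sketch of the idea (invented field names, not actual code): keep
the events in an rbtree ordered by service received and always start
from the leftmost node.

/*
 * ev->service_node is a made-up field; the tree is ordered by the
 * PMU time each event has received so far.
 */
static struct perf_event *pick_least_serviced(struct rb_root *tree)
{
        struct rb_node *left = rb_first(tree);  /* least service so far */

        if (!left)
                return NULL;

        return rb_entry(left, struct perf_event, service_node);
}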



* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-06 17:30     ` Peter Zijlstra
@ 2010-05-07  8:25       ` Peter Zijlstra
  2010-05-07  8:44         ` Peter Zijlstra
  2010-05-07  9:37         ` Stephane Eranian
  0 siblings, 2 replies; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-07  8:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, David S. Miller

On Thu, 2010-05-06 at 19:30 +0200, Peter Zijlstra wrote:
> On Thu, 2010-05-06 at 19:11 +0200, Frederic Weisbecker wrote:
> 
> > > But yeah, I did think of making the thing an RB-tree and basically
> > > schedule on service received, that should fix the lop-sided RR we get
> > > with constrained events.
> 
> > I don't understand what you mean by schedule on service received, and why
> > an rbtree would solve that.
> 
> Schedule those events that got scheduled least. If, because of
> constraints, we didn't fully utilize the PMU, it is very likely that
> strict RR (like we do now) will not end up giving equal service to each
> counter/group.
> 
> Therefore, if you sort them in a tree, based on the amount of time they
> got on the PMU, and always schedule the leftmost, you do get fairness.


OK, a simple example: suppose we have 3 events, A, B and C. A and B are
exclusive. With the current RR scheduling we get:

/ {A}, {B, C}, {C, A} /

Which isn't fair, because we get a 2:1:2 ratio.

If we were to do as Stephane suggested and always schedule the PMU in a
greedy fashion, we'd get:

/ {A, C}, {B, C} /

Which isn't fair either: 1:1:2.


The perfect case would be if we could always schedule all events on the
PMU; that way they'd get equal service. But since we cannot, we must
approximate that.

If we define lag to be the difference between perfect service and our
approximation thereof: lag_i = S - s_i, then for a scheduler to be fair
we must place two conditions thereon:

 1) There must be a bound on the maximal lag_i, this ensures progress.
 2) The sum of lags must be zero, this ensures we converge to the ideal
    service. [*]

We can satisfy 1 by always at least scheduling the event which has thus
far received the least service; this ensures everybody makes progress,
since by definition everybody will at some point be that one.

This by itself however is not enough, since we can schedule multiple
events. Let's again try a greedy model with the previous example. That
would again reduce to the unfair:

/ {A, C}, {B, C} /

Simply because C is always schedulable and A and B alternate between
being the one who received least service.

This would however violate 2, since C will always be scheduled, its lag
will decrease without bound.

We can fix this by explicitly not scheduling C when it's not eligible.
That is, we only consider eligible events for placement on the PMU.
Eligibility means an event has received less service than it would have
had under the perfect model.

The second condition will help us here: \Sum_i (S - s_i) = 0. That can be
written as: n*S - \Sum s_i = 0, where n = \Sum_i 1. From this it
trivially follows that: S = 1/n \Sum s_i, or in other words S =
avg(s_i).

So eligibility can be expressed as: s_i < avg(s_i).

With this, we will get a schedule like:

/ {A, C}, {B} /

We are however still fully greedy, which is still O(n), which we don't
want. However, if we stop being greedy and use the same heuristic we do
now (stop filling the PMU at the first fail), we'll still be fair,
because the algorithm ensures that.


[*] it might be that these two conditions are really the same one.



* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07  8:25       ` Peter Zijlstra
@ 2010-05-07  8:44         ` Peter Zijlstra
  2010-05-07  9:37         ` Stephane Eranian
  1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-07  8:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Stephane Eranian, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, 2010-05-07 at 10:25 +0200, Peter Zijlstra wrote:
> We are however still fully greedy, which is still O(n), which we don't
> want. However if we stop being greedy and use the same heuristic we do
> now, stop filling the PMU at the first fail, we'll still be fair,
> because the algorithm ensures that.
> 
FWIW the non-greedy algorithm is O(m * log(n)), where m is the number of
counters on the PMU and n the number of competing events, since m is
pretty much a constant, we can say the full algorithm is O(log n).





* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07  8:25       ` Peter Zijlstra
  2010-05-07  8:44         ` Peter Zijlstra
@ 2010-05-07  9:37         ` Stephane Eranian
  2010-05-07 10:06           ` Peter Zijlstra
  1 sibling, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2010-05-07  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, May 7, 2010 at 10:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, 2010-05-06 at 19:30 +0200, Peter Zijlstra wrote:
>> On Thu, 2010-05-06 at 19:11 +0200, Frederic Weisbecker wrote:
>>
>> > > But yeah, I did think of making the thing an RB-tree and basically
>> > > schedule on service received, that should fix the lop-sided RR we get
>> > > with constrained events.
>>
>> > I don't understand what you mean by schedule on service received, and why
>> > an rbtree would solve that.
>>
>> Schedule those events that got scheduled least. If, because of
>> constraints, we didn't fully utilize the PMU, it is very likely that
>> strict RR (like we do now) will not end up giving equal service to each
>> counter/group.
>>
>> Therefore, if you sort them in a tree, based on the amount of time they
>> got on the PMU, and always schedule the leftmost, you do get fairness.
>
>
> OK a simple example, suppose we have 3 events, A, B and C. A and B are
> exclusive. With the current RR scheduling we get:
>
> / {A}, {B, C}, {C, A} /
>
> Which isn't fair, because we get a 2:1:2 ratio.
>
> If we were to do as Stephane suggested and always schedule the PMU in a
> greedy fashion, we'd get:
>
> / {A, C}, {B, C} /
>
> Which isn't fair either 1:1:2.
>
>
> The perfect case would be if we could always schedule all events on the
> PMU; that way they'd get equal service. But since we cannot, we must
> approximate that.
>
> If we define lag to be the difference between perfect service and our
> approximation thereof: lag_i = S - s_i, then for a scheduler to be fair
> we must place two conditions thereon:
>

I assume S represents the time an event would be on the PMU in the
case of perfect scheduling. And thus S is the same for all events. The
index i represents the event index.

If I have 3 events, regardless of their constraints and grouping, the perfect
service would be that each gets 1/3rd of the time.

>  1) There must be a bound on the maximal lag_i, this ensures progress.
>  2) The sum of lags must be zero, this ensures we converge to the ideal
>    service. [*]
>
> We can satisfy 1 by always at least scheduling the event which has thus
> far received the least service; this ensures everybody makes progress,
> since by definition everybody will at some point be that one.
>
> This by itself however is not enough, since we can schedule multiple
> events. Let's again try a greedy model with the previous example. That
> would again reduce to the unfair:
>
> / {A, C}, {B, C} /
>
> Simply because C is always schedulable and A and B alternate between
> being the one who received least service.
>
> This would however violate 2, since C will always be scheduled, its lag
> will decrease without bound.
>
> We can fix this by explicitly not scheduling C when it's not eligible.
> That is, we only consider eligible events for placement on the PMU.
> Eligibility means an event has received less service than it would have
> had under the perfect model.
>
> The second condition will help us here: \Sum_i (S - s_i) = 0. That can be
> written as: n*S - \Sum s_i = 0, where n = \Sum_i 1. From this it
> trivially follows that: S = 1/n \Sum s_i, or in other words S =
> avg(s_i).
>
> So eligibility can be expressed as: s_i < avg(s_i).
>
Which would mean: if my total time on the PMU is less than the average
time on the PMU for all events thus far, then "schedule me now".

You would have to sort the events by increasing s_i (using the RB tree, I assume).

> With this, we will get a schedule like:
>
> / {A, C}, {B} /
>
> We are however still fully greedy, which is still O(n), which we don't
> want. However if we stop being greedy and use the same heuristic we do
> now, stop filling the PMU at the first fail, we'll still be fair,
> because the algorithm ensures that.
>
Let's see if I understand with an example. Assume the PMU multiplex
timing is 1ms, 2 counters. s(n) = total time in ms at time n.

evt   A  B  C
s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, C,
s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, fail on B
s(3)  2   1  1 -> avg = 4/3=1.33, sort = B, C, A, schedule B, C
s(4)  2   2  2 -> avg = 6/3=2.00, sort = A, B, C, schedule A, fail on B
s(5)  3   2  2 -> avg = 7/3=2.33, sort = B, C, A, schedule B, C

What if there are no constraints on all 3 events?

evt   A  B  C
s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, B
s(1)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule C (A, B > avg)
s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, B
s(3)  2   2  1 -> avg = 5/3=1.66, sort = C, A, B, schedule C (A, B > avg)
s(4)  2   2  2 -> avg = 6/3=2.00, sort = B, C, A, schedule B, C
s(5)  2   3  3 -> avg = 8/3=2.66, sort = A, B, C, schedule A (B, C > avg)
s(6)  3   3  3 -> avg = 9/3=3.00, sort = A, B, C, schedule A, B

When all timings are equal, the sort could yield any order; it would not
matter because over time each event will be scheduled if it lags.

Am I understanding your algorithm right?


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07  9:37         ` Stephane Eranian
@ 2010-05-07 10:06           ` Peter Zijlstra
  2010-05-07 10:49             ` Stephane Eranian
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-07 10:06 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, 2010-05-07 at 11:37 +0200, Stephane Eranian wrote:

> > If we define lag to be the difference between perfect service and our
> > approximation thereof: lag_i = S - s_i, then for a scheduler to be fair
> > we must place two conditions thereon:
> >
> 
> I assume S represents the time an event would be on the PMU in the
> case of perfect scheduling. And thus S is the same for all events. The
> index i represents the event index.

Ah indeed, I should have clarified that.

> > So eligibility can be expressed as: s_i < avg(s_i).
> >
> Which would mean: if my total time on PMU is less than the average
> time on the PMU for all events thus far, then "schedule me now".

Yes, although I would state the action like: "consider me for
scheduling", since there might not be room for all eligible events on
the PMU.

[ If you start adding weights (like we do for task scheduling) this
becomes a weighted average. ]

> You would have to sort the events by increasing s_i (using the RB tree, I assume).

Exactly.

> > With this, we will get a schedule like:
> >
> > / {A, C}, {B} /
> >
> > We are however still fully greedy, which is still O(n), which we don't
> > want. However if we stop being greedy and use the same heuristic we do
> > now, stop filling the PMU at the first fail, we'll still be fair,
> > because the algorithm ensures that.
> >
> Let's see if I understand with an example. Assume the PMU multiplex
> timing is 1ms, 2 counters. s(n) = total time in ms at time n.
> 
> evt   A  B  C
> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
> s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, C,
> s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, fail on B
> s(3)  2   1  1 -> avg = 4/3=1.33, sort = B, C, A, schedule B, C
> s(4)  2   2  2 -> avg = 6/3=2.00, sort = A, B, C, schedule A, fail on B
> s(5)  3   2  2 -> avg = 7/3=2.33, sort = B, C, A, schedule B, C
> 
> What if there are no constraints on all 3 events?
> 
> evt   A  B  C
> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, B
> s(1)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule C (A, B > avg)
> s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, B
> s(3)  2   2  1 -> avg = 5/3=1.66, sort = C, A, B, schedule C (A, B > avg)
> s(4)  2   2  2 -> avg = 6/3=2.00, sort = B, C, A, schedule B, C
> s(5)  2   3  3 -> avg = 8/3=2.66, sort = A, B, C, schedule A (B, C > avg)
> s(6)  3   3  3 -> avg = 9/3=3.00, sort = A, B, C, schedule A, B
> 
> When all timings are equal, the sort could yield any order; it would not
> matter because over time each event will be scheduled if it lags.
> 
> Am I understanding your algorithm right?

Perfectly!

So the ramification of not using a greedy algorithm is that the
potential schedule of constrained events/groups gets longer than is
absolutely required, but I think that is something we'll have to live
with, since O(n) just isn't a nice option.

This can be illustrated if we consider B to be exclusive with both A and
C, in that case we could end up with:

/ {A}, {B}, {C} /

instead of

/ {A, C}, {B} /

Depending on the order in which we find events sorted.



* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07 10:06           ` Peter Zijlstra
@ 2010-05-07 10:49             ` Stephane Eranian
  2010-05-07 11:15               ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2010-05-07 10:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, May 7, 2010 at 12:06 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-05-07 at 11:37 +0200, Stephane Eranian wrote:
>
>> > If we define lag to be the difference between perfect service and our
>> > approximation thereof: lag_i = S - s_i, then for a scheduler to be fair
>> > we must place two conditions thereon:
>> >
>>
>
>> > So eligibility can be expressed as: s_i < avg(s_i).
>> >
>> Which would mean: if my total time on PMU is less than the average
>> time on the PMU for all events thus far, then "schedule me now".
>
Agreed.

>
>> You would have to sort the events by increasing s_i (using the RB tree, I assume).
>
> Exactly.
>
You'd have to insert all events into the tree and read the leftmost.
I believe you need more than just basic integer arithmetic to compare
s_i to avg, or you need to tweak the values to get more precision. But
you may already be doing that elsewhere in the kernel.

>> > With this, we will get a schedule like:
>> >
>> > / {A, C}, {B} /
>> >
>> > We are however still fully greedy, which is still O(n), which we don't
>> > want. However if we stop being greedy and use the same heuristic we do
>> > now, stop filling the PMU at the first fail, we'll still be fair,
>> > because the algorithm ensures that.
>> >
>> Let's see if I understand with an example. Assume the PMU multiplex
>> timing is 1ms, 2 counters. s(n) = total time in ms at time n.
>>
>> evt   A  B  C
>> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
>> s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, C,
>> s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, fail on B
>> s(3)  2   1  1 -> avg = 4/3=1.33, sort = B, C, A, schedule B, C
>> s(4)  2   2  2 -> avg = 6/3=2.00, sort = A, B, C, schedule A, fail on B
>> s(5)  3   2  2 -> avg = 7/3=2.33, sort = B, C, A, schedule B, C
>>
>> What if there are no constraints on all 3 events?
>>
>> evt   A  B  C
>> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, B
>> s(1)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule C (A, B > avg)
>> s(2)  1   1  1 -> avg = 3/3=1.00, sort = A, B, C, schedule A, B
>> s(3)  2   2  1 -> avg = 5/3=1.66, sort = C, A, B, schedule C (A, B > avg)
>> s(4)  2   2  2 -> avg = 6/3=2.00, sort = B, C, A, schedule B, C
>> s(5)  2   3  3 -> avg = 8/3=2.66, sort = A, B, C, schedule A (B, C > avg)
>> s(6)  3   3  3 -> avg = 9/3=3.00, sort = A, B, C, schedule A, B
>>
>> When all timings are equal, the sort could yield any order; it would not
>> matter because over time each event will be scheduled if it lags.
>>
>> Am I understanding your algorithm right?
>
> Perfectly!
>
> So the ramification of not using a greedy algorithm is that the
> potential schedule of constrained events/groups gets longer than is
> absolutely required, but I think that is something we'll have to live
> with, since O(n) just isn't a nice option.
>
> This can be illustrated if we consider B to be exclusive with both A and
> C, in that case we could end up with:
>
> / {A}, {B}, {C} /
>
> instead of
>
> / {A, C}, {B} /
>
> Depending on the order in which we find events sorted.

Yes. Not clear how you could avoid this without having a global
view of the dependencies between events (which are really event
groups, BTW). Here you'd need to know that if you have
evt   A  B  C
s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, fail on C

You'd have two options:
   s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule A, C
or
   s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, B, A  schedule C

The first is more attractive because it shortens the blind spots on C.
Both are equally fair, though. Looks like you'd need to add a 2nd
parameter to the sort when the s_i are equal. That would have to be
related to the number of constraints. You could sort in reverse order
of the number of constraints, assuming you can express the constraint
as a number. For simple constraints, such as counter restrictions, you
could simply compare the number of possible counters between events.
But there are PMUs where there are no counter constraints and yet
events are incompatible. What values do you compare in this case?
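
(For the simple counter-restriction case, the comparison could be as
trivial as the sketch below; cntr_mask is a hypothetical field holding
the bitmask of counters an event can live on.)

/*
 * Hypothetical tie-break for events with equal service: fewer usable
 * counters = more constrained = should be scheduled first.
 */
static int constraint_tie_break(struct perf_event *a, struct perf_event *b)
{
        return hweight64(a->cntr_mask) - hweight64(b->cntr_mask);
}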


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07 10:49             ` Stephane Eranian
@ 2010-05-07 11:15               ` Peter Zijlstra
  2010-05-10  9:41                 ` Stephane Eranian
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-07 11:15 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, 2010-05-07 at 12:49 +0200, Stephane Eranian wrote:
> You'd have to insert all events into the tree and read the leftmost.
> I believe you need more than just basic integer arithmetic to compare
> s_i to avg, or you need to tweak the values to get more precision. But
> you may already be doing that elsewhere in the kernel.

I've got a modification to CFS which implements EEVDF which needs
similar eligibility checks. So yeah, I've got code to do this.

The trick to a computable avg is to keep a monotonic min_s around and use
ds_i = s_i - min_s. These ds_i will be 'small', on the order of the max
lag.

We can thus keep a sum of ds_i up-to-date when inserting/removing events
from the tree without fear of overflowing our integer.

When we update min_s, we must also update our relative sum
proportionally and in the opposite direction.

Comparing for eligibility can be done by:

s_i < 1/n \Sum s_j, or s_i - min_s < 1/n \Sum (s_j - min_s), which we can
write as: n*ds_i < \Sum ds_j

Again, this can be done without fear of overflows because ds_i is small.
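
In code the bookkeeping could look something like this (sketch only,
struct and field names invented):

struct service_queue {
        u64     min_s;          /* monotonic reference service */
        s64     rel_sum;        /* \Sum ds_i over all waiting events */
        int     nr;             /* number of waiting events */
};

/* ds_i = s_i - min_s; stays small, on the order of the max lag */
static inline s64 ds(struct service_queue *q, u64 s_i)
{
        return (s64)(s_i - q->min_s);
}

/* s_i < avg(s_i)  <=>  n*ds_i < \Sum ds_j */
static inline bool eligible(struct service_queue *q, u64 s_i)
{
        return (s64)q->nr * ds(q, s_i) < q->rel_sum;
}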

> Yes. Not clear how you could avoid this without having a global
> view of the dependencies between events (which are really event
> groups, BTW). Here you'd need to know that if you have
> evt   A  B  C
> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
> s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, fail on C
> 
> You'd have two options:
>    s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule A, C
> or
>    s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, B, A  schedule C
> 
> The first is more attractive because it shortens the blind spots on C.
> Both are equally fair, though. Looks like you'd need to add a 2nd
> parameter to the sort when the s_i are equal. That would have to be
> related to the number of constraints. You could sort in reverse order
> of the number of constraints, assuming you can express the constraint
> as a number. For simple constraints, such as counter restrictions, you
> could simply compare the number of possible counters between events.
> But there are PMUs where there are no counter constraints and yet
> events are incompatible. What values do you compare in this case?

Not sure, but yeah, using constraint data to tie-break is indeed an
interesting option.

I wonder how much tie-breaking we'll really need in practice though: if
we use event->total_time_running as our s_i we've got ns-resolution
timestamps, and with sub-jiffies preemption being common I doubt we'll
actually see a lot of equal service numbers.



* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-07 11:15               ` Peter Zijlstra
@ 2010-05-10  9:41                 ` Stephane Eranian
  2010-05-14 14:55                   ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Stephane Eranian @ 2010-05-10  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, May 7, 2010 at 1:15 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, 2010-05-07 at 12:49 +0200, Stephane Eranian wrote:
>> You'd have to insert all events into the tree and read the leftmost.
>> I believe you need more than just basic integer arithmetic to compare
>> s_i to avg, or you need to tweak the values to get more precision. But
>> you may already be doing that elsewhere in the kernel.
>
> I've got a modification to CFS which implements EEVDF which needs
> similar eligibility checks. So yeah, I've got code to do this.
>
> The trick to a computable avg is to keep a monotonic min_s around and use
> ds_i = s_i - min_s. These ds_i will be 'small', on the order of the max
> lag.
>
> We can thus keep a sum of ds_i up-to-date when inserting/removing events
> from the tree without fear of overflowing our integer.
>
> When we update min_s, we must also update our relative sum
> proportionally and in the opposite direction.
>
> Comparing for eligibility can be done by:
>
> s_i < 1/n \Sum s_j, or s_i - min_s < 1/n \Sum (s_j - min_s), which we can
> write as: n*ds_i < \Sum ds_j
>
> Again, this can be done without fear of overflows because ds_i is small.
>
Ok, that's what I thought. You had that for the scheduler.

Looks like a good solution, at least better than what is there right now.

>> Yes. Not clear how you could avoid this without having a global
>> view of the dependencies between events (which are really event
>> groups, BTW). Here you'd need to know that if you have
>> evt   A  B  C
>> s(0)  0   0  0 -> avg = 0/3=0.00, sort = A, B, C, schedule A, fail on B
>> s(1)  1   0  0 -> avg = 1/3=0.33, sort = B, C, A, schedule B, fail on C
>>
>> You'd have two options:
>>    s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, A, B, schedule A, C
>> or
>>    s(2)  1   1  0 -> avg = 2/3=0.66, sort = C, B, A  schedule C
>>
>> The first is more attractive because it shortens the blind spots on C.
>> Both are equally fair, though. Looks like you'd need to add a 2nd
>> parameter to the sort when the s_i are equal. That would have to be
>> related to the number of constraints. You could sort in reverse order
>> of the number of constraints, assuming you can express the constraint
>> as a number. For simple constraints, such as counter restrictions, you
>> could simply compare the number of possible counters between events.
>> But there are PMUs where there are no counter constraints and yet
>> events are incompatible. What values do you compare in this case?
>
> Not sure, but yeah, using constraint data to tie break is indeed an
> interesting option.
>
> I wonder how much tie-breaking we'll really need in practice though: if
> we use event->total_time_running as our s_i we've got ns-resolution
> timestamps, and with sub-jiffies preemption being common I doubt we'll
> actually see a lot of equal service numbers.
>
I suspect you are right. I would not worry about this now. This can be
fixed later if it turns out to be problematic in corner cases.


* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-10  9:41                 ` Stephane Eranian
@ 2010-05-14 14:55                   ` Peter Zijlstra
  2010-05-14 15:07                     ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-14 14:55 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Mon, 2010-05-10 at 11:41 +0200, Stephane Eranian wrote:
> Looks like a good solution, at least better than what is there right now.

Something like the below, I guess; still need to make it an actual patch
though ;-)

The interesting bit is the schedule() function, which tries to schedule
two queues fairly (cpu context and task context) and selects between
the two based on lag -- I'm not quite sure that works out as
expected though; still have to do the actual math.

Also, I guess we should come up with a saner solution for the
per-task-per-cpu events (the unservicable stuff).

---


struct flexible_queue {
	struct rb_root waiting;
	struct rb_node *rb_leftmost;

	u64 min_service;
	u64 rel_avg;

	u64 clock;
	u64 running_stamp;

	struct list_head running;
	struct list_head unservicable;
	int unservicable_cpu;
	int nr_waiting;
};

enum flexible_type {
	flexible_waiting,
	flexible_running,
	flexible_unservicable,
};

struct flexible_event {
	union {
		struct rb_node rb_entry;
		struct list_head list_entry;
	};
	enum flexible_type type;

	u64 service;
};

static inline
s64 flexible_event_rel(struct flexible_queue *fq, struct flexible_event *fe)
{
	return fe->service - fq->min_service;
}

static void __enqueue_event(struct flexible_queue *fq, struct flexible_event *fe)
{
	struct rb_node **link = &fq->waiting.rb_node;
	struct rb_node *parent = NULL;
	struct flexible_event *entry;
	s64 rel = flexible_event_rel(fq, fe);
	int leftmost = 1;

	if (rel > 0) {
		/*
		 * Advance the monotonic reference and adjust the
		 * relative sum in the opposite direction.
		 */
		fq->min_service += rel;
		fq->rel_avg -= fq->nr_waiting * rel;
		rel = 0;
	}

	fq->nr_waiting++;
	fq->rel_avg += rel;

	fe->type = flexible_waiting;

	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct flexible_event, rb_entry);
		/*
		 * We don't care about collisions. Nodes with
		 * the same rel stay together.
		 */
		if (rel < flexible_event_rel(fq, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = 0;
		}
	}

	if (leftmost)
		fq->rb_leftmost = &fe->rb_entry;

	rb_link_node(&fe->rb_entry, parent, link);
	rb_insert_color(&fe->rb_entry, &fq->waiting);
}

static void __dequeue_event(struct flexible_queue *fq, struct flexible_event *fe)
{
	if (fq->rb_leftmost == &fe->rb_entry) {
		struct rb_node *next_node;

		next_node = rb_next(&fe->rb_entry);
		fq->rb_leftmost = next_node;
	}

	rb_erase(&fe->rb_entry, &fq->waiting);

	fq->nr_waiting--;
	fq->rel_avg -= flexible_event_rel(fq, fe);
}

static void
flexible_event_add(struct flexible_queue *fq, struct flexible_event *fe)
{
	/* start new events at the current average service, i.e. zero lag */
	if (fq->nr_waiting)
		fe->service = fq->min_service + fq->rel_avg / fq->nr_waiting;
	else
		fe->service = fq->min_service;
	__enqueue_event(fq, fe);
}

static void
flexible_event_del(struct flexible_queue *fq, struct flexible_event *fe)
{
	switch (fe->type) {
	case flexible_waiting:
		__dequeue_event(fq, fe);
		break;

	case flexible_running:
	case flexible_unservicable:
		list_del(&fe->list_entry);
		break;
	}
}

static void flexible_unschedule(struct flexible_queue *fq)
{
	struct flexible_event *fe, *tmp;
	s64 delta = (s64)(fq->clock - fq->running_stamp);

	list_for_each_entry_safe(fe, tmp, &fq->running, list_entry) {
		list_del(&fe->list_entry);
		fe->service += delta;
		__enqueue_event(fq, fe);
	}
}

static
s64 flexible_event_lag(struct flexible_queue *fq, struct flexible_event *fe)
{
	return fq->min_service + (fq->rel_avg / fq->nr_waiting) - fe->service;
}

static struct flexible_event *__pick_event(struct flexible_queue *fq)
{
	struct flexible_event *fe;

	if (!fq->rb_leftmost)
		return NULL;

	fe = rb_entry(fq->rb_leftmost, struct flexible_event, rb_entry);

	return fe;
}

static
int flexible_schedule(struct flexible_queue *fq1, struct flexible_queue *fq2)
{
	struct flexible_queue *fqs[2] = { fq1, fq2 };
	struct flexible_event *tmp, *fe;
	struct flexible_queue *fq;
	s64 tmp_lag, max_lag;
	int i;

	/*
	 * Events that were unservicable on another CPU may be
	 * servicable here; put them back on their queues.
	 */
	for (i = 0; i < 2; i++) {
		fq = fqs[i];
		if (fq->unservicable_cpu == smp_processor_id())
			continue;

		list_for_each_entry_safe(fe, tmp, &fq->unservicable, list_entry) {
			list_del(&fe->list_entry);
			flexible_event_add(fq, fe);
		}

		fq->unservicable_cpu = smp_processor_id();
	}

again:
	max_lag = 0;
	fe = NULL;

	tmp = __pick_event(fq1);
	if (tmp) {
		tmp_lag = flexible_event_lag(fq1, tmp);
		if (tmp_lag > max_lag) {
			fq = fq1;
			fe = tmp;
			max_lag = tmp_lag;
		}
	}

	tmp = __pick_event(fq2);
	if (tmp) {
		tmp_lag = flexible_event_lag(fq2, tmp);
		if (tmp_lag > max_lag) {
			fq = fq2;
			fe = tmp;
			max_lag = tmp_lag;
		}
	}

	if (!fe)
		return 1; /* no more events to schedule */

	fq->running_stamp = fq->clock;

	if (event_of(fe)->cpu != -1 && event_of(fe)->cpu != smp_processor_id()) {
		__dequeue_event(fq, fe);
		fe->type = flexible_unservicable;
		list_add(&fe->list_entry, &fq->unservicable);
		goto again; /* can't run this event, try another */
	}

	if (group_sched_in(event_of(fe), ...) == 0) {
		__dequeue_event(fq, fe);
		fe->type = flexible_running;
		list_add(&fe->list_entry, &fq->running);
		return 0; /* success */
	}

	return -EBUSY; /* doesn't fit */
}




* Re: [RFC] perf_events: ctx_flexible_sched_in() not maximizing PMU utilization
  2010-05-14 14:55                   ` Peter Zijlstra
@ 2010-05-14 15:07                     ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2010-05-14 15:07 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: Frederic Weisbecker, LKML, mingo, Paul Mackerras, David S. Miller

On Fri, 2010-05-14 at 16:55 +0200, Peter Zijlstra wrote:
> and selects between
> the two based on lag -- I'm not quite sure that that works out as
> expected though, still have to do the actual math.

Having had a bit of a think, it doesn't sound right. I'll have to come up
with something else instead.


