* About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-10-23 13:34 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, Jan Beulich, xen-devel
Hi George, Everyone,
While reworking my NUMA-aware scheduling patches a bit, I realized I'm not
sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of
course) does.
Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq
and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's
call the vcpu that just woke up v_W, and the vcpu that is currently
running on the cpu where that happens v_C. Let's also call the CPU where
all is happening P.
As far as I've understood, in __runq_tickle(), we:
static inline void
__runq_tickle(unsigned int cpu, struct csched_vcpu *new)
{
    [...]
    cpumask_t mask;

    cpumask_clear(&mask);

    /* If strictly higher priority than current VCPU, signal the CPU */
    if ( new->pri > cur->pri )
    {
        [...]
        cpumask_set_cpu(cpu, &mask);
    }

--> Make sure we put the CPU we are on (P) in 'mask', in case the woken
--> vcpu (v_W) has higher priority than the currently running one (v_C).

    /*
     * If this CPU has at least two runnable VCPUs, we tickle any idlers to
     * let them know there is runnable work in the system...
     */
    if ( cur->pri > CSCHED_PRI_IDLE )
    {
        if ( cpumask_empty(prv->idlers) )
            [...]
        else
        {
            cpumask_t idle_mask;

            cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
            if ( !cpumask_empty(&idle_mask) )
            {
                [...]
                if ( opt_tickle_one_idle )
                {
                    [...]
                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
                }
                else
                    cpumask_or(&mask, &mask, &idle_mask);
            }
            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);

--> Make sure we include one or more (depending on opt_tickle_one_idle)
--> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.

        }
    }

    /* Send scheduler interrupts to designated CPUs */
    if ( !cpumask_empty(&mask) )
        cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);

--> Ask all the CPUs in 'mask' to reschedule. That would mean all the
--> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
--> effect will be that all/some of the CPUs v_W has affinity with
--> _and_ (let's assume so) P will go through scheduling as quickly as
--> possible.

}
Is the above right?
If yes, here's my question: is it right to always tickle v_W's affine
CPUs, and only them?
I'm asking because a possible scenario, as far as I can tell, is that P
schedules very quickly after this and, as prio(v_W)>prio(v_C), it selects
v_W and leaves v_C in its runq. At that point, one of the tickled CPUs
(say P') enters schedule, sees that P is not idle, and tries to steal a
vcpu from its runq. Now, we know that P' has affinity with v_W, but v_W is
not there, while v_C is, and if P' is not in v_C's affinity, we've forced
P' to reschedule for nothing.
Also, there might now be another CPU (or even a number of them) where v_C
could run that stays idle, as it has not been tickled.
So, if that is true, it seems we leave some room for sub-optimal CPU
utilization, as well as some non-work conserving windows.
Of course, it is very hard to tell how frequently this actually happens.
As for possible solutions, I think that, for instance, tickling all the
CPUs in both v_W's and v_C's affinity masks could solve this, but that
would also potentially increase the overhead (by asking _a_lot_ of CPUs
to reschedule), and again, it's hard to say if/when it's worth it...
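Just to make that idea concrete, here is a minimal sketch of what the
second block of __runq_tickle() could do instead (reusing only the fields
and cpumask helpers already shown above; whether the extra wakeups pay off
is exactly what I'm unsure about):

    cpumask_t both_affinity;

    /*
     * Sketch only: consider idlers able to run either v_W (new) or
     * v_C (cur), so that whichever of the two ends up waiting in the
     * runq can be picked up by someone.
     */
    cpumask_or(&both_affinity, new->vcpu->cpu_affinity,
               cur->vcpu->cpu_affinity);
    cpumask_and(&idle_mask, prv->idlers, &both_affinity);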
Actually, going all the way around, i.e., tickling only the CPUs that have
affinity with v_C (in this case), looks more reasonable, under the
assumption that v_W is going to be scheduled on P soon enough. In general,
that would mean tickling the CPUs in the affinity mask of the vcpu with
the lower priority, but I've not yet checked how that would interact with
the rest of the scheduling logic.
If I got things wrong and/or there's something I missed or overlooked,
please, accept my apologies. :-)
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: About vcpu wakeup and runq tickling in credit
From: George Dunlap @ 2012-10-23 15:16 UTC (permalink / raw)
To: Dario Faggioli; +Cc: Keir Fraser, Jan Beulich, xen-devel
On 23/10/12 14:34, Dario Faggioli wrote:
> Hi George, Everyone,
>
> While reworking a bit my NUMA aware scheduling patches I figured I'm not
> sure I understand what __runq_tickle() (in xen/common/sched_credit.c, of
> course) does.
>
> Here's the thing. Upon every vcpu wakeup we put the new vcpu in a runq
> and then call __runq_tickle(), passing the waking vcpu via 'new'. Let's
> call the vcpu that just woke up v_W, and the vcpu that is currently
> running on the cpu where that happens v_C. Let's also call the CPU where
> all is happening P.
>
> As far as I've understood, in __runq_tickle(), we:
>
>
> static inline void
> __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
> {
> [...]
> cpumask_t mask;
>
> cpumask_clear(&mask);
>
> /* If strictly higher priority than current VCPU, signal the CPU */
> if ( new->pri > cur->pri )
> {
> [...]
> cpumask_set_cpu(cpu, &mask);
> }
>
> --> Make sure we put the CPU we are on (P) in 'mask', in case the woken
> --> vcpu (v_W) has higher priority that the currently running one (v_C).
>
> /*
> * If this CPU has at least two runnable VCPUs, we tickle any idlers to
> * let them know there is runnable work in the system...
> */
> if ( cur->pri > CSCHED_PRI_IDLE )
> {
> if ( cpumask_empty(prv->idlers) )
> [...]
> else
> {
> cpumask_t idle_mask;
>
> cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> if ( !cpumask_empty(&idle_mask) )
> {
> [...]
> if ( opt_tickle_one_idle )
> {
> [...]
> cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> }
> else
> cpumask_or(&mask, &mask, &idle_mask);
> }
> cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
>
> --> Make sure we include one or more (depending on opt_tickle_one_idle)
> --> CPUs that are both idle and part of v_W's CPU-affinity in 'mask'.
>
> }
> }
>
> /* Send scheduler interrupts to designated CPUs */
> if ( !cpumask_empty(&mask) )
> cpumask_raise_softirq(&mask, SCHEDULE_SOFTIRQ);
>
> --> Ask all the CPUs in 'mask' to reschedule. That would mean all the
> --> idlers from v_W's CPU-affinity and, possibly, "ourself" (P). The
> --> effect will be that all/some of the CPUs v_W's has affinity with
> --> _and_ (let's assume so) P will go through scheduling as quickly as
> --> possible.
>
> }
>
> Is the above right?
It looks right to me.
> If yes, here's my question. Is that right to always tickle v_W's affine
> CPUs and only them?
>
> I'm asking because a possible scenario, at least according to me, is
> that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it
> selects v_W and leaves v_C in its runq. At that point, one of the
> tickled CPU (say P') enters schedule, sees that P is not idle, and tries
> to steal a vcpu from its runq. Now we know that P' has affinity with
> v_W, but v_W is not there, while v_C is, and if P' is not in its
> affinity, we've forced P' to reschedule for nothing.
> Also, there now might be another (or even a number of) CPU where v_C
> could run that stays idle, as it has not being tickled.
Yes -- the two clauses look a bit like they were conceived
independently, and maybe no one thought about how they might interact.
> So, if that is true, it seems we leave some room for sub-optimal CPU
> utilization, as well as some non-work conserving windows.
> Of course, it is very hard to tell how frequent this actually happens.
>
> As it comes to possible solution, I think that, for instance, tickling
> all the CPUs in both v_W's and v_C's affinity masks could solve this,
> but that would also potentially increase the overhead (by asking _a_lot_
> of CPUs to reschedule), and again, it's hard to say if/when it's
> worth...
Well, in my code, opt_tickle_one_idle is on by default, which means only
one other cpu will be woken up. If there were an easy way to make it
wake up a CPU in v_C's affinity as well (supposing that there was no
overlap), that would probably be a win.
Of course, that's only necessary if:
* v_C is lower priority than v_W
* There are no idlers that intersect both v_C's and v_W's affinity masks.
It's probably a good idea though to try to set up a scenario where this
might be an issue and see how often it actually happens.
-George
* Re: About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-10-24 16:48 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, Jan Beulich, xen-devel
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > If yes, here's my question. Is that right to always tickle v_W's affine
> > CPUs and only them?
> >
> > I'm asking because a possible scenario, at least according to me, is
> > that P schedules very quickly after this and, as prio(v_W)>prio(v_C), it
> > selects v_W and leaves v_C in its runq. At that point, one of the
> > tickled CPU (say P') enters schedule, sees that P is not idle, and tries
> > to steal a vcpu from its runq. Now we know that P' has affinity with
> > v_W, but v_W is not there, while v_C is, and if P' is not in its
> > affinity, we've forced P' to reschedule for nothing.
> > Also, there now might be another (or even a number of) CPU where v_C
> > could run that stays idle, as it has not being tickled.
>
> Yes -- the two clauses look a bit like they were conceived
> independently, and maybe no one thought about how they might interact.
>
Yep, it looked the same to me.
> > As it comes to possible solution, I think that, for instance, tickling
> > all the CPUs in both v_W's and v_C's affinity masks could solve this,
> > but that would also potentially increase the overhead (by asking _a_lot_
> > of CPUs to reschedule), and again, it's hard to say if/when it's
> > worth...
>
> Well in my code, opt_tickle_idle_one is on by default, which means only
> one other cpu will be woken up. If there were an easy way to make it
> wake up a CPU in v_C's affinity as well (supposing that there was no
> overlap), that would probably be a win.
>
Yes, default is to tickle only 1 idler. However, as we offer that as a
command line option, I think we should consider what could happen if one
disables it.
I double-checked this on Linux and, mutatis mutandis, they sort of go
the way I was suggesting, i.e., "pinging" the CPUs with affinity to the
task that will likely stay in the runq rather than being picked up
locally. However, there are of course big differences between the two
schedulers, and different assumptions being made, so I'm not really
sure that is the best thing to do for us.
So, yes, it probably makes sense to think about something clever to try
to involve CPUs from both masks without causing too much overhead.
I'll put that in my TODO list. :-)
> Of course, that's only necessary if:
> * v_C is lower priority than v_W
> * There are no idlers that intersect both v_C and v_W's affinity mask.
>
Sure, I said that in the first place, and I don't think checking for
that is too hard... just a couple more bitmap ops. But again, I'll
give it some thought.
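Something like the following, for instance (just a sketch of the check,
reusing the structures from __runq_tickle(); names and exact placement
are made up):

    cpumask_t common_idlers;

    /*
     * Are there idlers able to run both v_C (cur) and v_W (new)?
     * If not, and new is preempting cur, it could be worth also
     * tickling an idler from cur's affinity.
     */
    cpumask_and(&common_idlers, cur->vcpu->cpu_affinity,
                new->vcpu->cpu_affinity);
    cpumask_and(&common_idlers, &common_idlers, prv->idlers);
    if ( new->pri > cur->pri && cpumask_empty(&common_idlers) )
    {
        /* ... pick an idler from cur's affinity too ... */
    }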
> It's probably a good idea though to try to set up a scenario where this
> might be an issue and see how often it actually happens.
>
Definitely. Before trying to "fix" it, I'm interested in finding out
what I'd actually be fixing. Will do that.
Thanks for taking time to check and answer this! :-)
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-11-15 12:10 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, Jan Beulich, xen-devel
On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
> > As it comes to possible solution, I think that, for instance, tickling
> > all the CPUs in both v_W's and v_C's affinity masks could solve this,
> > but that would also potentially increase the overhead (by asking _a_lot_
> > of CPUs to reschedule), and again, it's hard to say if/when it's
> > worth...
>
> Well in my code, opt_tickle_idle_one is on by default, which means only
> one other cpu will be woken up. If there were an easy way to make it
> wake up a CPU in v_C's affinity as well (supposing that there was no
> overlap), that would probably be a win.
>
> Of course, that's only necessary if:
> * v_C is lower priority than v_W
> * There are no idlers that intersect both v_C and v_W's affinity mask.
>
> It's probably a good idea though to try to set up a scenario where this
> might be an issue and see how often it actually happens.
>
Ok, I think I managed to reproduce this. Look at the following trace,
considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no
affinity at all (its vcpus can run everywhere):
166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
.
]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
.
166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
.
.
.
]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running
So, if I'm not reading the trace wrong, when d0v7 wakes up (very first
event) it preempts d51v1. Now, even though almost all of pcpus 8-15 are
idle, none of them gets tickled and comes to pick d51v1 up, which then has
to wait in the runq until d0v7 goes back to sleep.
I suspect this could be because, at d0v7 wakeup time, we try to tickle
some pcpu which is in d0v7's affinity but not in d51v1's (as in the
second 'if () {}' block in __runq_tickle() we only look at
new->vcpu->cpu_affinity, and in this case new is d0v7).
I know, looking at the timestamps it doesn't look like it is a big deal
in this case, and I'm still working on producing numbers that can better
show whether or not this is a real problem.
Anyway, and independently of the results of these tests, why do I care
so much?
Well, if you substitute the concept of "vcpu-affinity" with
"node-affinity" above (which is what I am doing in my NUMA-aware
scheduling patches), you'll see why this is bothering me quite a bit. In
fact, in that case, waking up a random pcpu with which d0v7 has
node-affinity, but d51v1 has not, would cause d51v1 to be pulled onto
that cpu (since node-affinity is only a preference)!
So, in the vcpu-affinity case, if pcpu 3 gets tickled, when it peeks at
pcpu 13's runq for work to steal it does not find anything suitable and
gives up, leaving d51v1 in the runq even though there are idle pcpus on
which it could run, which is already bad.
In the node-affinity case, pcpu 3 will actually manage to steal d51v1
and run it, even though there are idle pcpus with which d51v1 has
node-affinity, thus defeating most of the benefits of the whole NUMA-aware
scheduling thing (at least for some workloads).
:-(
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: About vcpu wakeup and runq tickling in credit
From: George Dunlap @ 2012-11-15 12:18 UTC (permalink / raw)
To: Dario Faggioli; +Cc: Keir Fraser, Jan Beulich, xen-devel
On 15/11/12 12:10, Dario Faggioli wrote:
> On Tue, 2012-10-23 at 16:16 +0100, George Dunlap wrote:
>>> As it comes to possible solution, I think that, for instance, tickling
>>> all the CPUs in both v_W's and v_C's affinity masks could solve this,
>>> but that would also potentially increase the overhead (by asking _a_lot_
>>> of CPUs to reschedule), and again, it's hard to say if/when it's
>>> worth...
>> Well in my code, opt_tickle_idle_one is on by default, which means only
>> one other cpu will be woken up. If there were an easy way to make it
>> wake up a CPU in v_C's affinity as well (supposing that there was no
>> overlap), that would probably be a win.
>>
>> Of course, that's only necessary if:
>> * v_C is lower priority than v_W
>> * There are no idlers that intersect both v_C and v_W's affinity mask.
>>
>> It's probably a good idea though to try to set up a scenario where this
>> might be an issue and see how often it actually happens.
>>
> Ok, I think I managed in reproducing this. Look at the following trace,
> considering that d51 has vcpu-affinity with pcpus 8-15, while d0 has no
> affinity at all (its vcpus can run everywhere):
>
> 166.853945095 ---|-|-------x-| d51v1 runstate_change d0v7 blocked->runnable
> ]166.853945884 ---|-|-------x-| d51v1 28004(2:8:4) 2 [ 0 7 ]
> .
> ]166.853986385 ---|-|-------x-| d51v1 2800e(2:8:e) 2 [ 33 4bf97be ]
> ]166.853986522 ---|-|-------x-| d51v1 2800f(2:8:f) 3 [ 0 a050 1c9c380 ]
> ]166.853986636 ---|-|-------x-| d51v1 2800a(2:8:a) 4 [ 33 1 0 7 ]
> .
> 166.853986775 ---|-|-------x-| d51v1 runstate_change d51v1 running->runnable
> 166.853986905 ---|-|-------x-| d?v? runstate_change d0v7 runnable->running
> .
> .
> .
> ]166.854195353 ---|-|-------x-| d0v7 28006(2:8:6) 2 [ 0 7 ]
> ]166.854196484 ---|-|-------x-| d0v7 2800e(2:8:e) 2 [ 0 33530 ]
> ]166.854196584 ---|-|-------x-| d0v7 2800f(2:8:f) 3 [ 33 33530 1c9c380 ]
> ]166.854196691 ---|-|-------x-| d0v7 2800a(2:8:a) 4 [ 0 7 33 1 ]
> 166.854196809 ---|-|-------x-| d0v7 runstate_change d0v7 running->blocked
> 166.854197175 ---|-|-------x-| d?v? runstate_change d51v1 runnable->running
>
> So, if I'm not reading the trace wrong, when d0v7 wakes up (very first
> event) it preempts d51v1. Now, even if almost all pcpus 8-15 are idle,
> none of them get tickled and comes to pick d51v1 up, which has then to
> wait in the runq until d0v7 goes back to sleep.
>
> I suspect this could be because, at d0v7 wakeup time, we try to tickle
> some pcpu which is in d0v7's affinity, but not in d51v1's one (as in the
> second 'if() {}' block in __runq_tickle() we only care about
> new->vcpu->cpu_affinity, and in this case, new is d0v7).
>
> I know, looking at the timestamps it doesn't look like it is a big deal
> in this case, and I'm still working on producing numbers that can better
> show whether or not this is a real problem.
>
> Anyway, and independently from the results of these tests, why do I care
> so much?
>
> Well, if you substitute the concept of "vcpu-affinity" with
> "node-affinity" above (which is what I am doing in my NUMA aware
> scheduling patches) you'll see why this is bothering me quite a bit. In
> fact, in that case, waking up a random pcpu with which d0v7 has
> node-affinity with, while d51v1 has not, would cause d51v1 being pulled
> by that cpu (since node-affinity is only preference)!
>
> So, in the vcpu-affinity case, if pcpu 3 get tickled, when it peeks at
> pcpu 13's runq for work to steal it does not find anything suitable and
> give up, leaving d51v1 in the runq even if there are idle pcpus on which
> it could run, which is already bad.
> In the node-affinity case, pcpu 3 will actually manage in stealing d51v1
> and running it, even if there are idle pcpus with which it has
> node-affinity, and thus defeating most of the benefits of the whole NUMA
> aware scheduling thing (at least for some workloads).
Maybe what we should do is do the wake-up based on who is likely to run
on the current cpu: i.e., if "current" is likely to be pre-empted, look
at idlers based on "current"'s mask; if "new" is likely to be put on the
queue, look at idlers based on "new"'s mask.
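In terms of the excerpt above, that would roughly mean choosing the mask
that idle_mask is built from based on the outcome of the priority
comparison; something like this (sketch only):

    if ( new->pri > cur->pri )
        /* "new" preempts "current" here, so look for a home for "current" */
        cpumask_and(&idle_mask, prv->idlers, cur->vcpu->cpu_affinity);
    else
        /* "new" goes to the runqueue, so look for a home for "new" */
        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);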
What do you think?
-George
* Re: About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-11-15 15:50 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, Jan Beulich, xen-devel
On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > So, in the vcpu-affinity case, if pcpu 3 get tickled, when it peeks at
> > pcpu 13's runq for work to steal it does not find anything suitable and
> > give up, leaving d51v1 in the runq even if there are idle pcpus on which
> > it could run, which is already bad.
> > In the node-affinity case, pcpu 3 will actually manage in stealing d51v1
> > and running it, even if there are idle pcpus with which it has
> > node-affinity, and thus defeating most of the benefits of the whole NUMA
> > aware scheduling thing (at least for some workloads).
>
> Maybe what we should do is do the wake-up based on who is likely to run
> on the current cpu: i.e., if "current" is likely to be pre-empted, look
> at idlers based on "current"'s mask; if "new" is likely to be put on the
> queue, look at idlers based on "new"'s mask.
>
EhEh, if you check the whole thread, you'll find evidence that I
thought this was a good idea from the very beginning. I already have a
patch for that; just let me see whether the numbers (with and without NUMA
scheduling) are aligned with my impressions, and then I'll send everything
together.
Thanks for your time,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-11-16 10:53 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, David Vrabel, Jan Beulich, xen-devel
(Cc-ing David, as it looks like he uses xenalyze quite a bit, and I'm
seeking any advice on how to squeeze data out of it too :-P)
On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> Maybe what we should do is do the wake-up based on who is likely to run
> on the current cpu: i.e., if "current" is likely to be pre-empted, look
> at idlers based on "current"'s mask; if "new" is likely to be put on the
> queue, look at idlers based on "new"'s mask.
>
Ok, find attached the two (trivial) patches that I produced and am
testing these days. Unfortunately, early results show that I/we
might be missing something.
In fact, although I still don't have the numbers for the NUMA-aware
scheduling case (which is what originated all this! :-D), comparing
'upstream' and 'patched' (namely, 'upstream' plus the two attached
patches) I can spot some perf regressions. :-(
Here are the results of running some benchmarks on 2, 6 and 10 VMs. Each
VM has 2 VCPUs, and they run and execute the benchmarks concurrently on a
16-CPU host. (Each test is repeated 3 times, and avg +/- stddev is what
is reported.)
Also, the VCPUs were statically pinned to the host's PCPUs. As already
said, numbers for no-pinning and NUMA-scheduling will follow.
+ sysbench --test=memory (throughput, higher is better)
#VMs | upstream | patched
2 | 550.97667 +/- 2.3512355 | 540.185 +/- 21.416892
6 | 443.15 +/- 5.7471797 | 442.66389 +/- 2.1071732
10 | 313.89233 +/- 1.3237493 | 305.69567 +/- 0.3279853
+ sysbench --test=cpu (time, lower is better)
#VMs | upstream | patched
2 | 47.8211 +/- 0.0215503 | 47.816117 +/- 0.0174079
6 | 62.689122 +/- 0.0877172 | 62.789883 +/- 0.1892171
10 | 90.321097 +/- 1.4803867 | 91.197767 +/- 0.1032667
+ specjbb2005 (throughput, higher is better)
#VMs | upstream | patched
2 | 49591.057 +/- 952.93384 | 50008.28 +/- 1502.4863
6 | 33538.247 +/- 1089.2115 | 33647.873 +/- 1007.3538
10 | 21927.87 +/- 831.88742 | 21869.654 +/- 578.236
So, as you can easily see, the numbers are very similar, with cases
where the patches produce some slight performance reduction, while I
was expecting the opposite, i.e., similar but a little bit better with
the patches.
For most of the runs of all the benchmarks, I have the full traces
(although only for SCHED-* events, IIRC), so I can investigate more.
It's a huge amount of data, so it's really hard to make sense of
it, and any advice and direction on that would be much appreciated.
For instance, looking at one of the runs of sysbench-memory, here's what
I found. With 10 VMs, the memory throughput reported by one of the VMs
during one of the runs is as follows:
upstream: 315.68 MB/s
patched: 306.69 MB/s
I then went through the traces and I found out that the patched case
lasted longer (for transferring the same amount of memory, hence the
lower throughput), but with the following runstate-related results:
upstream: running for 73.67% of the time
runnable for 24.94% of the time
patched: running for 74.57% of the time
runnable for 24.10% of the time
And that is consistent with other random instances I checked. So, it
looks like the patches are, after all, doing their job in increasing (at
least a little) the running time of the various VCPUs, at the expense of
their runnable time, but the benefit of that is all being eaten up by
some other effect --to the point that sometimes things go even worse--
that I'm not able to identify... For now! :-P
Any idea about what's going on and what I should check to better figure
that out?
Thanks a lot and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
[-- Attachment #1.1.2: xen-sched_credit-clarify-cpumask-and-during-tickle.patch --]
[-- Type: text/x-patch, Size: 936 bytes --]
# HG changeset patch
# Parent b0c342b749765bf254c664883d4f5e2891c1ff18
diff -r b0c342b74976 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Fri Nov 09 11:02:54 2012 +0100
+++ b/xen/common/sched_credit.c Thu Nov 15 18:22:56 2012 +0100
@@ -254,7 +254,11 @@ static inline void
ASSERT(cur);
cpumask_clear(&mask);
- /* If strictly higher priority than current VCPU, signal the CPU */
+ /*
+ * If new is strictly higher priority than current VCPU, let CPU
+ * know that re-scheduling is needed. That will likely pick-up new
+ * and put cur back in the runqueue.
+ */
if ( new->pri > cur->pri )
{
if ( cur->pri == CSCHED_PRI_IDLE )
@@ -296,7 +300,6 @@ static inline void
else
cpumask_or(&mask, &mask, &idle_mask);
}
- cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
}
}
[-- Attachment #1.1.3: xen-sched_credit-fix-tickling --]
[-- Type: text/plain, Size: 1435 bytes --]
# HG changeset patch
# Parent 3a70bd1d02c1334857c84c9fb5e1dd22b6603a2c
diff -r 3a70bd1d02c1 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Thu Nov 15 18:22:56 2012 +0100
+++ b/xen/common/sched_credit.c Thu Nov 15 19:03:19 2012 +0100
@@ -274,7 +274,7 @@ static inline void
}
/*
- * If this CPU has at least two runnable VCPUs, we tickle any idlers to
+ * If this CPU has at least two runnable VCPUs, we tickle some idlers to
* let them know there is runnable work in the system...
*/
if ( cur->pri > CSCHED_PRI_IDLE )
@@ -287,7 +287,17 @@ static inline void
{
cpumask_t idle_mask;
- cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+ /*
+ * Which idlers do we want to tickle? If new has higher priority,
+ * it will likely preempt cur and run here. We then need someone
+ * where cur can run to come and pick it up. Vice-versa, if it is
+ * cur that stays, we poke idlers where new can run.
+ */
+ if ( new->pri > cur->pri )
+ cpumask_and(&idle_mask, prv->idlers, cur->vcpu->cpu_affinity);
+ else
+ cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+
if ( !cpumask_empty(&idle_mask) )
{
SCHED_STAT_CRANK(tickle_idlers_some);
* Re: About vcpu wakeup and runq tickling in credit
From: Dario Faggioli @ 2012-11-16 12:00 UTC (permalink / raw)
To: George Dunlap; +Cc: Keir Fraser, David Vrabel, Jan Beulich, xen-devel
On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
> > Maybe what we should do is do the wake-up based on who is likely to run
> > on the current cpu: i.e., if "current" is likely to be pre-empted, look
> > at idlers based on "current"'s mask; if "new" is likely to be put on the
> > queue, look at idlers based on "new"'s mask.
> >
> Ok, find attached the two (trivial) patches that I produced and am
> testing in these days. Unfortunately, early results shows that I/we
> might be missing something.
>
I've just come to think that this approach, although more, say,
correct, could have a bad impact on caches and locality in general.
In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C
is running, with prio(N)>prio(C).
What upstream does is ask #x and all the idlers that can execute N to
reschedule. Doing both is, I think, wrong, as there's a chance of ending
up with N scheduled on #x and C runnable but not running (sitting in #x's
runqueue) even though there are idle cpus that could run it, since they're
not poked (as already and repeatedly said).
What the patches do in this case (remember, prio(N)>prio(C)) is ask
#x and all the idlers that can run C to reschedule, the effect being
that N will likely run on #x, after a context switch, and C will run
somewhere else, after a migration, potentially wasting its cache-hotness
(it is running, after all!).
It looks like we can do better... Something like the below:
+ if there are no idlers where N can run, ask #x and the idlers where
  C can run to reschedule (exactly what the patches do, although they
  do that _unconditionally_), as there isn't anything else we can do
  to try to make sure they both run;
+ if *there*are* idlers where N can run, _do_not_ ask #x to reschedule
  and only poke them to come and pick N up. In fact, in this case, it is
  not necessary to send C away to have both vcpus running, and it
  seems better to have N experience the migration as, since it is
  waking up, it is more likely than C to be cache-cold.
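In code, and ignoring opt_tickle_one_idle for brevity, the second block
of __runq_tickle() would then become something like this (untested, just
to illustrate the idea):

    cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
    if ( !cpumask_empty(&idle_mask) )
    {
        /* There are idlers where N can run: poke (some of) them only,
         * and do not ask #x (this pcpu) to reschedule. */
        cpumask_or(&mask, &mask, &idle_mask);
        cpumask_clear_cpu(cpu, &mask);
    }
    else
    {
        /* No idlers for N: ask #x and the idlers where C can run. */
        cpumask_set_cpu(cpu, &mask);
        cpumask_and(&idle_mask, prv->idlers, cur->vcpu->cpu_affinity);
        cpumask_or(&mask, &mask, &idle_mask);
    }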
I'll run the benchmarks with this variant as soon as the one I'm
running right now finishes... In the meantime, any thoughts?
Thanks and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: About vcpu wakeup and runq tickling in credit
From: George Dunlap @ 2012-11-16 15:44 UTC (permalink / raw)
To: Dario Faggioli; +Cc: Keir Fraser, David Vrabel, Jan Beulich, xen-devel
On 16/11/12 12:00, Dario Faggioli wrote:
> On Fri, 2012-11-16 at 11:53 +0100, Dario Faggioli wrote:
>> On Thu, 2012-11-15 at 12:18 +0000, George Dunlap wrote:
>>> Maybe what we should do is do the wake-up based on who is likely to run
>>> on the current cpu: i.e., if "current" is likely to be pre-empted, look
>>> at idlers based on "current"'s mask; if "new" is likely to be put on the
>>> queue, look at idlers based on "new"'s mask.
>>>
>> Ok, find attached the two (trivial) patches that I produced and am
>> testing in these days. Unfortunately, early results shows that I/we
>> might be missing something.
>>
> I'm just came to thinking that this approach, although more, say,
> correct, could have a bad impact on caches and locality in general.
One thing that xenalyze will already tell you is statistics on how a
vcpu migrates over pcpus. For example:
cpu affinity: 242 7009916158 {621089444|5643356292|19752063006}
[0]: 15 6940230676 {400952|5643531152|27013831272}
[1]: 19 6366861827 {117462|5031404806|19751998114}
[2]: 31 6888557514 {1410800684|5643015454|19752100009}
[3]: 18 7790887470 {109764|5920027975|25395539566}
...
The general format is: "$number $average_cycles {5th percentile|50th
percentile|95th percentile}". The first line includes samples from
*all* cpus (i.e., it migrated a total of 242 times, averaging 7
billion cycles each time); the subsequent numbers show statistics on
specific pcpus (i.e., it had 15 sessions on pcpu 0, averaging 6.94
billion cycles, &c).
You should be able to use this to do a basic verification of your
hypothesis that vcpus are migrating more often.
> In fact, suppose a new vcpu N wakes up on pcpu #x where another vcpu C
> is running, with prio(N)>prio(C).
>
> What upstream does is asking to #x and to all the idlers that can
> execute N to reschedule. Doing both is, I think, wrong, as there's the
> chance of ending up with N being scheduled on #x and C being runnable
> but not running (in #x's runqueue) even if there are idle cpus that
> could run it, as they're not poked (as already and repeatedly said).
>
> What the patches do, in this case (remember (prio(N)>prio(C)), is asking
> #x and all the idlers that can run C to reschedule, the effect being
> that N will likely run on #x, after a context switch, and C will run
> somewhere else, after a migration, potentially wasting its cache-hotness
> (it is running after all!).
>
> It looks like we can do better... Something like the below:
> + if there are no idlers where N can run, ask #x and the idlers where
> C can run to reschedule (exactly what the patches do, although, they
> do that _unconditionally_), as there isn't anything else we can do
> to try to make sure they both will run;
> + if *there*are* idlers where N can run, _do_not_ ask #x to reschedule
> and only poke them to come pick N up. In fact, in this case, it is
> not necessary to send C away for having both the vcpus ruunning, and
> it seems better to have N experience the migration as, since it's
> waking-up, it's more likely for him than for C to be cache-cold.
I think that makes a lot of sense -- look forward to seeing the results. :-)
There may be some other tricks we could look at. For example, if N and
C are both going to do a significant chunk of computation, then this
strategy will work best. But suppose that C does a significant junk of
computation, but N is only going to run for a few hundred microseconds
and then go to sleep again? In that case, it may be easier to just run
N on the current processor and not bother with IPIs and such; C will run
again in a few microseconds. Conversely, if N will do a significant
chunk of work but C is fairly short, we might as well let C continue
running, as N will shortly get to run.
How do we know whether the next time this vcpu runs will be long or
short? We could try tracking the runtimes of the last N times (maybe 3
or 5) this vcpu was scheduled, and using that as a prediction.
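Something like this, maybe (completely hypothetical: none of these fields
or helpers exist in sched_credit.c today, it's just to sketch the
bookkeeping):

    #define CSCHED_RUNTIME_SAMPLES 3

    struct csched_runtime_hist {
        s_time_t sample[CSCHED_RUNTIME_SAMPLES];  /* last few runtimes */
        unsigned int next;                        /* ring buffer index */
    };

    /* Record the length of the run that just ended (call on deschedule). */
    static inline void
    runtime_hist_update(struct csched_runtime_hist *h, s_time_t ran)
    {
        h->sample[h->next] = ran;
        h->next = (h->next + 1) % CSCHED_RUNTIME_SAMPLES;
    }

    /* Guess the next runtime as the mean of the recorded samples. */
    static inline s_time_t
    runtime_hist_predict(const struct csched_runtime_hist *h)
    {
        s_time_t sum = 0;
        unsigned int i;

        for ( i = 0; i < CSCHED_RUNTIME_SAMPLES; i++ )
            sum += h->sample[i];
        return sum / CSCHED_RUNTIME_SAMPLES;
    }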
Do you have traces for any of those runs you did? I might just take a
look at them and see if I can make an analysis of cache "temperature"
wrt scheduling. :-)
-George