* RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
@ 2014-04-23 21:28 Konrad Rzeszutek Wilk
2014-04-24 7:58 ` Jan Beulich
2014-04-29 9:16 ` George Dunlap
From: Konrad Rzeszutek Wilk @ 2014-04-23 21:28 UTC (permalink / raw)
To: xen-devel, george.dunlap, dario.faggioli
[-- Attachment #1: Type: text/plain, Size: 5075 bytes --]
Hey George and Dario,
I am looking at a particular issue for which I would appreciate
your expert knowledge of the scheduler's inner workings.
I have two questions whose answers I hope will steer me in the right
debugging direction. But before I go that route, let me explain
the scenario.
There are seven guests (eight if you include the initial domain);
each guest is partitioned to have exclusive usage of a range
of CPUs. That is, none of the guests (except the initial domain)
overlap with the others. This is enforced by having 'cpu=X-Y'
in the guest config.
As such the runqueue for each physical CPU contains the
idle domain, the guest vcpu, and then occasionally the initial
domain.
What we are observing is that if a domain is idle, its steal
time* goes up. My first thought was - well, that is the initial
domain taking the time - but after looking at the trace I
did not see the initial domain being scheduled at all.
(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
The issue I see is that the guest ends up hitting
'HLT' and the scheduler picks the idle domain. It
kicks it off and ....
Question 1: Following the code path, schedule_tail
for the idle domain would call idle_loop.
How do we get from idle_loop to vcpu_wake?
Is that because the HPET (on another CPU)
has raised the softirq(TIMER_SOFTIRQ) because the
timer has expired?
It certainly looks like it could happen:
evt_do_broadcast (on some other CPU), sets TIMER_SOFTIRQ
idle_loop
->do_softirq
timer_softirq_action
vcpu_singleshot_timer_fn
vcpu_periodic_timer_work
send_timer_event->
send_guest_vcpu_virq->
evtchn_set_pending->
vcpu_mark_events_pending->
vcpu_kick->
vcpu_unblock->
vcpu_wake
And then on the next iteration, somebody has to
set the SCHEDULE_SOFTIRQ??
Anyhow,
.. and ends up calling vcpu_wake. vcpu_wake
changes the guest's runstate from RUNSTATE_blocked
to RUNSTATE_runnable. Then we call schedule().
Question 2:
Who would trigger the SCHEDULE_SOFTIRQ for that?
I was initially thinking it was 'do_block'. But that,
I think, triggers the first call to 'schedule', which
sets the idle domain to run. Help? It could be
'vcpu_kick', but 'v->running=0' (done by schedule->context_saved).
Help!? Who could it be?
Then 'schedule' is called where the 'prev' is the idle
domain and 'next' is the guest. However, because 'next' got
labelled as 'runstate_RUNNABLE' we account _all of the time
that the idle domain had been running as belonging to the guest_.
That is what I think is happening, and it certainly explains
why the guest has such a large steal time. It looks like a bug
to me, but perhaps that is how it is supposed to be accounted?
Here is the trace (I am also attaching a patch, as the 'formats'
file had a bug in it).
Upon closer inspection we have something like this:
(+ 1104) do_yield [ domid = 0x00000005, edomid = 0x00000004 ]
(+ 353598) continue_running [ dom:vcpu = 0x00050004 ]
(+ 41577) VMENTRY
(+ 77715) VMEXIT [ exitcode = 0x0000000c, rIP = 0xffffffff810402ea ]
(+ 47451) do_block [ domid = 0x00000005, vcpuid = 0x00000004 ]
(+ 0) HLT [ intpending = 0 ]
(+ 190338) switch_infprev [ old_domid = 0x00000005, runtime = 24803749 ]
(+ 264) switch_infnext [ new_domid = 0x00007fff, time = 24803749, r_time = 4294967295 ]
(+ 309) __enter_scheduler [ prev<domid:vcpu> = 0x00000005 : 0x00000004, next<domid:vcpu> = 0x00007fff : 0x00000081 ]
(+ 318) running_to_blocked [ dom:vcpu = 0x00050004 ]
[VM 05 is now in runstate_blocked]
(+ 30516) runnable_to_running [ dom:vcpu = 0x7fff0081 ]
=> Here schedule_tail is called for the idle domain. <= presumably runs 'idle_loop'
... some time later...
vcpu_wake:
(+ 794301) blocked_to_runnable [ dom:vcpu = 0x00050004 ]
=> Moves VM05 from RUNSTATE_blocked to RUNSTATE_runnable.
(+ 88113) domain_wake [ domid = 0x00000005, vcpu = 0x00000004 ]
(+15749685) switch_infprev [ old_domid = 0x00007fff, runtime = 6863168 ]
(IDLE domain ran for [6.57msec] !! )
(+ 288) switch_infnext [ new_domid = 0x00000005, time = 6458163, r_time = 30000000 ]
And we switch to VM05:
TRACE_3D(TRC_SCHED_SWITCH_INFNEXT,
         next->domain->domain_id,                      /* VM05 */
         (next->runstate.state == RUNSTATE_runnable) ? /* YES - vcpu_wake changed it */
             (now - next->runstate.state_entry_time) : 0,
         next_slice.time);
And account the time that the idle domain ran to the VCPU of
the VM05.
(+ 27717) __enter_scheduler [ prev<domid:vcpu> = 0x00007fff : 0x00000081, next<domid:vcpu> = 0x00000005 : 0x00000004 ]
(+ 363) running_to_runnable [ dom:vcpu = 0x7fff0081 ]
(+ 63174) runnable_to_running [ dom:vcpu = 0x00050004 ]
(+ 0) INJ_VIRQ [ vector = 0xf3, fake = 0 ]
(+ 0) INTR_WINDOW [ value = 0x000000f3 ]
(+ 164763) VMENTRY
[Start VM05, injecting the callback vector. Presumably with
the VIRQ_TIMER event]
[-- Attachment #2: 0001-xentrace-formats-Fix-TRC_SCHED_-outputs.patch --]
[-- Type: text/plain, Size: 3273 bytes --]
From e95845dc96459bbeeb1cc3e5735abaf4e17ddb1b Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 23 Apr 2014 15:07:40 -0400
Subject: [PATCH] xentrace/formats: Fix TRC_SCHED_* outputs.
Most of the trace formats have the 'vcpu_id', not a domain id,
as the second argument. Hence swap in the right name.
TRC_SCHED_SWITCH has the vcpu-id as the second and fourth
argument - update that too.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
tools/xentrace/formats | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index da658bf..9c1bb94 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -19,17 +19,17 @@
0x00021311 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) offline_to_runnable [ dom:vcpu = 0x%(1)08x ]
0x00021321 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) offline_to_blocked [ dom:vcpu = 0x%(1)08x ]
-0x00028001 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_add_domain [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028002 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_rem_domain [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028003 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_sleep [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028004 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_wake [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028005 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_yield [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
+0x00028001 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_add_domain [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028002 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_rem_domain [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028003 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_sleep [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028004 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_wake [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028005 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_yield [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
0x00022006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ dom:vcpu = 0x%(1)08x, domid = 0x%(2)08x ]
-0x00028007 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_shutdown [ domid = 0x%(1)08x, edomid = 0x%(2)08x, reason = 0x%(3)08x ]
+0x00028007 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_shutdown [ domid = 0x%(1)08x, vcpu = 0x%(2)08x, reason = 0x%(3)08x ]
0x00028008 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_ctl
0x00028009 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_adjdom [ domid = 0x%(1)08x ]
-0x0002800a CPU%(cpu)d %(tsc)d (+%(reltsc)8d) __enter_scheduler [ prev<domid:edomid> = 0x%(1)08x : 0x%(2)08x, next<domid:edomid> = 0x%(3)08x : 0x%(4)08x ]
+0x0002800a CPU%(cpu)d %(tsc)d (+%(reltsc)8d) __enter_scheduler [ prev<domid:vcpu> = 0x%(1)08x : 0x%(2)08x, next<domid:vcpu> = 0x%(3)08x : 0x%(4)08x ]
0x0002800b CPU%(cpu)d %(tsc)d (+%(reltsc)8d) s_timer_fn
0x0002800c CPU%(cpu)d %(tsc)d (+%(reltsc)8d) t_timer_fn
0x0002800d CPU%(cpu)d %(tsc)d (+%(reltsc)8d) dom_timer_fn
--
1.8.5.3
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-04-23 21:28 RUNSTATE_runnable delta time for idle_domain accounted to HVM guest Konrad Rzeszutek Wilk
@ 2014-04-24 7:58 ` Jan Beulich
2014-04-24 18:02 ` Konrad Rzeszutek Wilk
2014-04-29 9:16 ` George Dunlap
From: Jan Beulich @ 2014-04-24 7:58 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: george.dunlap, xen-devel, dario.faggioli
>>> On 23.04.14 at 23:28, <konrad.wilk@oracle.com> wrote:
> Question 1: Following the code path, schedule_tail
> for the idle domain would call idle_loop.
>
> How do we end up from idle_loop in vcpu_wake?
>
> Is that because the HPET (on another CPU)
> has raised the softirq(TIMER_SOFTIRQ) because the
> timer has expired?
On another or on the same CPU, because work got moved to the CPU
in question, because some other vCPU in the guest triggered activity
in a vCPU currently on that CPU, or because some guest set timer
expired, needing the vCPU to run again.
> Question 2:
>
> Who would trigger the SCHEDULE_SOFTIRQ for that?
> I was initially thinking that the 'do_block'. But that
> I think triggers the first call to 'schedule' which
> sets the idle domain to run. Help? It could be
> 'vcpu_kick' but 'v->running=0' (done by schedule->context_saved).
> Help!? Who could it be?
At the example of the credit scheduler, it's vcpu_wake() ->
csched_vcpu_wake() -> __runq_tickle() that raises the softirq
(if needed).
> Then 'schedule' is called where the 'prev' is the idle
> domain and 'next' is the guest. However, because 'next' got
> labelled as 'runstate_RUNNABLE' we account _all of the time
> that the idle domain had been running as belonging to the guest_.
Not really - together with the state change vcpu_runstate_change()
also sets v->runstate.state_entry_time for the new state, i.e. only
the time since the vCPU became runnable is accounted here.
Jan
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-04-24 7:58 ` Jan Beulich
@ 2014-04-24 18:02 ` Konrad Rzeszutek Wilk
From: Konrad Rzeszutek Wilk @ 2014-04-24 18:02 UTC (permalink / raw)
To: Jan Beulich; +Cc: george.dunlap, xen-devel, dario.faggioli
On Thu, Apr 24, 2014 at 08:58:25AM +0100, Jan Beulich wrote:
> >>> On 23.04.14 at 23:28, <konrad.wilk@oracle.com> wrote:
> > Question 1: Following the code path, schedule_tail
> > for the idle domain would call idle_loop.
> >
> > How do we end up from idle_loop in vcpu_wake?
> >
> > Is that because the HPET (on another CPU)
> > has raised the softirq(TIMER_SOFTIRQ) because the
> > timer has expired?
>
> On another or on the same CPU, because work got moved to the CPU
> in question, because some other vCPU in the guest triggered activity
> in a vCPU currently on that CPU, or because some guest set timer
> expired, needing the vCPU to run again.
>
> > Question 2:
> >
> > Who would trigger the SCHEDULE_SOFTIRQ for that?
> > I was initially thinking that the 'do_block'. But that
> > I think triggers the first call to 'schedule' which
> > sets the idle domain to run. Help? It could be
> > 'vcpu_kick' but 'v->running=0' (done by schedule->context_saved).
> > Help!? Who could it be?
>
> At the example of the credit scheduler, it's vcpu_wake() ->
> csched_vcpu_wake() -> __runq_tickle() that raises the softirq
> (if needed).
<smacks his head>
And it is right there in 'vcpu_wake':
    if ( v->runstate.state >= RUNSTATE_blocked )
        vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
--> SCHED_OP(VCPU2OP(v), wake, v); <----
Now I just have to figure out why there is a delta of 6.7 msec
between the 'vcpu_runstate_change' and the 'wake' triggering
'schedule' on the CPU that is idle.
>
> > Then 'schedule' is called where the 'prev' is the idle
> > domain and 'next' is the guest. However, because 'next' got
> > labelled as 'runstate_RUNNABLE' we account _all of the time
> > that the idle domain had been running as belonging to the guest_.
>
> Not really - together with the state change vcpu_runstate_change()
> also sets v->runstate.state_entry_time for the new state, i.e. only
> the time since the vCPU became runnable is accounted here.
Yup! I somehow missed the 'SCHED_OP' call in 'vcpu_wake'.
Now off to figure out why it takes so long for the SCHEDULE_SOFTIRQ
to get invoked on the CPU.
More debugging. Thanks for the pointers!
>
> Jan
>
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-04-23 21:28 RUNSTATE_runnable delta time for idle_domain accounted to HVM guest Konrad Rzeszutek Wilk
2014-04-24 7:58 ` Jan Beulich
@ 2014-04-29 9:16 ` George Dunlap
2014-04-29 12:42 ` Konrad Rzeszutek Wilk
From: George Dunlap @ 2014-04-29 9:16 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk, xen-devel, dario.faggioli
On 04/23/2014 10:28 PM, Konrad Rzeszutek Wilk wrote:
> What we are observing is that if a domain is idle its steal
> time* goes up. My first thought was - well that is the initial
> domain taking the time - but after looking at the trace
> did not see the initial domain to be scheduled at all.
>
> (*steal time: RUNSTATE_runnable + RUNSTATE_offline).
"Up" like how much?
Steal time includes the time *being woken* up. It takes time to be
woken up; typically if it's being woken up from domain 0, for instance,
the wake (which sets it to RUNSTATE_runnable) will happen on a different
pcpu than the vcpu being woken is on, so there's the delay of the IPI,
waking up, going through the scheduler, &c.
The more frequently a VM is already running, the lower the probability
that an interrupt will actually wake it up.
BTW, is there a reason you're using xentrace_format instead of xenalyze?
hg clone http://xenbits.xenproject.org/ext/xenalyze
Unlike xentrace_format, xenalyze can:
1. Report the trace records in the order they happened across all pcpus,
so you can see interactions between pcpus
2. Do its own runstate analysis on a per-vcpu level, allowing you to see
not only how much time was spent in the "runnable" state, but how much
of it was due to being woken up vs being preempted.
3. Allow you to see statistics on how long the "waking up" process took
(average, and 5th/50th/95th %ile)
4. Give you a framework to allow you to easily add your own analysis if
you want. For instance, you could add statistics for how often a vcpu
was woken up due to a local timer event vs woken up by an event from
dom0 on another cpu, &c.
-George
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-04-29 9:16 ` George Dunlap
@ 2014-04-29 12:42 ` Konrad Rzeszutek Wilk
2014-05-06 17:36 ` Konrad Rzeszutek Wilk
From: Konrad Rzeszutek Wilk @ 2014-04-29 12:42 UTC (permalink / raw)
To: George Dunlap; +Cc: xen-devel, dario.faggioli
On Tue, Apr 29, 2014 at 10:16:39AM +0100, George Dunlap wrote:
> On 04/23/2014 10:28 PM, Konrad Rzeszutek Wilk wrote:
> >What we are observing is that if a domain is idle its steal
> >time* goes up. My first thought was - well that is the initial
> >domain taking the time - but after looking at the trace
> >did not see the initial domain to be scheduled at all.
> >
> >(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
>
> "Up" like how much?
6.7 msec, or ~1/4 of the timeslice.
>
> Steal time includes the time *being woken* up. It takes time to be
> woken up; typically if it's being woken up from domain 0, for
> instance, the wake (which sets it to RUNSTATE_runnable) will happen
> on a different pcpu than the vcpu being woken is on, so there's the
> delay of the IPI, waking up, going through the scheduler, &c.
Right. In this case there are no IPIs. Just softirq handlers being
triggered (by some other VCPU, it seems), and they run. And the
time between the 'vcpu_wake' and the 'schedule' softirq is quite
long.
>
> The more frequently a VM is already running, the lower probability
> that an interrupt will actually wake it up.
Right. But there are no interrupts here at all. It is just idling.
>
> BTW, is there a reason you're using xentrace_format instead of xenalyze?
I did use xenalyze, and it told me that the vCPU is busy spending most
of its time in the 'runnable' state. The other vCPUs are doing other
work.
>
> hg clone http://xenbits.xenproject.org/ext/xenalyze
>
> Unlike xentrace_format, xenalyze can:
> 1. Report the trace records in the order they happened across all
> pcpus, so you can see interactions between pcpus
Right. I was thinking of looking back at that, but right now I just
want to understand why there is this long delay between 'vcpu_wake'
and 'schedule'. I have added more tracing to help with this.
> 2. Do its own runstate analysis on a per-vcpu level, allowing you to
> see not only how much time was spent in the "runnable" state, but
> how much of it was due to being woken up vs being preempted.
Hm, that would be interesting, but I think it will tell me exactly
the same thing - a very long time between switching from blocked to
runnable and being scheduled.
> 3. Allow you to see statistics on how long the "waking up" process
> took (average, and 5th/50th/95th %ile)
> 4. Give you a framework to allow you to easily add your own analysis
> if you want. For instance, you could add statistics for how often a
> vcpu was woken up due to a local timer event vs woken up by an event
> from dom0 on another cpu, &c.
I had issues with that. It seems to require some special record
in the trace that sometimes I don't have.
>
> -George
>
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-04-29 12:42 ` Konrad Rzeszutek Wilk
@ 2014-05-06 17:36 ` Konrad Rzeszutek Wilk
2014-05-07 8:07 ` Jan Beulich
From: Konrad Rzeszutek Wilk @ 2014-05-06 17:36 UTC (permalink / raw)
To: George Dunlap, keir.xen; +Cc: xen-devel, dario.faggioli
On Tue, Apr 29, 2014 at 08:42:06AM -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Apr 29, 2014 at 10:16:39AM +0100, George Dunlap wrote:
> > On 04/23/2014 10:28 PM, Konrad Rzeszutek Wilk wrote:
> > >What we are observing is that if a domain is idle its steal
> > >time* goes up. My first thought was - well that is the initial
> > >domain taking the time - but after looking at the trace
> > >did not see the initial domain to be scheduled at all.
> > >
> > >(*steal time: RUNSTATE_runnable + RUNSTATE_offline).
> >
> > "Up" like how much?
>
> 6.7msec. Or ~1/4 of the timeslice
> >
> > Steal time includes the time *being woken* up. It takes time to be
> > woken up; typically if it's being woken up from domain 0, for
> > instance, the wake (which sets it to RUNSTATE_runnable) will happen
> > on a different pcpu than the vcpu being woken is on, so there's the
> > delay of the IPI, waking up, going through the scheduler, &c.
>
> Right. In this case there are no IPIs. Just softirq handlers being
[edit: There is the IPI associated with raise_softirq_action being
broadcast to CPUs]
> triggered (by some other VCPU it seems) and they run.. And the
> time between the 'vcpu_wake' and the 'schedule' softirq are quite
> long.
> >
> > The more frequently a VM is already running, the lower probability
> > that an interrupt will actually wake it up.
>
> Right. But there are no interrupt here at all. It is just idling.
[edit: Just the IPI when it is halted and the idle guest has been
scheduled in]
> >
> > BTW, is there a reason you're using xentrace_format instead of xenalyze?
>
> I did use xenalyze, and it told me that the vCPU is busy spending most
> of its time in 'runnable' condition. The other vCPUs are doing other
> work.
I finally narrowed it down. We are contending on the 'tasklet_lock' spinlock.
The steps we take to get to this state are as follows (imagine four
30-VCPU guests pinned to their sockets - one socket per guest).
a). Guest does 'HLT', we schedule in idle domain.
b). The guest's timer is triggered, an IPI comes in, we get out of
hlt;pause
c). and softirq_pending has 'TIMER' (0) set. We end up doing this:
idle_loop
->do_softirq
timer_softirq_action
vcpu_singleshot_timer_fn
vcpu_periodic_timer_work
send_timer_event->
send_guest_vcpu_virq->
evtchn_set_pending->
vcpu_mark_events_pending->
[here it can call hvm_assert_evtchn_irq, which schedules a tasklet:
tasklet_schedule(&v->arch.hvm_vcpu.assert_evtchn_irq_tasklet);]
d). We get back to do_softirq, and softirq_pending has 'TASKLET' set. We
call:
idle_loop
->do_softirq
->tasklet_softirq_action
-> spin_lock_irq(&tasklet_lock);
[takes a while]
-> do_tasklet_work
-> unlock
-> call the work function:
->hvm_assert_evtchn_irq
vcpu_kick->
vcpu_unblock->
vcpu_wake
[vcpu_runstate_change: blocked -> runnable]
-> from here on, any activity is accounted to the guest
[__runq_tickle sets the SCHEDULE_SOFTIRQ]
-> spin_lock_irq(&tasklet_lock);
[takes also a bit of time]
-> unlock
e). We get back to do_softirq, and softirq_pending has 'SCHEDULE' set. We
swap out the idle domain and stick in the new guest. The runtime
in RUNNABLE includes the time taken to acquire the 'tasklet_lock'.
f). Call INJ_VIRQ with the 0xf3 to wake guest up.
N.B.
The softirq handlers that are run end up being: TIMER, TASKLET, and SCHEDULE.
As in, low latency (TIMER), high latency (TASKLET), and low latency (SCHEDULE).
The 'tasklet_lock' on this particular setup ends up being hit from three
different NUMA nodes, and of course at the same time. My belief is that
the application in question sets the same user-space 'alarm' timer in
all the guests - when they all go to sleep, they are supposed to wake
up at the same time. And I think this is done in all of the guests, so
it is a stampede of everybody waking up from 'hlt' at the same time.
There is no oversubscription.
The reason we schedule a tasklet instead of continuing with a
'vcpu_kick' is not yet known to me. This commit added the mechanism
to do it via the tasklet:
commit a5db2986d47fafc5e62f992616f057bfa43015d9
Author: Keir Fraser <keir.fraser@citrix.com>
Date: Fri May 8 11:50:12 2009 +0100
x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
context or with IRQs disabled. Ensure this by deferring to tasklet
(softirq) context if required.
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
But I am not sure why:
a). 'must not be called in IRQ context or with IRQs disabled' is
important - I haven't dug into the code yet to understand the
crucial reasons for it - is there a known issue about this?
b). Why do we have per-cpu tasklet lists, yet any manipulation of
their items is protected by a global lock? Looking at the code in
Linux and Xen, the major difference is that Xen can schedule tasklets
on specific CPUs (a tasklet can even schedule itself onto another CPU).
Linux's variants of tasklets are much simpler - and don't have
any spinlocks (except for the atomic state of the tasklet running
or being scheduled to run).
I can see the need for the tasklets being on different CPUs for
the microcode, and I am digging through the other ones to get
a feel for it - but has anybody thought about improving this
code? Have there been any suggestions/ideas tossed around in the
past (the mailing list didn't help, or my Google-fu sucks)?
Thanks.
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-05-06 17:36 ` Konrad Rzeszutek Wilk
@ 2014-05-07 8:07 ` Jan Beulich
2014-05-07 13:33 ` Konrad Rzeszutek Wilk
From: Jan Beulich @ 2014-05-07 8:07 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: George Dunlap, xen-devel, dario.faggioli, keir.xen
>>> On 06.05.14 at 19:36, <konrad.wilk@oracle.com> wrote:
> The reason we schedule a tasklet instead of continuing with an
> 'vcpu_kick' is not yet known to me. This commit added the mechanism
> to do it via the tasklet:
>
> commit a5db2986d47fafc5e62f992616f057bfa43015d9
> Author: Keir Fraser <keir.fraser@citrix.com>
> Date: Fri May 8 11:50:12 2009 +0100
>
> x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
> context or with IRQs disabled. Ensure this by deferring to tasklet
> (softirq) context if required.
>
> Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
>
> But I am not sure why:
>
> a). 'must not be called in IRQ context or with IRQs disabled' is
> important - I haven't dug in the code yet to understand the
> crucial reasons for - is there a known issue about this?
Because of its use of spin_lock(), which would have the potential
for a deadlock if the function was called in the wrong context. The
apparent alternative would be to make all users of
d->arch.hvm_domain.irq_lock use the IRQ-safe variant; I didn't
check whether that would in turn cause new problems.
> b). Why do we have a per-cpu tasklet lists, but any manipulation of the
> items of them are protected by a global lock. Looking at the code in
> Linux and Xen the major difference is that Xen can schedule on specific CPUs
> (or even the tasklet can schedule itself on another CPU).
tasklet_kill() and migrate_tasklets_from_cpu() at the very least
would need very careful modification if you were to replace the
global lock.
> I can see the need for the tasklets being on different CPUs for
> the microcode, and I am digging through the other ones to get
> a feel for it - but has anybody thought about improving this
> code? Has there been any suggestions/ideas tossed around in the
> past (the mailing list didn't help or my Google-fun sucks).
I'm not aware of any such, quite likely because no-one so far
spotted a problem with the current implementation.
Jan
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-05-07 8:07 ` Jan Beulich
@ 2014-05-07 13:33 ` Konrad Rzeszutek Wilk
2014-05-07 14:10 ` Jan Beulich
From: Konrad Rzeszutek Wilk @ 2014-05-07 13:33 UTC (permalink / raw)
To: Jan Beulich; +Cc: George Dunlap, xen-devel, dario.faggioli, keir.xen
On Wed, May 07, 2014 at 09:07:32AM +0100, Jan Beulich wrote:
> >>> On 06.05.14 at 19:36, <konrad.wilk@oracle.com> wrote:
> > The reason we schedule a tasklet instead of continuing with an
> > 'vcpu_kick' is not yet known to me. This commit added the mechanism
> > to do it via the tasklet:
I should also add that with a bit more digging I realized that
the reason we have so many tasklets running (the guests
couldn't be so insane as to schedule that many alarms for one-shot
timers!) is that we do PCI passthrough. And that uses the
'hvm_dirq_assist' tasklet (hvm_do_IRQ_dpci). As in, we serialize
passthrough interrupts for all of the guests via this tasklet lock.
And in this particular setup, the other sockets are busy
doing I/O over PCI passthrough devices. Hence the lock is taken
quite frequently - across all of the sockets.
<Big sigh>
> >
> > commit a5db2986d47fafc5e62f992616f057bfa43015d9
> > Author: Keir Fraser <keir.fraser@citrix.com>
> > Date: Fri May 8 11:50:12 2009 +0100
> >
> > x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
> > context or with IRQs disabled. Ensure this by deferring to tasklet
> > (softirq) context if required.
> >
> > Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
> >
> > But I am not sure why:
> >
> > a). 'must not be called in IRQ context or with IRQs disabled' is
> > important - I haven't dug in the code yet to understand the
> > crucial reasons for - is there a known issue about this?
>
> Because of its use of spin_lock(), which would have the potential
> for a deadlock if the function was called in the wrong context. The
> apparent alternative would be to make all users of
> d->arch.hvm_domain.irq_lock use the IRQ-safe variant; I didn't
> check whether that would in turn cause new problems.
<chuckles> Thanks for the pointer.
>
> > b). Why do we have a per-cpu tasklet lists, but any manipulation of the
> > items of them are protected by a global lock. Looking at the code in
> > Linux and Xen the major difference is that Xen can schedule on specific CPUs
> > (or even the tasklet can schedule itself on another CPU).
>
> tasklet_kill() and migrate_tasklets_from_cpu() at the very least
> would need very careful modification if you were to replace the
> global lock.
<nods> That was my feeling as well.
>
> > I can see the need for the tasklets being on different CPUs for
> > the microcode, and I am digging through the other ones to get
> > a feel for it - but has anybody thought about improving this
> > code? Has there been any suggestions/ideas tossed around in the
> > past (the mailing list didn't help or my Google-fun sucks).
>
> I'm not aware of any such, quite likely because no-one so far
> spotted a problem with the current implementation.
Yeey! First one to discover!
>
> Jan
>
* Re: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest.
2014-05-07 13:33 ` Konrad Rzeszutek Wilk
@ 2014-05-07 14:10 ` Jan Beulich
From: Jan Beulich @ 2014-05-07 14:10 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: George Dunlap, xen-devel, dario.faggioli, keir.xen
>>> On 07.05.14 at 15:33, <konrad.wilk@oracle.com> wrote:
> On Wed, May 07, 2014 at 09:07:32AM +0100, Jan Beulich wrote:
>> >>> On 06.05.14 at 19:36, <konrad.wilk@oracle.com> wrote:
>> > The reason we schedule a tasklet instead of continuing with an
>> > 'vcpu_kick' is not yet known to me. This commit added the mechanism
>> > to do it via the tasklet:
>
> I should also add that a bit more digging and I realized that
> the reason we have so many of the tasklets running (the guests
> couldn't be that insane to schedule so many alarms for one-shot timers!)
> is that we do PCI passthrough. And that uses the 'hvm_dirq_assist'
> tasklet (hvm_do_IRQ_dpci). As in we serialize passthrough interrupts
> for all of the guests via this tasklet lock.
Now that is indeed something in need of improvement. I'm afraid
the Intel guys who wrote all this aren't around anymore...
Jan