Subject: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest
From: Konrad Rzeszutek Wilk @ 2014-04-23 21:28 UTC
  To: xen-devel, george.dunlap, dario.faggioli

Hey George and Dario,

I am looking at a particular issue for which I would appreciate
your expert knowledge of the scheduler's inner workings.

I have two questions which I hope will steer me in the right
debugging direction. But before I go that route, let me explain
the scenario.

There are seven guests (eight if you include the initial domain).
Each guest is partitioned to have exclusive use of a range of CPUs;
that is, none of the guests (except the initial domain) overlap with
the others. This is enforced by having 'cpu=X-Y' in the guest config.

As such, the runqueue for each physical CPU contains the idle domain,
the guest's vCPU, and occasionally the initial domain.
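
For illustration, the kind of config fragment I mean (the pCPU
numbers here are made up):

    vcpus = 4
    cpus  = "8-11"   # this guest's vCPUs may only run on pCPUs 8-11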

What we are observing is that when a domain is idle, its steal
time* goes up. My first thought was that it is the initial
domain taking the time - but after looking at the trace I did
not see the initial domain being scheduled at all.

(*steal time: RUNSTATE_runnable + RUNSTATE_offline.)
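
For reference, the guest-visible structure I mean is the runstate
info from xen/include/public/vcpu.h (quoted from memory, so
double-check the exact layout); the guest's steal counter is just
the sum of two of its buckets:

    struct vcpu_runstate_info {
        int      state;            /* RUNSTATE_* the vCPU is in now     */
        uint64_t state_entry_time; /* system time when it entered it    */
        uint64_t time[4];          /* cumulative ns spent in each state */
    };

    /* what the guest reports as stolen time */
    static uint64_t steal_ns(const struct vcpu_runstate_info *r)
    {
        return r->time[RUNSTATE_runnable] + r->time[RUNSTATE_offline];
    }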

The issue I see is that the guest ends up hitting
'HLT' and the scheduler picks the idle domain. The scheduler
kicks off the idle domain, which ....

	Question 1: Following the code path, schedule_tail
	for the idle domain would call idle_loop.

	How do we get from idle_loop to vcpu_wake?

	Is that because the HPET (on another CPU)
	has raised the softirq(TIMER_SOFTIRQ) because the
	timer has expired?

	It certainly looks like it could happen:

	evt_do_broadcast (on some other CPU) sets TIMER_SOFTIRQ

	idle_loop ->
	  do_softirq ->
	    timer_softirq_action ->
	      vcpu_singleshot_timer_fn ->
	        vcpu_periodic_timer_work ->
	          send_timer_event ->
	            send_guest_vcpu_virq ->
	              evtchn_set_pending ->
	                vcpu_mark_events_pending ->
	                  vcpu_kick ->
	                    vcpu_unblock ->
	                      vcpu_wake

        And then on the next iteration, somebody has to
        set the (SCHEDULE_SOFTIRQ)??
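
        For reference, idle_loop itself (paraphrased from
        xen/arch/x86/domain.c, so treat it as a sketch rather than a
        verbatim quote) never gets back to schedule() on its own - it
        just keeps halting and draining softirqs:

            void idle_loop(void)
            {
                for ( ; ; )
                {
                    (*pm_idle)();   /* HLT/mwait until an interrupt arrives */
                    do_tasklet();
                    do_softirq();   /* runs timer_softirq_action(), and runs
                                     * schedule() only if SCHEDULE_SOFTIRQ
                                     * has been raised in the meantime      */
                }
            }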

	Anyhow,

.. ends up calling vcpu_wake. vcpu_wake
changes the guest's runstate from RUNSTATE_blocked
to RUNSTATE_runnable. Then we call schedule().

	Question 2:

	Who would trigger the SCHEDULE_SOFTIRQ for that?
	I was initially thinking of 'do_block', but I think that
	triggers the first call to 'schedule', which sets the idle
	domain to run. Help? It could be 'vcpu_kick', but by then
	'v->running' is 0 (done by schedule->context_saved).
	Help!? Who could it be?
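
For what it is worth, my (paraphrased, unverified) reading of
vcpu_wake in xen/common/schedule.c is that besides the runstate
change it also calls the scheduler's wake hook, and I am assuming
that hook (for credit, csched_vcpu_wake tickling the runqueue) is
what ends up raising SCHEDULE_SOFTIRQ on the target pCPU:

    void vcpu_wake(struct vcpu *v)
    {
        /* under the per-pCPU schedule lock */
        if ( vcpu_runnable(v) )
        {
            if ( v->runstate.state >= RUNSTATE_blocked )
                vcpu_runstate_change(v, RUNSTATE_runnable, NOW());
            SCHED_OP(VCPU2OP(v), wake, v);  /* scheduler-specific wake;
                                             * presumably where the
                                             * SCHEDULE_SOFTIRQ comes from */
        }
        /* ... */
    }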

Then 'schedule' is called, where 'prev' is the idle
domain and 'next' is the guest. However, because 'next' is
labelled as RUNSTATE_runnable, we account _all of the time
that the idle domain had been running as belonging to the guest_.

That is what I think is happening, and it certainly explains
why the guest has such a large steal time. It looks like a bug
to me, but perhaps that is how it is supposed to be accounted?
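
The accounting step itself, as I read it (paraphrased from
vcpu_runstate_change in xen/common/schedule.c - a sketch, not a
verbatim quote): whatever time has elapsed since state_entry_time is
charged to the runstate the vCPU is leaving, so once vcpu_wake has
put VM05 into RUNSTATE_runnable, the whole gap until the
runnable->running switch ends up in time[RUNSTATE_runnable], i.e. in
what the guest counts as steal:

    static inline void vcpu_runstate_change(
        struct vcpu *v, int new_state, s_time_t new_entry_time)
    {
        s_time_t delta = new_entry_time - v->runstate.state_entry_time;

        if ( delta > 0 )
        {
            /* charge the elapsed time to the state we are leaving */
            v->runstate.time[v->runstate.state] += delta;
            v->runstate.state_entry_time = new_entry_time;
        }
        v->runstate.state = new_state;
    }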

Here is the trace (I am also attaching a patch, as the 'formats'
file had a bug in it).

Upon closer inspection we have something like this:

 (+    1104)  do_yield          [ domid = 0x00000005, edomid = 0x00000004 ]
 (+  353598)  continue_running    [ dom:vcpu = 0x00050004 ]
 (+   41577)  VMENTRY
 (+   77715)  VMEXIT      [ exitcode = 0x0000000c, rIP  = 0xffffffff810402ea ]
 (+   47451)  do_block          [ domid = 0x00000005, vcpuid = 0x00000004 ]
 (+       0)  HLT         [ intpending = 0 ]
 (+  190338)  switch_infprev    [ old_domid = 0x00000005, runtime = 24803749 ]
 (+     264)  switch_infnext    [ new_domid = 0x00007fff, time = 24803749, r_time = 4294967295 ]
 (+     309)  __enter_scheduler [ prev<domid:vcpu> = 0x00000005 : 0x00000004, next<domid:vcpu> = 0x00007fff : 0x00000081 ]
 (+     318)  running_to_blocked  [ dom:vcpu = 0x00050004 ]

[VM 05 is now in runstate_blocked]
 (+   30516)  runnable_to_running [ dom:vcpu = 0x7fff0081 ]

=> Here schedule_tail is called for the idle domain. <= Presumably it runs 'idle_loop'.

... some time later...


vcpu_wake:
 (+  794301)  blocked_to_runnable [ dom:vcpu = 0x00050004 ]

=> Moves VM05 from RUNSTATE_blocked to RUNSTATE_runnable.

 (+   88113)  domain_wake       [ domid = 0x00000005, vcpu = 0x00000004 ]


 (+15749685)  switch_infprev    [ old_domid = 0x00007fff, runtime = 6863168 ]

(IDLE domain ran for [6.57msec] !! )

 (+     288)  switch_infnext    [ new_domid = 0x00000005, time = 6458163, r_time = 30000000 ]

And we switch to VM05:

 TRACE_3D(TRC_SCHED_SWITCH_INFNEXT,
          next->domain->domain_id,                      /* == VM05 */
          (next->runstate.state == RUNSTATE_runnable) ? /* yes, because vcpu_wake changed it */
              (now - next->runstate.state_entry_time) : 0,
          next_slice.time);

And the time that the idle domain ran gets accounted to the vCPU of
VM05.

 (+   27717)  __enter_scheduler [ prev<domid:vcpu> = 0x00007fff : 0x00000081, next<domid:vcpu> = 0x00000005 : 0x00000004 ]
 (+     363)  running_to_runnable [ dom:vcpu = 0x7fff0081 ]
 (+   63174)  runnable_to_running [ dom:vcpu = 0x00050004 ]
 (+       0)  INJ_VIRQ    [ vector = 0xf3, fake = 0 ]
 (+       0)  INTR_WINDOW [ value = 0x000000f3 ]
 (+  164763)  VMENTRY

[VM05 is started, injecting the callback vector - presumably with
the VIRQ_TIMER event.]

[-- Attachment #2: 0001-xentrace-formats-Fix-TRC_SCHED_-outputs.patch --]
[-- Type: text/plain, Size: 3273 bytes --]

From e95845dc96459bbeeb1cc3e5735abaf4e17ddb1b Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Wed, 23 Apr 2014 15:07:40 -0400
Subject: [PATCH] xentrace/formats: Fix TRC_SCHED_*  outputs.

Most of the trace formats have the 'vcpu_id', not a domain id, as
the second argument. Hence switching to the right name.

The TRC_SCHED_SWITCH format has the vcpu id as the second and fourth
arguments - updating that too.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 tools/xentrace/formats | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index da658bf..9c1bb94 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -19,17 +19,17 @@
 0x00021311  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  offline_to_runnable [ dom:vcpu = 0x%(1)08x ]
 0x00021321  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  offline_to_blocked  [ dom:vcpu = 0x%(1)08x ]
 
-0x00028001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_add_domain  [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_rem_domain  [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028003  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_sleep      [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028004  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_wake       [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028005  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  do_yield          [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
-0x00028006  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  do_block          [ domid = 0x%(1)08x, edomid = 0x%(2)08x ]
+0x00028001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_add_domain  [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_rem_domain  [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028003  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_sleep      [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028004  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_wake       [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028005  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  do_yield          [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
+0x00028006  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  do_block          [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ]
 0x00022006  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  do_block          [ dom:vcpu = 0x%(1)08x, domid = 0x%(2)08x ]
-0x00028007  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_shutdown	  [ domid = 0x%(1)08x, edomid = 0x%(2)08x, reason = 0x%(3)08x ]
+0x00028007  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_shutdown	  [ domid = 0x%(1)08x, vcpu = 0x%(2)08x, reason = 0x%(3)08x ]
 0x00028008  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_ctl
 0x00028009  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  sched_adjdom      [ domid = 0x%(1)08x ]
-0x0002800a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  __enter_scheduler [ prev<domid:edomid> = 0x%(1)08x : 0x%(2)08x, next<domid:edomid> = 0x%(3)08x : 0x%(4)08x ]
+0x0002800a  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  __enter_scheduler [ prev<domid:vcpu> = 0x%(1)08x : 0x%(2)08x, next<domid:vcpu> = 0x%(3)08x : 0x%(4)08x ]
 0x0002800b  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  s_timer_fn
 0x0002800c  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  t_timer_fn
 0x0002800d  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  dom_timer_fn
-- 
1.8.5.3

