From mboxrd@z Thu Jan 1 00:00:00 1970 From: Konrad Rzeszutek Wilk Subject: RUNSTATE_runnable delta time for idle_domain accounted to HVM guest. Date: Wed, 23 Apr 2014 17:28:24 -0400 Message-ID: <20140423212824.GB12560@phenom.dumpdata.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="17pEHd4RhPHOinZp" Return-path: Received: from mail6.bemta14.messagelabs.com ([193.109.254.103]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1Wd4ib-0001bj-45 for xen-devel@lists.xenproject.org; Wed, 23 Apr 2014 21:28:37 +0000 Content-Disposition: inline List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: xen-devel@lists.xenproject.org, george.dunlap@eu.citrix.com, dario.faggioli@citrix.com List-Id: xen-devel@lists.xenproject.org --17pEHd4RhPHOinZp Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Hey George and Dario, I am looking at a particular issue for which I would appreciate your expert knowledge of the scheduler's inner workings. I have two questions to which I hope will steer me in the right debug direction. But before I go that route let me explain the scenario. There are seven guests (eight if you include initial domain)- each guest is partitioned to have exclusive usage of a range of CPUs. That is, none of the guests (except initial domain) overlap with other ones. This is enforced by having 'cpu=X-Y' in the guest config. As such the runqueue for each physical CPU contains the idle domain, the guest vcpu, and then occasionally the initial domain. What we are observing is that if a domain is idle its steal time* goes up. My first thought was - well that is the initial domain taking the time - but after looking at the trace did not see the initial domain to be scheduled at all. (*steal time: RUNSTATE_runnable + RUNSTATE_offline). The issue I see is that the guest ends up hitting 'HLT' and the scheduler picks the idle domain. It kicks it off and .... Question 1: Following the code path, schedule_tail for the idle domain would call idle_loop. How do we end up from idle_loop in vcpu_wake? Is that because the HPET (on another CPU) has raised the softirq(TIMER_SOFTIRQ) because the timer has expired? It certinaly looks like it could happen: evt_do_broadcast (on some other CPU), sets TIMER_SOFTIRQ idle_loop ->do_softirq timer_softirq_action vcpu_singleshot_timer_fn vcpu_periodic_timer_work send_timer_event-> send_guest_vcpu_virq-> evtchn_set_pending-> vcpu_mark_events_pending-> vcpu_kick-> vcpu_unblock-> vcpu_wake And then on the next iteration, somebody has to set the (SCHEDULE_SOFTIRQ)?? Anyhow, .. and ends up calling vcpu_wake. vcpu_wake changes the guest's runstate from runstate_BLOCKED to runstate_RUNNABLE. Then we call schdule() Question 2: Who would trigger the SCHEDULE_SOFTIRQ for that? I was initially thinking that the 'do_block'. But that I think triggers the first call to 'schedule' which sets the idle domain to run. Help? It could be 'vcpu_kick' but 'v->running=0' (done by schedule->context_saved). Help!? Who could it be? Then 'schedule' is called where the 'prev' is the idle domain and 'next' is the guest. However, because 'next' got labelled as 'runstate_RUNNABLE' we account _all of the time that the idle domain had been running as belonging to the guest_. That is what I think is happening and it certainly explains why the guest has such a large steal time. It looks like a bug to me, but perhaps that is how it is suppose to be accounted for? Here is the trace (I am also attaching a patch ,as the 'format' file had a bug in it) Upon closer inspection we have something like this: (+ 1104) do_yield [ domid = 0x00000005, edomid = 0x00000004 ] (+ 353598) continue_running [ dom:vcpu = 0x00050004 ] (+ 41577) VMENTRY (+ 77715) VMEXIT [ exitcode = 0x0000000c, rIP = 0xffffffff810402ea ] (+ 47451) do_block [ domid = 0x00000005, vcpuid = 0x00000004 ] (+ 0) HLT [ intpending = 0 ] (+ 190338) switch_infprev [ old_domid = 0x00000005, runtime = 24803749 ] (+ 264) switch_infnext [ new_domid = 0x00007fff, time = 24803749, r_time = 4294967295 ] (+ 309) __enter_scheduler [ prev = 0x00000005 : 0x00000004, next = 0x00007fff : 0x00000081 ] (+ 318) running_to_blocked [ dom:vcpu = 0x00050004 ] [VM 05 is now in runstate_blocked] (+ 30516) runnable_to_running [ dom:vcpu = 0x7fff0081 ] => Here schedule_tail is called for the idle domain. <= presumarily runs 'idle_loop' ... some time later... vcpu_wake: (+ 794301) blocked_to_runnable [ dom:vcpu = 0x00050004 ] => Moves VM05 from RUNstate_blocked to RUNstate_runnable. (+ 88113) domain_wake [ domid = 0x00000005, vcpu = 0x00000004 ] (+15749685) switch_infprev [ old_domid = 0x00007fff, runtime = 6863168 ] (IDLE domain ran for [6.57msec] !! ) (+ 288) switch_infnext [ new_domid = 0x00000005, time = 6458163, r_time = 30000000 ] And we switch to VM05: TRACE_3D(TRC_SCHED_SWITCH_INFNEXT, VM05 ==> next->domain->domain_id, YES?!?!? (next->runstate.state == RUNSTATE_runnable) ? [yes, because vcpu_wake changed it] (now - next->runstate.state_entry_time) : 0, next_slice.time); And account the time that the idle domain ran to the VCPU of the VM05. (+ 27717) __enter_scheduler [ prev = 0x00007fff : 0x00000081, next = 0x00000005 : 0x00000004 ] (+ 363) running_to_runnable [ dom:vcpu = 0x7fff0081 ] (+ 63174) runnable_to_running [ dom:vcpu = 0x00050004 ] (+ 0) INJ_VIRQ [ vector = 0xf3, fake = 0 ] (+ 0) INTR_WINDOW [ value = 0x000000f3 ] (+ 164763) VMENTRY [Start VM05 injecting the callback vector. Presumarily with the VIRQ_TIMER event] --17pEHd4RhPHOinZp Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="0001-xentrace-formats-Fix-TRC_SCHED_-outputs.patch" >>From e95845dc96459bbeeb1cc3e5735abaf4e17ddb1b Mon Sep 17 00:00:00 2001 From: Konrad Rzeszutek Wilk Date: Wed, 23 Apr 2014 15:07:40 -0400 Subject: [PATCH] xentrace/formats: Fix TRC_SCHED_* outputs. Most of the trace syntax have as the second argument the 'vcpu_id' not an domain id. Hence swapping to the right name. The TRC_SCHED_SWITCH has the vcpu-id as second and fourth argument - updating that too. Signed-off-by: Konrad Rzeszutek Wilk --- tools/xentrace/formats | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/tools/xentrace/formats b/tools/xentrace/formats index da658bf..9c1bb94 100644 --- a/tools/xentrace/formats +++ b/tools/xentrace/formats @@ -19,17 +19,17 @@ 0x00021311 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) offline_to_runnable [ dom:vcpu = 0x%(1)08x ] 0x00021321 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) offline_to_blocked [ dom:vcpu = 0x%(1)08x ] -0x00028001 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_add_domain [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] -0x00028002 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_rem_domain [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] -0x00028003 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_sleep [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] -0x00028004 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_wake [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] -0x00028005 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_yield [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] -0x00028006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ domid = 0x%(1)08x, edomid = 0x%(2)08x ] +0x00028001 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_add_domain [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] +0x00028002 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_rem_domain [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] +0x00028003 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_sleep [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] +0x00028004 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_wake [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] +0x00028005 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_yield [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] +0x00028006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ domid = 0x%(1)08x, vcpu = 0x%(2)08x ] 0x00022006 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) do_block [ dom:vcpu = 0x%(1)08x, domid = 0x%(2)08x ] -0x00028007 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_shutdown [ domid = 0x%(1)08x, edomid = 0x%(2)08x, reason = 0x%(3)08x ] +0x00028007 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) domain_shutdown [ domid = 0x%(1)08x, vcpu = 0x%(2)08x, reason = 0x%(3)08x ] 0x00028008 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_ctl 0x00028009 CPU%(cpu)d %(tsc)d (+%(reltsc)8d) sched_adjdom [ domid = 0x%(1)08x ] -0x0002800a CPU%(cpu)d %(tsc)d (+%(reltsc)8d) __enter_scheduler [ prev = 0x%(1)08x : 0x%(2)08x, next = 0x%(3)08x : 0x%(4)08x ] +0x0002800a CPU%(cpu)d %(tsc)d (+%(reltsc)8d) __enter_scheduler [ prev = 0x%(1)08x : 0x%(2)08x, next = 0x%(3)08x : 0x%(4)08x ] 0x0002800b CPU%(cpu)d %(tsc)d (+%(reltsc)8d) s_timer_fn 0x0002800c CPU%(cpu)d %(tsc)d (+%(reltsc)8d) t_timer_fn 0x0002800d CPU%(cpu)d %(tsc)d (+%(reltsc)8d) dom_timer_fn -- 1.8.5.3 --17pEHd4RhPHOinZp Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --17pEHd4RhPHOinZp--