From: George Dunlap <george.dunlap@citrix.com>
To: Dario Faggioli <dario.faggioli@citrix.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
Joao Martins <joao.m.martins@oracle.com>,
Tim Deegan <tim@xen.org>, Jan Beulich <JBeulich@suse.com>
Subject: Re: BUG: NOW() seems to (sometimes) go backwards!
Date: Thu, 9 Jun 2016 11:19:36 +0100 [thread overview]
Message-ID: <575942B8.8060904@citrix.com> (raw)
In-Reply-To: <1465314895.15816.165.camel@citrix.com>
On 07/06/16 16:54, Dario Faggioli wrote:
> Hi,
>
> I've been fighting with this during the past few days, while testing
> and benchmarking some of the modification I'm doing to Credit2.
>
> In summary, what I see is NOW() going backwards. The issues shows up
> when I call NOW() on different pCPUs, and compare the results (which is
> something the Credit2 scheduler does a lot, and I would like it to be
> able to continue to do so. :-D).
>
> I can elaborate on why this is really really bad for scheduling in
> general, and even more so for Credit2. For now, let me just say it's
> pretty bad. Feel free to ask for details if interested.
>
> In any case, IMO, we need to fix this.
>
> It does not only happen cross-socket, i.e., it happens even if the two
> pCPUs are part of the same socket (in the example shown below, they're
> two SMT threads of the same core!!).
>
> I'm able to trigger this on more than just one box, all with invariant
> TSC support. (All Intel, though, I don't have any AMD test hardware
> available right now).
>
> In some more details.
>
> The scheduler (both in scheduler.c and in sched_credit{1,2}.c) uses
> NOW() for all the timing and time sampling. In the trace below, at time
> T1 (equal to 1.264582607), we do this:
>
> d0v9->start_time = NOW()
>
> and we are on pCPU 3.
>
> Then, at time T2, on pCPU2, there is this:
>
> if ( d0v9->start_time < NOW() )
> {
> /* OMG!! */
> __trace("csched2:time_backwards ...")
> }
>
> and we indeed hit the __trace().
>
> What the trace point says is that NOW() was 1972180296557 at time T1,
> on pCPU 3, and it is 1972180270044 at time T2 on pCPU2 (i.e., ~26.5
> microsecs in the past).
>
> I speak about NOW() going backwards because, by looking at both the
> code and the trace, I am positive about the fact that T2 happens later
> than T1 (go to [1], if interested)... So, why NOW() is smaller at T2?
> :-(
>
> 1.264577304 ||x||||||||||||| d0v10 runstate_change d0v10 running->runnable
> 1.264577611 ||x||||||||||||| d?v? runstate_change d0v5 runnable->running
> 1.264580025 |||x||||||-||||| d0v0 vcpu_block d0v0
> 1.264580795 |||x||||||-||||| d0v0 csched2:schedule current = d0v0, now = 1972180296557
> 1.264581014 |||x||||||-||||| d0v0 csched2:burn_credits d0v0, credit = 9091956, delta = 39433, start_time = 1972180296557
> 1.264581660 |||x||||||-||||| d0v0 sched_switch prev d0, run for 112.503us
> 1.264581809 |||x||||||-||||| d0v0 sched_switch next d0, was runnable for 54.532us, next slice 2000.0us
> 1.264582033 |||x||||||-||||| d0v0 sched_switch prev d0v0 next d0v9
> 1.264582269 |||x||||||-||||| d0v0 runstate_change d0v0 running->blocked
> T1 1.264582607 |||x||||||-||||| d?v? runstate_change d0v9 runnable->running
> 1.264583109 ||x|||||||-||||| d0v5 csched2:runq_insert d0v10, position 0
> 1.264583341 ||x|||||||-||||| d0v5 csched2:runq_tickle_new d0v10, credit = 5691963
> 1.264583907 ||x|||||||-||||| d0v5 csched2:burn_credits d0v5, credit = 8575685, delta = 8241, start_time = 1972180270044
> 1.264584236 ||x|||||||-||||| d0v5 csched2:burn_credits d1v1, credit = 8696613, delta = 12920, start_time = 1972180270044
> 1.264584431 ||x|||||||-||||| d0v5 csched2:tickle_check d1v1, credit = 8696613
> 1.264584745 ||x|||||||-||||| d0v5 csched2:burn_credits d1v3, credit = 10182550, delta = 12920, start_time = 1972180270044
> 1.264584966 ||x|||||||-||||| d0v5 csched2:tickle_check d1v3, credit = 10182550
> T2 1.264585207 ||x|||||||-||||| d0v5 csched2:time_backwards d0v9: cur_time = 1972180270044, cur_cpu = 2, old_time = 1972180296557, old_cpu = 3
> 1.264585364 ||x|||||||-||||| d0v5 csched2:burn_credits d0v9, credit = 5759499, delta = -26513, start_time = 1972180270044
> 1.264585537 ||x|||||||-||||| d0v5 csched2:tickle_check d0v9, credit = 5759499
> 1.264585876 ||x|||||||-||||| d0v5 csched2:burn_credits d0v7, credit = 8835314, delta = 12920, start_time = 1972180270044
> 1.264586041 ||x|||||||-||||| d0v5 csched2:tickle_check d0v7, credit = 8835314
> 1.264586491 ||x|||||||-||||| d0v5 csched2:burn_credits d0v15, credit = 5822635, delta = 12920, start_time = 1972180270044
> 1.264586655 ||x|||||||-||||| d0v5 csched2:tickle_check d0v15, credit = 5822635
> 1.264586952 ||x|||||||-||||| d0v5 csched2:burn_credits d0v13, credit = 7956174, delta = 12920, start_time = 1972180270044
> 1.264587115 ||x|||||||-||||| d0v5 csched2:tickle_check d0v13, credit = 7956174
> 1.264587385 ||x|||||||-||||| d0v5 csched2:burn_credits d0v3, credit = 7870338, delta = 12920, start_time = 1972180270044
> 1.264587545 ||x|||||||-||||| d0v5 csched2:tickle_check d0v3, credit = 7870338
> 1.264595859 |||x||||||-||||| d0v9 csched2:schedule current = d0v9, now = 1972180311345
> 1.264596313 |||x||||||-||||| d0v9 csched2:burn_credits d0v9, credit = 5718198, delta = 4877, start_time = 1972180311345
>
> Interestingly enough, if we look at the same (well, sort of) set of
> events with just xentrace_format, we, after having sorted the records
> by their timestamps, get this:
>
> CPU2 5613124777311 (+ 507) running_to_runnable [ dom:vcpu = 0x0000000a ]
> CPU2 5613124778046 (+ 735) runnable_to_running [ dom:vcpu = 0x00000005 ]
>
> CPU3 5613124788657 (+ 537) __enter_scheduler [ prev<dom:vcpu> = 0x00000000, next<dom:vcpu> = 0x00000009 ]
> CPU3 5613124789224 (+ 567) running_to_blocked [ dom:vcpu = 0x00000000 ]
> T1 CPU3 5613124790034 (+ 810) runnable_to_running [ dom:vcpu = 0x00000009 ]
>
> CPU2 5613124791240 (+ 13194) csched2:runq_pos [ dom:vcpu = 0x0000000a, pos = 0]
> CPU2 5613124791795 (+ 555) csched2:tickle_new [ dom:vcpu = 0x0000000a, credit = 0x5691963 ]
> CPU2 5613124793154 (+ 1359) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 5 ]
> CPU2 5613124793943 (+ 789) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65537 ]
> CPU2 5613124794411 (+ 468) csched2:tickle_check [ dom:vcpu = 0x00010001, credit = 8696613 ]
> CPU2 5613124795164 (+ 753) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 65539 ]
> CPU2 5613124795695 (+ 531) csched2:tickle_check [ dom:vcpu = 0x00010003, credit = 10182550 ]
> T2 CPU2 5613124796274 (+ 579) unknown (0x000000000002221e) [ 0x2f1abbdc 0x000001cb 0x2f1b236d 0x000001cb 0x00000009 0x00030002 0x00000000 ]
> CPU2 5613124796649 (+ 375) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 9 ]
> CPU2 5613124797066 (+ 417) csched2:tickle_check [ dom:vcpu = 0x00000009, credit = 5759499 ]
> CPU2 5613124797879 (+ 813) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 7 ]
> CPU2 5613124798275 (+ 396) csched2:tickle_check [ dom:vcpu = 0x00000007, credit = 8835314 ]
> CPU2 5613124799355 (+ 1080) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 15 ]
> CPU2 5613124799748 (+ 393) csched2:tickle_check [ dom:vcpu = 0x0000000f, credit = 5822635 ]
> CPU2 5613124800459 (+ 711) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 13 ]
> CPU2 5613124800852 (+ 393) csched2:tickle_check [ dom:vcpu = 0x0000000d, credit = 7956174 ]
> CPU2 5613124801500 (+ 648) csched2:credit burn [ dom:vcpu = 0x2f1abbdc, credit = 459, delta = 3 ]
> CPU2 5613124801884 (+ 384) csched2:tickle_check [ dom:vcpu = 0x00000003, credit = 7870338 ]
>
> CPU3 5613124803177 (+ 13143) csched2:update_load
> CPU3 5613124803786 (+ 609) unknown (0x0000000000022220) [ 0x0000d61e 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
> CPU3 5613124804560 (+ 774) csched2:updt_runq_load [ rq_load[16]:rq_id[8]:wshift[8] = 0x12000009, rq_avgload = 0x0000000000216566, b_avgload = 0x0000000000216566,
> CPU3 5613124804929 (+ 369) unknown (0x0000000000022220) [ 0x0001e8de 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000 ]
> CPU3 5613124805256 (+ 327) csched2:updt_vcpu_load [ dom:vcpu = 0x00000000, vcpuload = 0x0000000000026040, wshift = 18 ]
>
> CPU2 5613124975596 (+ 173712) domain_wake [ dom:vcpu = 0x00010005 ]
>
> Interestingly, the timestamp at T1 is 5613124790034, which looks
> _correctly_ smaller than the timestamp at T2 (5613124796274).
>
> NOW(), as it is called from the scheduling code, does this:
> s_time_t NOW(u64 at_tsc) //it's actually get_s_time_fixed()
> {
> struct cpu_time *t = &this_cpu(cpu_time);
> u64 tsc, delta;
> s_time_t now;
>
> tsc = rdtsc();
> delta = tsc - t->local_tsc_stamp;
> now = t->stime_local_stamp + scale_delta(delta, &t->tsc_scale);
>
> return now;
> }
>
> While the timestamps in the trace are the result of get_cycles():
> static inline cycles_t get_cycles(void)
> {
> return rdtsc();
> }
>
> So, it looks to me that the TSC is actually ok, and it could be the
> local_tsc_stamp and scale_delta() magic done in get_s_time_fixed()
> (which, AFAICUI, is there to convert cycles to time) that does not
> guarantee the result to be monotonic, even if the input is... Thoughts?
What about adding a trace point inside NOW() to record the various
inputs into the calculation?
Adding some sort of global ASSERT() that time doesn't go backwards by
too much would be nice, but I think would require spinlocks -- OTOH,
just for debugging purposes it might not be that bad.
-George
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
prev parent reply other threads:[~2016-06-09 10:20 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-06-07 15:54 BUG: NOW() seems to (sometimes) go backwards! Dario Faggioli
2016-06-08 10:42 ` Jan Beulich
2016-06-08 13:43 ` Dario Faggioli
2016-06-08 14:01 ` Jan Beulich
2016-06-08 20:36 ` Joao Martins
2016-06-09 8:05 ` Jan Beulich
2016-06-09 10:24 ` Joao Martins
2016-06-09 10:43 ` Joao Martins
2016-06-09 11:25 ` Jan Beulich
2016-06-09 11:31 ` Jan Beulich
2016-06-09 10:19 ` George Dunlap [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=575942B8.8060904@citrix.com \
--to=george.dunlap@citrix.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=dario.faggioli@citrix.com \
--cc=joao.m.martins@oracle.com \
--cc=tim@xen.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.