From: Dario Faggioli <dfaggioli@suse.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
David Woodhouse <dwmw@amazon.co.uk>,
Xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
Date: Sat, 27 Jan 2018 02:27:49 +0100
Message-ID: <1517016469.15341.153.camel@suse.com>
In-Reply-To: <be65dc15-ab25-8522-8720-1ab64c6ef959@citrix.com>
> On 25/01/18 16:09, Andrew Cooper wrote:
> > On 25/01/18 15:57, Jan Beulich wrote:
> > > > > >
> > > For the record, the overwhelming majority of calls to
> > > __sync_local_execstate() being responsible for the behavior
> > > come from invalidate_interrupt(), which suggests to me that
> > > there's a meaningful number of cases where a vCPU is migrated
> > > to another CPU and then back, without another vCPU having
> > > run on the original CPU in between. If I'm not wrong with this,
> > > I have to question why the vCPU is migrated then in the first
> > > place.
> >
So, about this. I haven't applied Jan's measurement patch yet (I'm
doing some reshuffling of my dev and test hardware here), but I have
had a look at the traces.
So, Jan, a question: why are you saying "migrated to another CPU **and
then back**"? I'm asking because, AFAICT, the fact that
__sync_local_execstate() is called from invalidate_interrupt() means
that:
* a vCPU is running on a pCPU;
* the vCPU is migrated, and the pCPU becomes idle;
* the vCPU starts to run where it was migrated to, while its 'original'
  pCPU is still idle ==> inv. IPI ==> sync state.
So there seems to me to be no need for the vCPU to actually "go back",
is there?
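
Just to be sure we're talking about the same thing, here's a minimal,
self-contained C sketch of the mechanism I have in mind (this is *not*
the actual Xen code; the struct and function names below are made up
purely for illustration): the 'original' pCPU goes idle but keeps the
vCPU's state loaded lazily, and when the vCPU starts running on its
new pCPU, the invalidate IPI handler on the original pCPU has to sync
that stale state out first.

    /* Simplified illustration of lazy state sync on the original pCPU.
     * NOT the Xen source; all the names here are made up. */
    #include <stdbool.h>
    #include <stdio.h>

    struct pcpu {
        int curr_vcpu;   /* vCPU whose state is (lazily) loaded here  */
        bool idle;       /* pCPU is idle, but state may still be lazy */
    };

    static struct pcpu cpus[16];

    /* Runs on the original pCPU when it receives the invalidate IPI,
     * sent because its (now migrated) vCPU started running elsewhere. */
    static void invalidate_ipi_handler(int cpu, int migrating_vcpu)
    {
        if ( cpus[cpu].idle && cpus[cpu].curr_vcpu == migrating_vcpu )
        {
            /* The stale, lazily kept state must be synced out before
             * the vCPU can safely run on its new pCPU. */
            printf("CPU%d: syncing state of vCPU %d\n", cpu, migrating_vcpu);
            cpus[cpu].curr_vcpu = -1;
        }
    }
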
Anyway, looking at traces, I observed the following:
28.371352689 --|------x------ d32767v9 csched:schedule cpu 9, idle
28.371354095 --|------x------ d32767v9 sched_switch prev d32767v9, run for 3412.789us
28.371354475 --|------x------ d32767v9 sched_switch next d3v8, was runnable for 59.917us, next slice 30000.0us
28.371354752 --|------x------ d32767v9 sched_switch prev d32767v9 next d3v8
28.371355267 --|------x------ d32767v9 runstate_change d32767v9 running->runnable
(1) 28.371355728 --|------x------ d?v? runstate_change d3v8 runnable->running
............ ................ ...
(2) 28.375501037 -----|||-x----|- d3v8 vcpu_wake d3v5
28.375501540 -----|||-x----|- d3v8 runstate_change d3v5 blocked->runnable
(3) 28.375502300 -----|||-x----|- d3v8 csched:runq_tickle, cpu 8
............ ................ ...
28.375509472 --|--|||x|----|- d32767v8 csched:schedule cpu 8, idle
28.375510682 --|--|||x|----|- d32767v8 sched_switch prev d32767v8, run for 724.165us
28.375511034 --|--|||x|----|- d32767v8 sched_switch next d3v5, was runnable for 7.396us, next slice 30000.0us
28.375511300 --|--|||x|----|- d32767v8 sched_switch prev d32767v8 next d3v5
28.375511640 --|--|||x|----|- d32767v8 runstate_change d32767v8 running->runnable
(4) 28.375512060 --|--|||x|----|- d?v? runstate_change d3v5 runnable->running
............ ................ ...
(5) 28.375624977 ----|-|||x----|- d3v8 csched: d3v8 unboosted
(6) 28.375628208 ----|-|||x----|- d3v8 csched:pick_cpu 11 for d3v8
At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
(4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
with me a little).
At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
part of the Credit1 algorithm, and are used for accounting and load
balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
(6), calls _csched_cpu_pick().
Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
since there are fully idle cores, Credit1 decides that it's better to
kick d3v8 to one of those fully idle cores, so that both d3v5 and d3v8
itself can run at full "core speed". In fact, we see that CPU 11 is
picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
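
To make the heuristic concrete, here's a rough, standalone C sketch of
the kind of decision pick is making (again, this is *not* the real
_csched_cpu_pick(), which is a lot more involved; the sibling() and
idle[] helpers below are simplified stand-ins, assuming hyperthread
pairs (0,1), (2,3), ...):

    #include <stdbool.h>

    #define NR_CPUS 16
    static bool idle[NR_CPUS];

    static int sibling(int cpu) { return cpu ^ 1; }

    /* If 'cpu' shares a core with a busy sibling, and there is a fully
     * idle core around, prefer a CPU of the idle core, so that both
     * vCPUs get a whole core each. */
    static int pick_cpu(int cpu)
    {
        int c;

        if ( idle[sibling(cpu)] )
            return cpu;           /* the core is already all ours */

        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] && idle[sibling(c)] )
                return c;         /* a CPU whose whole core is idle */

        return cpu;               /* no better option: stay put */
    }

In the trace above, d3v8 is on CPU 9, its sibling CPU 8 is busy with
d3v5, and the core made of CPUs 10 and 11 is fully idle, hence the
"pick_cpu 11" at (6).
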
(To be continued, below)
(7) 28.375630686 ----|-|||x----|- d3v8 csched:schedule cpu 9, busy
(*) 28.375631619 ----|-|||x----|- d3v8 csched:load_balance skipping 14
28.375632094 ----|-|||x----|- d3v8 csched:load_balance skipping 8
28.375632612 ----|-|||x----|- d3v8 csched:load_balance skipping 4
28.375633004 ----|-|||x----|- d3v8 csched:load_balance skipping 6
28.375633364 ----|-|||x----|- d3v8 csched:load_balance skipping 7
28.375633960 ----|-|||x------ d3v8 csched:load_balance skipping 8
28.375634470 ----|-|||x------ d3v8 csched:load_balance skipping 4
28.375634820 ----|-|||x------ d3v8 csched:load_balance skipping 6
(**)28.375635067 ----|-|||x------ d3v8 csched:load_balance skipping 7
28.375635560 ----|-|||x------ d3v8 sched_switch prev d3v8, run for 4288.140us
28.375635988 ----|-|||x------ d3v8 sched_switch next d32767v9, was runnable for 4288.140us
28.375636233 ----|-|||x------ d3v8 sched_switch prev d3v8 next d32767v9
28.375636615 ----|-|||x------ d3v8 runstate_change d3v8 running->offline
(8) 28.375637015 ----|-|||x------ d?v? runstate_change d32767v9 runnable->running
28.375638146 ----|-x||------- d3v2 vcpu_block d3v2
............ ................ ...
28.375645627 ----|--||x------ d32767v9 csched:pick_cpu 11 for d3v8
28.375647138 ----|--||x------ d32767v9 vcpu_wake d3v8
28.375647640 ----|--||x------ d32767v9 runstate_change d3v8 offline->runnable
(9) 28.375648353 ----|--||x------ d32767v9 csched:runq_tickle, cpu 11
............ ................ ...
28.375709505 ----|--||--x---- d32767v11 sched_switch prev d32767v11, run for 2320182.912us
28.375709778 ----|--||--x---- d32767v11 sched_switch next d3v8, was runnable for 59.670us, next slice 30000.0us
28.375710001 ----|--||--x---- d32767v11 sched_switch prev d32767v11 next d3v8
28.375710501 ----|--||--x---- d32767v11 runstate_change d32767v11 running->runnable
(10)28.375710858 ----|--||--x---- d?v? runstate_change d3v8 runnable->running
At (7) we see that CPU 9 re-schedules, as a consequence of pick
deciding to migrate d3v8. As a side note, all the "load_balance
skipping xx" lines between (*) and (**) show that work-stealing
attempts are actually prevented on all those CPUs, because each of them
has only one runnable vCPU (i.e., one either running or ready to run).
I.e., my patch works and achieves its goal of avoiding even trying to
steal (which means avoiding having to take a lock!) when there's no
need. :-)
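
FWIW, the core of that optimization is really simple. Here's a sketch
of the idea (not the exact code from the patch; the runqueue structure
and the pthread trylock below are simplified stand-ins for the
scheduler's own data structures and locks):

    /* Don't even try to steal from a pCPU that has at most one
     * runnable vCPU (the one it is running): there is nothing to
     * gain, so don't touch its runqueue lock at all. */
    #include <pthread.h>

    struct runq {
        pthread_mutex_t lock;
        int nr_runnable;      /* running + ready-to-run vCPUs */
    };

    static int steal_from(struct runq *rq)
    {
        int stolen = 0;

        if ( rq->nr_runnable <= 1 )
            return 0;         /* trace: "load_balance skipping N" */

        /* Only now is it worth paying for the (try)lock. */
        if ( pthread_mutex_trylock(&rq->lock) != 0 )
            return 0;

        /* ... pick a suitable vCPU off rq and migrate it here ... */
        rq->nr_runnable--;
        stolen = 1;

        pthread_mutex_unlock(&rq->lock);
        return stolen;
    }
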
Anyway, at (8) d3v8 is gone, and CPU 9 eventually becomes idle. At (9),
another call to pick_cpu() confirms that d3v8 will land on CPU 11, and
at (10) we see it starting to run there. It should be at this point
that the invalidate IPI is sent, which causes the state sync request
(note, in fact, that CPU 9, the 'original' pCPU, is still idle).
Now, this is just _one_ example, but I am quite convinced that this may
actually be one of the most prominent causes of the behavior Jan
reported.
The problem, as I was expecting, is not work stealing; the problem is,
well... Credit1! :-/
In fact, when d3v5 wakes up, why is CPU 8 tickled at point (3),
instead of, for instance, CPU 10 (or 11, or 12, or 13)? CPU 8 and CPU 9
are hyperthread siblings, and CPU 9 is busy, so it would have been
better to try to leave CPU 8 alone. And that would have been possible,
as both the core formed by CPUs 10 and 11 and the one formed by CPUs 12
and 13 are fully idle.
Well, the point is that tickling in Credit1 does not check/consider
hyperthreading. Can it then start doing so? Not easily, IMO, and only
at an added cost --which would be paid on the vCPU wakeup path (which
is already quite convoluted and complex in that scheduler).
Credit2, for instance, does not suffer from this issue: there,
hyperthreading is already taken into account during wakeup/tickling.
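
Just to show what "considering hyperthreading during wakeup/tickling"
means in practice, here's a rough standalone C sketch (not the actual
Credit2 code; same simplified sibling()/idle[] helpers as in the pick
sketch above) of the preference order a wakeup could use:

    #include <stdbool.h>

    #define NR_CPUS 16
    static bool idle[NR_CPUS];

    static int sibling(int cpu) { return cpu ^ 1; }

    static int tickle_target(void)
    {
        int c;

        /* First choice: an idle CPU whose sibling is idle too, so
         * the woken vCPU gets a full core to itself. */
        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] && idle[sibling(c)] )
                return c;

        /* Fallback: any idle CPU, even with a busy sibling (roughly
         * what Credit1's tickling ends up doing today). */
        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] )
                return c;

        return -1;            /* nobody to tickle */
    }

With something like this, d3v5's wakeup at (2)-(3) would have tickled
CPU 10 (or 11, 12, 13) rather than CPU 8, leaving CPU 9's core alone.
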
Hope this helps clarify things a bit,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/