From: Dario Faggioli <dfaggioli@suse.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>,
David Woodhouse <dwmw@amazon.co.uk>,
Xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH v10 09/11] x86/ctxt: Issue a speculation barrier between vcpu contexts
Date: Sat, 27 Jan 2018 02:27:49 +0100
Message-ID: <1517016469.15341.153.camel@suse.com>
In-Reply-To: <be65dc15-ab25-8522-8720-1ab64c6ef959@citrix.com>
> On 25/01/18 16:09, Andrew Cooper wrote:
> > On 25/01/18 15:57, Jan Beulich wrote:
> > > > > >
> > > For the record, the overwhelming majority of calls to
> > > __sync_local_execstate() being responsible for the behavior
> > > come from invalidate_interrupt(), which suggests to me that
> > > there's a meaningful number of cases where a vCPU is migrated
> > > to another CPU and then back, without another vCPU having
> > > run on the original CPU in between. If I'm not wrong with this,
> > > I have to question why the vCPU is migrated then in the first
> > > place.
> >
So, about this. I haven't applied Jan's measurement patch yet (I'm
doing some reshuffling of my dev and test hardware here), but I have
had a look at the traces.
So, Jan, a question: why are you saying "migrated to another CPU **and
then back**"? I'm asking because, AFAICT, the fact that
__sync_local_execstate() is called from invalidate_interrupt() means
that:
* a vCPU is running on a pCPU;
* the vCPU is migrated, and the pCPU becomes idle;
* the vCPU starts to run where it was migrated to, while its 'original'
  pCPU is still idle ==> inv. IPI ==> sync state.
So there seems to me to be no need for the vCPU to actually "go back",
is there?
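
Just to be sure we're talking about the same thing, here's a minimal,
self-contained C sketch of the mechanism I have in mind (this is *not*
the actual Xen code; the struct and function names below are made up
purely for illustration): the 'original' pCPU goes idle but keeps the
vCPU's state loaded lazily, and when the vCPU starts running on its
new pCPU, the invalidate IPI handler on the original pCPU has to sync
that stale state out first.

    /* Simplified illustration of lazy state sync on the original pCPU.
     * NOT the Xen source; all the names here are made up. */
    #include <stdbool.h>
    #include <stdio.h>

    struct pcpu {
        int curr_vcpu;   /* vCPU whose state is (lazily) loaded here  */
        bool idle;       /* pCPU is idle, but state may still be lazy */
    };

    static struct pcpu cpus[16];

    /* Runs on the original pCPU when it receives the invalidate IPI,
     * sent because its (now migrated) vCPU started running elsewhere. */
    static void invalidate_ipi_handler(int cpu, int migrating_vcpu)
    {
        if ( cpus[cpu].idle && cpus[cpu].curr_vcpu == migrating_vcpu )
        {
            /* The stale, lazily kept state must be synced out before
             * the vCPU can safely run on its new pCPU. */
            printf("CPU%d: syncing state of vCPU %d\n", cpu, migrating_vcpu);
            cpus[cpu].curr_vcpu = -1;
        }
    }
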
Anyway, looking at traces, I observed the following:
28.371352689 --|------x------ d32767v9 csched:schedule cpu 9, idle
28.371354095 --|------x------ d32767v9 sched_switch prev d32767v9, run for 3412.789us
28.371354475 --|------x------ d32767v9 sched_switch next d3v8, was runnable for 59.917us, next slice 30000.0us
28.371354752 --|------x------ d32767v9 sched_switch prev d32767v9 next d3v8
28.371355267 --|------x------ d32767v9 runstate_change d32767v9 running->runnable
(1) 28.371355728 --|------x------ d?v? runstate_change d3v8 runnable->running
............ ................ ...
(2) 28.375501037 -----|||-x----|- d3v8 vcpu_wake d3v5
28.375501540 -----|||-x----|- d3v8 runstate_change d3v5 blocked->runnable
(3) 28.375502300 -----|||-x----|- d3v8 csched:runq_tickle, cpu 8
............ ................ ...
28.375509472 --|--|||x|----|- d32767v8 csched:schedule cpu 8, idle
28.375510682 --|--|||x|----|- d32767v8 sched_switch prev d32767v8, run for 724.165us
28.375511034 --|--|||x|----|- d32767v8 sched_switch next d3v5, was runnable for 7.396us, next slice 30000.0us
28.375511300 --|--|||x|----|- d32767v8 sched_switch prev d32767v8 next d3v5
28.375511640 --|--|||x|----|- d32767v8 runstate_change d32767v8 running->runnable
(4) 28.375512060 --|--|||x|----|- d?v? runstate_change d3v5 runnable->running
............ ................ ...
(5) 28.375624977 ----|-|||x----|- d3v8 csched: d3v8 unboosted
(6) 28.375628208 ----|-|||x----|- d3v8 csched:pick_cpu 11 for d3v8
At (1) d3v8 starts running on CPU 9. Then, at (2), d3v5 wakes up, and
at (3) CPU 8 (which is idle) is tickled, as a consequence of that. At
(4), CPU 8 picks up d3v5 and runs it (this may seem unrelated, but bear
with me a little).
At (5), a periodic tick arrives on CPU 9. Periodic ticks are a core
part of the Credit1 algorithm, and are used for accounting and load
balancing. In fact, csched_tick() calls csched_vcpu_acct() which, at
(6), calls _csched_cpu_pick().
Pick realizes that d3v8 is running on CPU 9, and that CPU 8 is also
busy. Now, since CPU 8 and 9 are hyperthreads of the same core, and
since there are fully idle cores, Credit1 decides that it's better to
kick d3v8 to one of those fully idle cores, so that both d3v5 and d3v8
itself can run at full "core speed". In fact, we see that CPU 11 is
picked, as both the hyperthreads --CPU 10 and CPU 11 itself-- are idle.
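
To make the heuristic concrete, here's a rough, standalone C sketch of
the kind of decision pick is making (again, this is *not* the real
_csched_cpu_pick(), which is a lot more involved; the sibling() and
idle[] helpers below are simplified stand-ins, assuming hyperthread
pairs (0,1), (2,3), ...):

    #include <stdbool.h>

    #define NR_CPUS 16
    static bool idle[NR_CPUS];

    static int sibling(int cpu) { return cpu ^ 1; }

    /* If 'cpu' shares a core with a busy sibling, and there is a fully
     * idle core around, prefer a CPU of the idle core, so that both
     * vCPUs get a whole core each. */
    static int pick_cpu(int cpu)
    {
        int c;

        if ( idle[sibling(cpu)] )
            return cpu;           /* the core is already all ours */

        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] && idle[sibling(c)] )
                return c;         /* a CPU whose whole core is idle */

        return cpu;               /* no better option: stay put */
    }

In the trace above, d3v8 is on CPU 9, its sibling CPU 8 is busy with
d3v5, and the core made of CPUs 10 and 11 is fully idle, hence the
"pick_cpu 11" at (6).
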
(To be continued, below)
(7) 28.375630686 ----|-|||x----|- d3v8 csched:schedule cpu 9, busy
(*) 28.375631619 ----|-|||x----|- d3v8 csched:load_balance skipping 14
28.375632094 ----|-|||x----|- d3v8 csched:load_balance skipping 8
28.375632612 ----|-|||x----|- d3v8 csched:load_balance skipping 4
28.375633004 ----|-|||x----|- d3v8 csched:load_balance skipping 6
28.375633364 ----|-|||x----|- d3v8 csched:load_balance skipping 7
28.375633960 ----|-|||x------ d3v8 csched:load_balance skipping 8
28.375634470 ----|-|||x------ d3v8 csched:load_balance skipping 4
28.375634820 ----|-|||x------ d3v8 csched:load_balance skipping 6
(**)28.375635067 ----|-|||x------ d3v8 csched:load_balance skipping 7
28.375635560 ----|-|||x------ d3v8 sched_switch prev d3v8, run for 4288.140us
28.375635988 ----|-|||x------ d3v8 sched_switch next d32767v9, was runnable for 4288.140us
28.375636233 ----|-|||x------ d3v8 sched_switch prev d3v8 next d32767v9
28.375636615 ----|-|||x------ d3v8 runstate_change d3v8 running->offline
(8) 28.375637015 ----|-|||x------ d?v? runstate_change d32767v9 runnable->running
28.375638146 ----|-x||------- d3v2 vcpu_block d3v2
............ ................ ...
28.375645627 ----|--||x------ d32767v9 csched:pick_cpu 11 for d3v8
28.375647138 ----|--||x------ d32767v9 vcpu_wake d3v8
28.375647640 ----|--||x------ d32767v9 runstate_change d3v8 offline->runnable
(9) 28.375648353 ----|--||x------ d32767v9 csched:runq_tickle, cpu 11
............ ................ ...
28.375709505 ----|--||--x---- d32767v11 sched_switch prev d32767v11, run for 2320182.912us
28.375709778 ----|--||--x---- d32767v11 sched_switch next d3v8, was runnable for 59.670us, next slice 30000.0us
28.375710001 ----|--||--x---- d32767v11 sched_switch prev d32767v11 next d3v8
28.375710501 ----|--||--x---- d32767v11 runstate_change d32767v11 running->runnable
(10)28.375710858 ----|--||--x---- d?v? runstate_change d3v8 runnable->running
At (7) we see that CPU 9 re-schedules, as a consequence of pick
deciding to migrate d3v8. As a side note, all the "load_balance
skipping xx" lines between (*) and (**) show that work-stealing
attempts are actually prevented on all those CPUs, because each of them
has only one runnable vCPU (i.e., one either running or ready to run).
I.e., my patch works and achieves its goal of avoiding even trying to
steal (which means avoiding having to take a lock!) when there's no
need. :-)
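
FWIW, the core of that optimization is really simple. Here's a sketch
of the idea (not the exact code from the patch; the runqueue structure
and the pthread trylock below are simplified stand-ins for the
scheduler's own data structures and locks):

    /* Don't even try to steal from a pCPU that has at most one
     * runnable vCPU (the one it is running): there is nothing to
     * gain, so don't touch its runqueue lock at all. */
    #include <pthread.h>

    struct runq {
        pthread_mutex_t lock;
        int nr_runnable;      /* running + ready-to-run vCPUs */
    };

    static int steal_from(struct runq *rq)
    {
        int stolen = 0;

        if ( rq->nr_runnable <= 1 )
            return 0;         /* trace: "load_balance skipping N" */

        /* Only now is it worth paying for the (try)lock. */
        if ( pthread_mutex_trylock(&rq->lock) != 0 )
            return 0;

        /* ... pick a suitable vCPU off rq and migrate it here ... */
        rq->nr_runnable--;
        stolen = 1;

        pthread_mutex_unlock(&rq->lock);
        return stolen;
    }
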
Anyway, at (8) d3v8 is gone, and CPU 9 eventually becomes idle. At (9),
another call to pick_cpu() confirms that d3v8 will land on CPU 11, and
at (10) we see it starting to run there. It should be at this point
that the invalidate IPI is sent, which causes the state sync request
(note, in fact, that CPU 9, the 'original' pCPU, is still idle).
Now, this is just _one_ example, but I am quite convinced that this may
actually be one of the most prominent causes of the behavior Jan
reported.
The problem, as I was expecting, is not work stealing; the problem is,
well... Credit1! :-/
In fact, when d3v5 wakes up, why is CPU 8 tickled at point (3),
instead of, for instance, CPU 10 (or 11, or 12, or 13)? CPU 8 and CPU 9
are hyperthread siblings, and CPU 9 is busy, so it would have been
better to try to leave CPU 8 alone. And that would have been possible,
as both the core formed by CPUs 10 and 11 and the one formed by CPUs 12
and 13 are fully idle.
Well, the point is that tickling in Credit1 does not check/consider
hyperthreading. Can it then start doing so? Not easily, IMO, and only
at an added cost --which would be paid on the vCPU wakeup path (which
is already quite convoluted and complex in that scheduler).
Credit2, for instance, does not suffer from this issue: there,
hyperthreading is already taken into account during wakeup/tickling.
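
Just to show what "considering hyperthreading during wakeup/tickling"
means in practice, here's a rough standalone C sketch (not the actual
Credit2 code; same simplified sibling()/idle[] helpers as in the pick
sketch above) of the preference order a wakeup could use:

    #include <stdbool.h>

    #define NR_CPUS 16
    static bool idle[NR_CPUS];

    static int sibling(int cpu) { return cpu ^ 1; }

    static int tickle_target(void)
    {
        int c;

        /* First choice: an idle CPU whose sibling is idle too, so
         * the woken vCPU gets a full core to itself. */
        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] && idle[sibling(c)] )
                return c;

        /* Fallback: any idle CPU, even with a busy sibling (roughly
         * what Credit1's tickling ends up doing today). */
        for ( c = 0; c < NR_CPUS; c++ )
            if ( idle[c] )
                return c;

        return -1;            /* nobody to tickle */
    }

With something like this, d3v5's wakeup at (2)-(3) would have tickled
CPU 10 (or 11, 12, 13) rather than CPU 8, leaving CPU 9's core alone.
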
Hope this helps clarify things a bit,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Software Engineer @ SUSE https://www.suse.com/