From: Juergen Gross <jgross@suse.com>
To: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <elena.ufimtseva@oracle.com>,
Wei Liu <wei.liu2@citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
David Vrabel <david.vrabel@citrix.com>,
Jan Beulich <JBeulich@suse.com>,
"xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>
Subject: Re: PV-vNUMA issue: topology is misinterpreted by the guest
Date: Fri, 24 Jul 2015 17:14:24 +0200
Message-ID: <55B25650.4030402@suse.com>
In-Reply-To: <1437749076.4682.47.camel@citrix.com>
On 07/24/2015 04:44 PM, Dario Faggioli wrote:
> On Fri, 2015-07-24 at 12:28 +0200, Juergen Gross wrote:
>> On 07/23/2015 04:07 PM, Dario Faggioli wrote:
>
>>> FWIW, I was thinking that the kernel were a better place, as Juergen is
>>> saying, while now I'm more convinced that tools would be more
>>> appropriate, as Boris is saying.
>>
>> I've collected some information from the linux kernel sources as a base
>> for the discussion:
>>
> That's great, thanks for this!
>
>> The complete numa information (cpu->node and memory->node relations) is
>> taken from the acpi tables (srat, slit for "distances").
>>
> Ok. And I already have a question (as I lost track of things a bit).
> What you just said about ACPI tables is certainly true for baremetal and
> HVM guests, but for PV? At the time I was looking into it, together with
> Elena, there were Linux patches being produced for the PV case, which
> makes sense.
> However, ISTR that both Wei and Elena mentioned recently that those
> patches have not been upstreamed in Linux yet... Is that the case? Maybe
> not all, but at least some of them are there? Because if not, I'm not
> sure I see how a PV guest would even see a vNUMA topology (which it
> does).
>
> Of course, I can go and check, but since you just looked, you may have
> it fresh and clear already. :-)
I checked "bottom up", so when I found the acpi scan stuff I stopped
searching how the kernel obtains numa info. During my search I found no
clue of an pv-numa stuff in the kernel. And a quick "grep -i numa" in
arch/x86/xen and drivers/xen didn't reveal anything. Same for a complete
kernel source search for "vnuma".
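(For reference, the quick searches I mean were roughly:

    grep -ri numa arch/x86/xen drivers/xen
    grep -ri vnuma .

run from the top of the kernel tree; neither turned up anything
Xen/PV specific.)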
>
>> The topology information is obtained via:
>> - intel:
>> + cpuid leaf b with subleafs, leaf 4
>> + cpuid leaf 2 and/or leaf 1 if leaf b and/or 4 isn't available
>> - amd:
>> + cpuid leaf 8000001e, leaf 8000001d, leaf 4
>> + msr c001100c
>> + cpuid leaf 2 and/or leaf 1 if leaf b and/or 4 isn't available
>>
>> The scheduler is aware of:
>> - smt siblings (from topology)
>> - last-level-cache siblings (from topology)
>> - node siblings (from numa information)
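Just to make the leaf 0xb part concrete, this is roughly the kind of
information the kernel derives from it (user space sketch, Intel
semantics only, untested and purely for illustration):

    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx, sub;

        for (sub = 0; ; sub++) {
            unsigned int type;

            __cpuid_count(0xb, sub, eax, ebx, ecx, edx);
            type = (ecx >> 8) & 0xff;       /* 1 = SMT, 2 = core */
            if (type == 0)
                break;
            printf("subleaf %u: level type %u, x2apic shift %u, "
                   "%u logical cpus at this level\n",
                   sub, type, eax & 0x1f, ebx & 0xffff);
        }
        return 0;
    }

The shift values are what ends up defining which cpus count as smt
resp. core siblings.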
>>
> Right. So, this confirms what we were guessing: we need to "reconcile"
> these two sources of information (from the guest point of view).
>
> Both the 'in kernel' and 'in toolstack' approach should have all the
> necessary information to make things match, I think. In fact, in
> toolstack, we know what the vNUMA topology is (we're parsing and
> actually putting it in place!). In kernel, we know it as we read it from
> tables or hypercalls (isn't that so, for PV guest?).
>
> In fact, I think that it is the topology, i.e., what comes from MSRs,
> that needs to adapt, and follow vNUMA, as much as possible. Do we agree
> on this?
I think we have to be very careful here. I see two possible scenarios:
1) The vcpus are not pinned 1:1 on physical cpus. The hypervisor will
try to schedule the vcpus according to their numa affinity. So they
can change pcpus at any time in case of very busy guests. I don't
    think the Linux kernel should treat the cpus differently in this
    case, as doing so would be in vain given the Xen scheduler's
    activity. So we should use the "null" topology (all vcpus
    unrelated to each other) in this case.
2) The vcpus of the guest are all pinned 1:1 to physical cpus. The Xen
    scheduler can't move vcpus between pcpus, so the Linux kernel should
    see the real topology of the underlying pcpus in order to optimize
    for that layout.
This only covers the scheduling aspect, of course.
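For case 2) I'm thinking of configurations along these lines (purely
illustrative, using the xl.cfg list form of "cpus" for per-vcpu
pinning; the vnuma part would be whatever the toolstack interface
ends up looking like):

    vcpus = 4
    cpus = ["8", "9", "10", "11"]   # vcpu 0 -> pcpu 8, vcpu 1 -> pcpu 9, ...
    # plus the guest's vnuma description

Only with something like that in place does exposing the real pcpu
topology to the guest make sense.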
>
> IMO, the thing boils down to these:
>
> 1) from where (kernel vs. toolstack) is it the most easy and effective
> to enact the CPUID fiddling? As in, can we do that in toolstack?
> (Andrew was not so sure, and Boris found issues, although Jan seems
> to think they're no show stopper.)
> I'm quite certain that we can do that from inside the kernel,
> although, how early would we need to be doing it? Do we have the
> vNUMA info already?
>
> 2) when tweaking the values of CPUID and other MSRs, are there other
> vNUMA (and topology in general) constraints and requirements we
> should take into account? For instance, do we want, for licensing
> reasons, all (or most) of the vcpus to be siblings, rather than full
> sockets? Etc.
> 2a) if yes, how and where are these constraints specified?
>
> If looking at 1) only, it still looks to me that doing things within the
> kernel would be the way to go.
>
> When looking at 2), OTOH, toolstacks variants start to be more
> appealing. Especially depending on our answer to 2a). In fact,
> in case we want to give the user a way to specify this
> siblings-vs-cores-vs-sockets information, it IMO would be good to deal
> with that in tools, rather than having to involve Xen or Linux!
>
>> It will especially move tasks from one cpu to another first between smt
>> siblings, second between llc siblings, third between node siblings and
>> last all cpus.
>>
> Yep, this part, I knew.
>
> Maybe, there is room for "fixing" this at this level, hooking up inside
> the scheduler code... but I'm shooting in the dark here, without having
> checked whether and how this could really be feasible. Should I?
Uuh, I don't think a Xen-specific change to the scheduler would
really be appreciated. :-)
I'd rather fiddle with the cpu masks on the different levels to let the
scheduler do the right thing.
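Something along these lines, maybe (pure illustration, not a real
patch; the helper name is made up and the per-cpu maps are only meant
to stand for whatever the right hook turns out to be):

    /*
     * Illustrative sketch: make every vcpu its own smt/core "island"
     * so the scheduler builds no sibling domains for an unpinned
     * PV guest.
     */
    static void xen_pv_flatten_sched_topology(void)
    {
        unsigned int cpu;

        for_each_possible_cpu(cpu) {
            cpumask_clear(topology_thread_cpumask(cpu));
            cpumask_set_cpu(cpu, topology_thread_cpumask(cpu));
            cpumask_clear(topology_core_cpumask(cpu));
            cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
        }
    }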
> One thing I don't like about this approach is that it would potentially
> solve vNUMA and other scheduling anomalies, but...
>
>> cpuid instruction is available for user mode as well.
>>
> ...it would not do any good for other subsystems, and user level code
> and apps.
Indeed. I think the optimal solution would be two-fold: give the
scheduler the information it needs to react correctly via a kernel
patch that doesn't rely on cpuid values, and adjust the cpuid values
from the Xen tools according to the needs of other subsystems and/or
user code (e.g. licensing).
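For the latter part, the existing cpuid= option of xl.cfg should
already get us quite far, e.g. something like (illustrative, flag
names as understood by libxl's cpuid parser):

    # hide hyperthreading from the guest
    cpuid = "host,htt=0"

with anything more elaborate (core/socket counts etc.) to be added as
needed.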
Juergen