Date: Thu, 23 Apr 2015 10:17:36 -0300
From: Eduardo Habkost
Message-ID: <20150423131736.GA17796@thinpad.lan.raisama.net>
References: <1427131923-4670-1-git-send-email-afaerber@suse.de> <5523D0FF.7090609@de.ibm.com> <20150423073233.GB26536@voom.redhat.com>
In-Reply-To: <20150423073233.GB26536@voom.redhat.com>
Subject: Re: [Qemu-devel] cpu modelling and hotplug (was: [PATCH RFC 0/4] target-i386: PC socket/core/thread modeling, part 1)
To: David Gibson
Cc: Peter Maydell, Bharata B Rao, qemu-devel@nongnu.org, Alexander Graf,
    Christian Borntraeger, "Jason J. Herne", Paolo Bonzini, Cornelia Huck,
    Igor Mammedov, Andreas Färber

On Thu, Apr 23, 2015 at 05:32:33PM +1000, David Gibson wrote:
> On Tue, Apr 07, 2015 at 02:43:43PM +0200, Christian Borntraeger wrote:
> > We had a call and I was asked to write a summary about our conclusion.
> > 
> > The more I wrote, the more I became uncertain whether we really came to a
> > conclusion, and the more certain I became that we want to define the
> > QMP/HMP/CLI interfaces first (or quite early in the process).
> > 
> > As discussed, I will provide an initial document as a discussion starter.
> > 
> > So here is my current understanding, with each piece of information on one
> > line, so that everybody can correct me or make additions:
> > 
> > current wrap-up of architecture support
> > -------------------
> > x86
> > - Topology possible
> >   - can be hierarchical
> >   - interfaces to query topology
> > - SMT: fanout in host, guest uses host threads to back guest vCPUs
> > - supports cpu hotplug via cpu_add
> > 
> > power
> > - Topology possible
> >   - interfaces to query topology?
> 
> For power, topology information is communicated via the
> "ibm,associativity" (and related) properties in the device tree. This
> can encode hierarchical topologies, but it is *not* bound to the
> socket/core/thread hierarchy. On the guest side on Power there's no
> real notion of "socket", just cores with specified proximities to
> various memory nodes.
> 
> > - SMT: Power8: no threads in host and full core passed in due to HW design;
> >   may change in the future
> > 
> > s/390
> > - Topology possible
> >   - can be hierarchical
> >   - interfaces to query topology
> > - always virtualized via PR/SM LPAR
> > - host topology from LPAR can be heterogeneous (e.g. 3 cpus in 1st socket, 4 in 2nd)
> > - SMT: fanout in host, guest uses host threads to back guest vCPUs
> > 
> > 
> > Current downsides of CPU definitions/hotplug
> > -----------------------------------------------
> > - -smp sockets=,cores=,threads= builds only a homogeneous topology
> > - cpu_add does not tell where to add
> > - artificial icc bus construct on x86 for several reasons (link, sysbus not hotpluggable..)
> 
> Artificial though it may be, I think having a "cpus" pseudo-bus is not
> such a bad idea

That was considered before[1][2]. We have use cases for adding additional
information about VCPUs to query-cpus, but we could simply use qom-get for
that. The only thing missing is a predictable QOM path for VCPU objects.
If we provide something like "/cpus/" links on all machines, callers could
simply use qom-get to get just the information they need, instead of getting
too much information from query-cpus (which also has the side-effect of
interrupting all running VCPUs to synchronize register information).

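For illustration, the difference could look something like this (the
"/machine/cpus/0" path and the "halted" property are made-up examples here,
no such canonical path exists today):

  -> { "execute": "qom-get",
       "arguments": { "path": "/machine/cpus/0", "property": "halted" } }
  <- { "return": false }

whereas query-cpus returns an entry with register and thread state for every
VCPU, whether the caller needs it or not:

  -> { "execute": "query-cpus" }
  <- { "return": [ { "CPU": 0, "current": true, "halted": false,
                     "thread_id": 3134, ... }, ... ] }
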
Quoting part of your proposal below:

> Ignoring NUMA topology (I'll come back to that in a moment) qemu
> should really only care about two things:
> 
> a) the unit of execution scheduling (a vCPU or "thread")
> b) the unit of plug/unplug
> 
> [...]
> 
> 3) I'm thinking we'd have a "cpus" virtual bus represented in QOM,
> which would contain the vCMs (also QOM objects). Their existence
> would be generic, though we'd almost certainly use arch and/or machine
> specific subtypes.
> 
> 4) There would be a (generic) way of finding the vCPUs (threads) in a
> vCM and the vCM for a specific vCPU.

What I propose now is a bit simpler: just a mechanism for enumerating
VCPUs/threads (a), which would replace query-cpus. Later we could also have
a generic mechanism for (b), if we decide to introduce a generic "CPU
module" abstraction for plug/unplug.

A more complex mechanism for enumerating vCMs and the vCPUs inside a vCM
would be a superset of (a), so in theory we wouldn't need both. But I
believe that: 1) we will take some time to define the details of the
vCM/plug/unplug abstractions; and 2) we already have use cases today[2]
that could benefit from a generic QOM path for (a).

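To make the enumeration part of (a) concrete, here is a minimal sketch that
reuses the existing qom-list command on a hypothetical "/machine/cpus"
container (the container path and the per-CPU link names are assumptions,
not an existing interface):

  -> { "execute": "qom-list", "arguments": { "path": "/machine/cpus" } }
  <- { "return": [ { "name": "0", "type": "link<cpu>" },
                   { "name": "1", "type": "link<cpu>" },
                   { "name": "type", "type": "string" } ] }

A client would then follow each link with qom-get to fetch only the
properties it actually needs.
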
[1] Message-ID: <20140516151641.GY3302@otherpad.lan.raisama.net>
    http://article.gmane.org/gmane.comp.emulators.qemu/273463
[2] Message-ID: <20150331131623.GG7031@thinpad.lan.raisama.net>
    http://article.gmane.org/gmane.comp.emulators.kvm.devel/134625

> > 
> > discussions
> > -------------------
> > - we want to be able to (most important question, IMHO)
> >   - hotplug CPUs on power/x86/s390 and maybe others
> >   - define topology information
> >   - bind the guest topology to the host topology in some way
> >     - to host nodes
> >     - maybe also for gang scheduling of threads (might face reluctance
> >       from the linux scheduler folks)
> >     - not really deeply outlined in this call
> > - QOM links must be allocated at boot time, but can be set later on
> > - nothing that we want to expose to users
> > - Machine provides QOM links that the device_add hotplug mechanism can
> >   use to add new CPUs into preallocated slots. "CPUs" can be groups of
> >   cores and/or threads.
> > - hotplug and initial config should use same semantics
> > - cpu and memory topology might be somewhat independent
> >   --> - define nodes
> >       - map CPUs to nodes
> >       - map memory to nodes
> > 
> > - hotplug per
> >   - socket
> >   - core
> >   - thread
> >   ?
> > Now comes the part where I am not sure if we came to a conclusion or not:
> > - hotplug/definition per core (but not per thread) seems to handle all cases
> >   - core might have multiple threads (and thus multiple cpustates)
> >   - as device statement (or object?)
> > - mapping of cpus to nodes or defining the topology not really
> >   outlined in this call
> > 
> > To be defined:
> > - QEMU command line for initial setup
> > - QEMU hmp/qmp interfaces for dynamic setup
> 
> So, I can't say I've entirely got my head around this, but here are my
> thoughts so far.
> 
> I think the basic problem here is that the fixed socket -> core ->
> thread hierarchy is something from x86 land that's become integrated
> into qemu's generic code where it doesn't entirely make sense.
> 
> Ignoring NUMA topology (I'll come back to that in a moment) qemu
> should really only care about two things:
> 
> a) the unit of execution scheduling (a vCPU or "thread")
> b) the unit of plug/unplug
> 
> Now, returning to NUMA topology. What the guest, and therefore qemu,
> really needs to know is the relative proximity of each thread to each
> block of memory. That usually forms some sort of node hierarchy, but
> it doesn't necessarily correspond to a socket->core->thread hierarchy
> you can see in physical units.
> 
> On Power, an arbitrary NUMA node hierarchy can be described in the
> device tree without reference to "cores" or "sockets", so really qemu
> has no business even talking about such units.
> 
> IIUC, on x86 the NUMA topology is bound up with the socket->core->thread
> hierarchy, so it needs to have a notion of those layers, but ideally
> that would be specific to the pc machine type.
> 
> So, here's what I'd propose:
> 
> 1) I think we really need some better terminology to refer to the unit
> of plug/unplug. Until someone comes up with something better, I'm
> going to use "CPU Module" (CM), to distinguish from the NUMA baggage
> of "socket" and also to refer more clearly to the thing that goes into
> the socket, rather than the socket itself.
> 
> 2) A Virtual CPU Module (vCM) need not correspond to a real physical
> object. For machine types which we want to faithfully represent a
> specific physical machine, it would. For generic or pure virtual
> machines, the vCMs would be as small as possible. So for current
> Power they'd be one virtual core; for future Power (maybe) or s390, a
> single virtual thread. For x86 I'm not sure what they'd be.
> 
> 3) I'm thinking we'd have a "cpus" virtual bus represented in QOM,
> which would contain the vCMs (also QOM objects). Their existence
> would be generic, though we'd almost certainly use arch and/or machine
> specific subtypes.
> 
> 4) There would be a (generic) way of finding the vCPUs (threads) in a
> vCM and the vCM for a specific vCPU.
> 
> 5) A vCM *might* have internal subdivisions into "cores" or "nodes" or
> "chips" or "MCMs" or whatever, but that would be up to the machine
> type specific code, and not represented in the QOM hierarchy.
> 
> 6) Obviously we'd need some backwards compat goo to sort out existing
> command line options referring to cores and sockets into the new
> representation. This will need machine type specific hooks - so for
> x86 it would need to set up the right vCM subdivisions and make sure
> the right NUMA topology info goes into ACPI. For -machine pseries I'm
> thinking that "-smp sockets=2,cores=1,threads=4" and "-smp
> sockets=1,cores=2,threads=4" should result in exactly the same thing
> internally.
> 
> Thoughts?
> 
> -- 
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson

-- 
Eduardo