From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Mon, 01 Dec 2008 16:29:04 +0200
Message-ID: <4933F4B0.7040500@redhat.com>
References: <492F1DD9.8030901@amd.com> <49318A10.7080801@redhat.com> <4933F177.5040802@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm@vger.kernel.org, "Daniel P. Berrange", Andi Kleen
To: Andre Przywara
Return-path:
Received: from mx2.redhat.com ([66.187.237.31]:42812 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752340AbYLAO3L (ORCPT ); Mon, 1 Dec 2008 09:29:11 -0500
In-Reply-To: <4933F177.5040802@amd.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

Andre Przywara wrote:
> Avi Kivity wrote:
>> Andre Przywara wrote:
>>> The user (or better: management application) specifies the host nodes
>>> the guest should use: -nodes 2,3 would create a two node guest mapped
>>> to node 2 and 3 on the host. These numbers are handed over to libnuma:
>>> VCPUs are pinned to the nodes and the allocated guest memory is bound
>>> to its respective node. Since libnuma seems not to be installed
>>> everywhere, the user has to enable this via configure --enable-numa.
>>> In the BIOS code an ACPI SRAT table was added, which describes the NUMA
>>> topology to the guest. The number of nodes is communicated via the CMOS
>>> RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me.
>>
>> There exists now a firmware interface in qemu for this kind of
>> communications.
> Oh, right you are, I missed that (was well hidden). I was looking at
> how the BIOS detects memory size and CPU numbers and these methods are
> quite cumbersome. Why not convert them to the FW_CFG methods (which
> the qemu side already sets)? To not diverge too much from the original
> BOCHS BIOS?
>

Mostly. Also, no one felt the urge.

>>> Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
>>> parameter reverts to the old behavior.
>>
>> '-nodes' is too generic a name ('node' could also mean a host).
>> Suggest -numanode.
>>
>> Need more flexibility: specify the range of memory per node, which
>> cpus are in the node, relative weights for the SRAT table:
>>
>> -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3
>
> I converted my code to use the new firmware interface. This also makes
> it possible to pass more information between qemu and BIOS (which
> prevented a more flexible command line in the first version).
> So I would opt for the following:
> - use numanode (or simply numa?) instead of the misleading -nodes
> - allow passing memory sizes, VCPU subsets and host CPU pin info
> I would prefer Daniel's version:
> -numa [,mem:[;...]]
>       [,cpu:[;...]]
>       [,pin:[;...]]
>
> That would allow easy things like -numa 2 (for a two guest node), not
> given options would result in defaults (equally split-up resources).
>

Yes, that looks good.

> The only problem is the default option for the host side, as libnuma
> requires to explicitly name the nodes. Maybe make the pin: part _not_
> optional? I would at least want to pin the memory, one could discuss
> about the VCPUs...
>

If you can bench it, that would be best. My guess is that we would
need to pin the vcpus.

>> Change host nodes dynamically:
> Implementing a monitor interface is a good idea.
>> (qemu) numanode 1 0
> Does that include page migration? That would be easily possible with
> mbind(MPOL_MF_MOVE), but would take some time and resources (which I
> think is OK if explicitly triggered in the monitor).

Yes, that's the main interest. Allow management to load balance numa
nodes (as Linux doesn't do so automatically for long running processes).

> Any other useful commands for the monitor? Maybe (temporary) VCPU
> migration without page migration?

Right now vcpu migration is done externally (we export the thread IDs so
management can pin them as it wishes). If we add numa support, I think
it makes sense to do it internally as well.

I suggest using the same syntax for the monitor as for the command line;
that's simplest to learn and to implement.

-- 
error compiling committee.c: too many arguments to function
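
To illustrate the "same syntax for monitor and command line" idea, here is
a minimal Python sketch of a single parser that both entry points could
share. It assumes the key=value form from the -numanode example quoted
above (node, cpu, start, size, hostnode). The names NumaNode,
parse_numanode and parse_size are hypothetical and are not part of qemu,
and the K/M/G size suffixes are an assumption based on the "1G" in that
example.

```python
# Hypothetical sketch, not qemu code: one parser for the key=value form
#   node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3
# that a command-line option and a monitor command could both call.
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed binary size suffixes (the thread only shows "1G").
_SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Convert '1G', '512M' or a plain byte count into bytes."""
    suffix = text[-1:].upper()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)


@dataclass
class NumaNode:
    node: int                                      # guest node number
    cpus: List[int] = field(default_factory=list)  # guest VCPUs in the node
    start: Optional[int] = None                    # start of memory range, bytes
    size: Optional[int] = None                     # size of memory range, bytes
    hostnode: Optional[int] = None                 # host node to bind to


def parse_numanode(arg: str) -> NumaNode:
    """Parse one comma-separated key=value description of a guest node."""
    fields = {}
    cpus = []
    for item in arg.split(","):
        key, sep, value = item.partition("=")
        if not sep or not value:
            raise ValueError(f"expected key=value, got {item!r}")
        if key == "cpu":
            cpus.append(int(value))        # cpu= may be repeated
        elif key in ("node", "hostnode"):
            fields[key] = int(value)
        elif key in ("start", "size"):
            fields[key] = parse_size(value)
        else:
            raise ValueError(f"unknown key {key!r}")
    if "node" not in fields:
        raise ValueError("missing node=")
    return NumaNode(cpus=cpus, **fields)
```

With such a parser, a monitor command that moves guest node 1 to host
node 0 could take "node=1,hostnode=0", the same string the command line
would accept; keys that are left out simply stay unset.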