From: Andre Przywara
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Mon, 1 Dec 2008 15:15:19 +0100
Message-ID: <4933F177.5040802@amd.com>
References: <492F1DD9.8030901@amd.com> <49318A10.7080801@redhat.com>
In-Reply-To: <49318A10.7080801@redhat.com>
To: Avi Kivity
Cc: kvm@vger.kernel.org, "Daniel P. Berrange", Andi Kleen

Avi Kivity wrote:
> Andre Przywara wrote:
>> The user (or better: management application) specifies the host nodes
>> the guest should use: -nodes 2,3 would create a two node guest mapped to
>> node 2 and 3 on the host. These numbers are handed over to libnuma:
>> VCPUs are pinned to the nodes and the allocated guest memory is bound to
>> its respective node. Since libnuma seems not to be installed
>> everywhere, the user has to enable this via configure --enable-numa
>> In the BIOS code an ACPI SRAT table was added, which describes the NUMA
>> topology to the guest. The number of nodes is communicated via the CMOS
>> RAM (offset 0x3E). If someone thinks of this as a bad idea, tell me.
>
> There exists now a firmware interface in qemu for this kind of
> communications.

Oh, right you are, I missed that (it was well hidden). I was looking at
how the BIOS detects memory size and CPU numbers, and these methods are
quite cumbersome. Why not convert them to the FW_CFG methods (which the
qemu side already sets)? Or is the point not to diverge too much from
the original BOCHS BIOS?

>> Node over-committing is allowed (-nodes 0,0,0,0), omitting the -nodes
>> parameter reverts to the old behavior.
>
> '-nodes' is too generic a name ('node' could also mean a host). Suggest
> -numanode.
>
> Need more flexibility: specify the range of memory per node, which cpus
> are in the node, relative weights for the SRAT table:
>
> -numanode node=1,cpu=2,cpu=3,start=1G,size=1G,hostnode=3

I converted my code to use the new firmware interface. This also makes
it possible to pass more information between qemu and the BIOS (the lack
of which prevented a more flexible command line in the first version).
So I would opt for the following:
- use numanode (or simply numa?) instead of the misleading -nodes
- allow passing memory sizes, VCPU subsets and host CPU pin info
I would prefer Daniel's version:
-numa [,mem:[;...]] [,cpu:[;...]] [,pin:[;...]]

That would allow easy things like -numa 2 (for a two-node guest);
options not given would fall back to defaults (equally split-up
resources).
The only problem is the default for the host side, as libnuma requires
the nodes to be named explicitly. Maybe make the pin: part _not_
optional? I would at least want to pin the memory; whether to pin the
VCPUs as well is open to discussion...

>
> Also need a monitor command to change host nodes dynamically:

Implementing a monitor interface is a good idea.

> (qemu) numanode 1 0

Does that include page migration? That would be easily possible with
mbind(MPOL_MF_MOVE), but it would take some time and resources (which I
think is OK if explicitly triggered in the monitor).
Any other useful commands for the monitor? Maybe (temporary) VCPU
migration without page migration?

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG,
Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register Court Dresden: HRA 4896, General Partner authorized
to represent: AMD Saxony LLC (Wilmington, Delaware, US)
General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
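
As an illustration of the "equally split-up resources" default mentioned
above for a bare -numa 2, here is a minimal sketch in C. The helper name
numa_split_evenly and the rule that any remainder goes to the
lowest-numbered nodes are assumptions made for this example; neither is
taken from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patches): split `total` units, e.g.
 * megabytes of guest RAM or a VCPU count, across `nr_nodes` guest nodes.
 *
 * Assumption: when `total` is not divisible by `nr_nodes`, the leftover
 * units are handed out one each to the lowest-numbered nodes.
 */
static void numa_split_evenly(uint64_t total, size_t nr_nodes,
                              uint64_t *per_node)
{
    assert(nr_nodes > 0 && per_node != NULL);

    uint64_t base = total / nr_nodes;   /* share every node receives */
    uint64_t extra = total % nr_nodes;  /* units left after even split */

    for (size_t i = 0; i < nr_nodes; i++) {
        /* the first `extra` nodes take one additional unit */
        per_node[i] = base + (i < extra ? 1 : 0);
    }
}
```

With this rule, 4096 MB over two nodes yields 2048 MB each, and eight
VCPUs over three nodes yields 3, 3 and 2. Explicit mem: or cpu: values
on the command line would simply replace the computed defaults.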