From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andre Przywara <andre.przywara@amd.com>
Subject: Re: [PATCH 0/3] v2: KVM-userspace: add NUMA support for guests
Date: Fri, 5 Dec 2008 16:22:37 +0100
Message-ID: <4939473D.6080606@amd.com>
References: <49392CB6.9000000@amd.com> <49393A78.5030601@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Avi Kivity <avi@redhat.com>, kvm@vger.kernel.org,
	"Daniel P. Berrange" <berrange@redhat.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Return-path: <kvm-owner@vger.kernel.org>
Received: from outbound-wa4.frontbridge.com ([216.32.181.16]:50732 "EHLO
	WA4EHSOBE003.bigfish.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753644AbYLEPWX (ORCPT <rfc822;kvm@vger.kernel.org>);
	Fri, 5 Dec 2008 10:22:23 -0500
In-Reply-To: <49393A78.5030601@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Anthony,

> This patch series needs to be posted to qemu-devel.  I know qemu doesn't 
> do true SMP yet, but it will in the relatively near future.  Either way, 
> some of the design points need review from a larger audience than 
> present on kvm-devel.
OK, I have already started looking at that. The first patch applies with 
only some fuzz, so no problems there. The second patch could be changed 
to pass the values via the firmware configuration interface only, 
leaving out the host-side pinning (which wouldn't make much sense 
without true SMP anyway).
The third patch is actually against the BOCHS BIOS, and I am confused 
here: I see the host side of the firmware config interface in QEMU SVN, 
but there is no sign of it being used from the BIOS side, neither in 
BOCHS CVS nor in qemu/pc-bios/bios.diff. Is the kvm-patched qemu the 
only user of the interface? If so, I would have to introduce the 
interface in QEMU's bios.diff (or better, send it to bochs-developers?).
Do you know which BOCHS version the bios.diff applies against? Is that 
the 2.3.7 release?

> I'm not a big fan of the libnuma dependency.  I'm willing to concede 
> this if there's a wide agreement that we should support this directly in 
> QEMU.
As long as QEMU is not true SMP, libnuma is rather useless. One could 
pin the memory to the appropriate host nodes, but without proper 
scheduling this doesn't make much sense. And rescheduling the qemu 
process each time a new VCPU is scheduled doesn't seem so smart, either.
So for QEMU we could just drop libnuma, at least for now. The current 
KVM patches already behave this way: if libnuma is not available, the 
whole host-side pinning is skipped.

> I don't think there's such a thing as a casual NUMA user.  The default 
> NUMA policy in Linux is node-local memory.  As long as a VM is smaller 
> than a single node, everything will work out fine.
Almost right, but simply calling qemu-system-x86_64 can lead to bad 
situations. I recently saw VCPU #0 scheduled on one node and VCPU #1 on 
another. This leads to random (probably excessive) remote accesses from 
the VCPUs, since the guest assumes uniform memory. Of course one could 
cure this small-guest case with numactl, but in my experience this tool 
isn't as well known as one would expect.
> 
> In the event that the VM is larger than a single node, if a user is 
> creating it via qemu-system-x86_64, they're going to either not care at 
> all about NUMA, or be familiar enough with the numactl tools that 
> they'll probably just want to use that.  Once you've got your head 
> around the fact that VCPUs are just threads and the memory is just a 
> shared memory segment, any knowledgeable sysadmin will have no problem 
> doing whatever sort of NUMA layout they want.
Really? How would you assign certain _parts_ of guest memory to host 
nodes with numactl? (Let alone the rather weird way of using -mempath, 
which is much easier done within QEMU.) The same applies to the threads. 
You can assign _all_ the threads to certain nodes, but pinning 
individual threads requires some tedious work (QEMU monitor or top, then 
taskset -p). Wouldn't it be OK if QEMU did this automatically (or at 
least gave some support here)?
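
To illustrate what "pinning a single thread" boils down to, here is a 
minimal sketch in Python (not code from the patches). It assumes a Linux 
host that exposes the node topology under /sys/devices/system/node, and 
that you already know the kernel thread id of the VCPU thread. The helper 
names are made up for this example.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty string: a node without CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node):
    """Read the CPUs belonging to one host NUMA node from sysfs (Linux only)."""
    path = "/sys/devices/system/node/node%d/cpulist" % node
    with open(path) as f:
        return parse_cpulist(f.read())


def pin_thread_to_node(tid, node):
    """Restrict a single thread (kernel thread id) to the CPUs of one host node."""
    # On Linux the first argument may be a thread id, not only a process id.
    os.sched_setaffinity(tid, node_cpus(node))
```

Calling pin_thread_to_node(tid, 1) has roughly the effect of running 
taskset -p on that thread with the CPU list of node 1. The tedious part, 
finding the right tid for each VCPU, is exactly what QEMU already knows.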

> The other case is where management tools are creating VMs.  In this 
> case, it's probably better to use numactl as an external tool because 
> then it keeps things consistent wrt CPU pinning.
> 
> There's also a good argument for not introducing CPU pinning directly to 
> QEMU.  There are multiple ways to effectively do CPU pinning.  You can 
> use taskset, you can use cpusets or even something like libcgroup.
I agree that pinning isn't the last word on the subject, but it works 
pretty well. However, I wouldn't burden the admin with pinning; I would 
rather let QEMU/KVM do it. Maybe one could introduce a way to tell 
QEMU/KVM not to pin the threads.
I also had the idea of starting with some sort of pinning (either 
automatic or user-chosen) and lifting the affinity later (after the 
thread has done some work and touched some memory). In this case Linux 
could (but probably will not easily) move the thread to another node. 
One could think about triggering this from a management app: if the app 
detects congestion on one node, it could first lift the affinity 
restriction of some VCPU threads to achieve better load balancing. If 
the situation persists (and doesn't turn out to be a short-lived peak), 
the manager could migrate the memory too and pin the VCPUs to the new 
node. I thought the migration and temporary un-pinning could be 
implemented in the monitor.
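
As a rough sketch of that two-stage idea (Python, purely illustrative): 
the function below decides what a management app might do after each 
monitoring interval. The state names, the action strings and the 
"patience" threshold are all assumptions of this example. Going back to 
the original pinning when a peak turns out to be short is also my 
assumption, nothing of this exists in the patches.

```python
PINNED, UNPINNED = "pinned", "unpinned"


def rebalance_step(mode, hot_intervals, congested, patience=3):
    """Decide the next action for one guest after one monitoring interval.

    mode          -- PINNED or UNPINNED (current state of the VCPU threads)
    hot_intervals -- consecutive congested intervals seen while unpinned
    congested     -- True if the guest's host node is overloaded right now
    patience      -- congested intervals to tolerate before migrating memory

    Returns a tuple (new_mode, new_hot_intervals, action).
    """
    if not congested:
        # Only a short peak (or no problem at all): restore the old pinning.
        action = "repin_original" if mode == UNPINNED else "none"
        return PINNED, 0, action
    if mode == PINNED:
        # First reaction: let the scheduler move the threads, memory stays.
        return UNPINNED, 1, "lift_affinity"
    if hot_intervals + 1 >= patience:
        # Congestion persists: move the memory and pin to the new node.
        return PINNED, 0, "migrate_memory_and_repin"
    return UNPINNED, hot_intervals + 1, "none"
```

The point of the intermediate unpinned state is that lifting the 
affinity is cheap and reversible, while migrating the memory is not.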

> If you refactor the series so that the libnuma patch is the very last 
> one and submit to qemu-devel, I'll review and apply all of the first 
> patches.  We can continue to discuss the last patch independently of the 
> first three if needed.
Sounds like a plan. I will start with this and hope for some advice on 
the BOCHS BIOS issue.

Thanks for your ideas!

Regards,
Andre.

> 
> Andre Przywara wrote:
>> Hi,
>>
>> this patch series introduces multiple NUMA nodes support within KVM 
>> guests.
>> This is the second try incorporating several requests from the list:
>> - use the QEMU firmware configuration interface instead of CMOS-RAM
>> - detect presence of libnuma automatically, can be disabled with
>>   ./configure --disable-numa
>> This only applies to the host side, the command line and guest (BIOS)
>> side are always built and functional, although this configuration
>> is only useful for research and debugging
>> - use a more flexible command line interface allowing:
>>   - specifying the distribution of memory across the guest nodes:
>>     mem:1536M;512M
>>   - specifying the distribution of the CPUs:
>>     cpu:0-2;3
>>   - specifying the host nodes the guest nodes should be pinned to:
>>     pin:3;2
>> All of these options are optional, in case of mem and cpu the 
>> resources are split equally across all guest nodes if omitted. Please 
>> note that at least in Linux SRAT takes precedence over E820, so the 
>> total usable memory will be the sum specified at the mem: option 
>> (although QEMU will still allocate the amount at -m).
>> If pin: is omitted, the guest nodes will be pinned to those host nodes 
>> where the threads happen to be scheduled at start-up time. This 
>> requires the (v)getcpu (v)syscall to be usable, this is true for 
>> kernels up from 2.6.19 and glibc >= 2.6 (sched_getcpu()). I have a 
>> hack if glibc doesn't support this, tell me if you are interested.
>> The only non-optional argument is the number of guest nodes, a 
>> possible command line looks like:
>> -numa 3,mem:1024M;512M;512M,cpu:0-1;2;3
>> Please note that you have to quote the semicolons on the shell.
>>
>> The monitor command is left out for now and will be sent later.
>>
>> Please apply.
>>
>> Regards,
>> Andre.
>>
>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
>>
> 
> 
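
For readers who want to play with the -numa syntax quoted above, here is 
a small Python sketch of how such an argument could be split up. It is 
not the parser from the patches. It only understands the forms shown in 
the description (sizes with an M suffix, single CPUs or first-last 
ranges); treating a bare number as megabytes is an assumption. Omitted 
options are returned as None, the equal split across nodes is left out.

```python
def parse_size_mb(text):
    """Convert a size such as "1536M" to megabytes."""
    if text.upper().endswith("M"):
        return int(text[:-1])
    return int(text)  # assumption: a bare number already means megabytes


def parse_cpu_range(text):
    """Expand "0-2" to [0, 1, 2] and "3" to [3]."""
    first, _, last = text.partition("-")
    return list(range(int(first), int(last or first) + 1))


def parse_numa(arg):
    """Split e.g. "3,mem:1024M;512M;512M,cpu:0-1;2;3" into its parts."""
    fields = arg.split(",")
    # The only mandatory part is the number of guest nodes.
    opts = {"nodes": int(fields[0]), "mem": None, "cpu": None, "pin": None}
    for field in fields[1:]:
        key, _, value = field.partition(":")
        parts = value.split(";")  # one entry per guest node
        if key == "mem":
            opts["mem"] = [parse_size_mb(p) for p in parts]
        elif key == "cpu":
            opts["cpu"] = [parse_cpu_range(p) for p in parts]
        elif key == "pin":
            opts["pin"] = [int(p) for p in parts]
        else:
            raise ValueError("unknown -numa option: %r" % key)
    return opts
```

With the example from the description, parse_numa() yields three guest 
nodes with 1024, 512 and 512 MB and the CPU sets [0, 1], [2] and [3].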


-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x84917