From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: [PATCH 0/3][RFC] NUMA: add host side pinning
Date: Thu, 24 Jun 2010 12:58:53 +0200
Message-ID: <4C233A6D.7030805@amd.com>
References: <1277327377-29629-1-git-send-email-andre.przywara@amd.com> <4C2288DD.3020207@codemonkey.ws> <865764AB-4E51-4ED4-8832-AED6A237A9D3@suse.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Anthony Liguori, "kvm@vger.kernel.org"
To: Alexander Graf
Return-path: Received: from outbound-va3.frontbridge.com ([216.32.180.16]:23216 "EHLO VA3EHSOBE009.bigfish.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754585Ab0FXLCf (ORCPT); Thu, 24 Jun 2010 07:02:35 -0400
In-Reply-To: <865764AB-4E51-4ED4-8832-AED6A237A9D3@suse.de>
Sender: kvm-owner@vger.kernel.org
List-ID:

Alexander Graf wrote:
> On 24.06.2010, at 00:21, Anthony Liguori wrote:
>
>> On 06/23/2010 04:09 PM, Andre Przywara wrote:
>>> Hi,
>>>
>>> these three patches add basic NUMA pinning to KVM. According to a
>>> user-provided assignment, parts of the guest's memory will be bound
>>> to different host nodes. This should increase performance in large
>>> virtual machines and on loaded hosts.
>>> These patches are quite basic (but work), and I send them as an RFC
>>> to get some feedback before implementing stuff in vain.
>>>
>>> To use it you need to provide a guest NUMA configuration; this could
>>> be as simple as "-numa node -numa node" to give two nodes in the
>>> guest. Then you pin these nodes to different host nodes with a
>>> separate command line option:
>>> "-numa pin,nodeid=0,host=0 -numa pin,nodeid=1,host=2"
>>> This separation of host and guest config sounds a bit complicated,
>>> but was demanded the last time I submitted a similar version.
>>> I refrained from binding the vCPUs to physical CPUs for now, but this
>>> can be added later with a "cpubind" option to "-numa pin,".
>>> Also this could be done from a management application by using
>>> sched_setaffinity().
>>>
>>> Please note that this is currently made for qemu-kvm, although I am
>>> not up to date regarding the current status of upstream QEMU's true
>>> SMP capabilities. The final patch will be made against upstream QEMU
>>> anyway. Also this is currently for Linux hosts (any other KVM hosts
>>> alive?) and for PC guests only. I think both can be fixed easily if
>>> someone requests it (and gives me a pointer to further information).
>>>
>>> Please comment on the approach in general and the implementation.
>>>
>> If we integrated -mem-path with -numa such that a different path could
>> be used with each NUMA node (and we let an explicit file be specified
>> instead of just a directory), then if I understand correctly, we could
>> use numactl without any specific integration in qemu. Does this sound
>> correct?
>>
>> IOW:
>>
>> qemu -numa node,mem=1G,nodeid=0,cpus=0-1,memfile=/dev/shm/node0.mem -numa node,mem=2G,nodeid=1,cpus=1-2,memfile=/dev/shm/node1.mem
>>
>> It's then possible to say:
>>
>> numactl --file /dev/shm/node0.mem --interleave=0,1
>> numactl --file /dev/shm/node1.mem --membind=2
>>
>> I think this approach is nicer because it gives the user a lot more
>> flexibility without having us chase other tools like numactl. For
>> instance, your patches only support pinning, not interleaving.
>
> Interesting idea.
>
> So who would create the /dev/shm/nodeXX files?

Currently it is QEMU. It creates a somewhat unique filename, opens it and
unlinks it. The difference would be to name the file after the option and
to not unlink it.

> I can imagine starting numactl before qemu, even though that's
> cumbersome. I don't think it's feasible to start numactl after
> qemu is running. That'd involve way too much magic; I'd prefer
> qemu to call numactl itself.
With the current code the files would not exist before QEMU allocates its
RAM, and after that QEMU could already have touched pages before numactl
sets the policy. To avoid this I'd like to see the pinning done from
within QEMU. I am not sure whether calling numactl via system() and
friends is OK; I'd prefer to issue the syscalls directly (as in patch
3/3) and pull the necessary options into the "-numa pin,..." command
line. We could mimic numactl's syntax here.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12