From mboxrd@z Thu Jan  1 00:00:00 1970
From: Anthony Liguori <anthony@codemonkey.ws>
Subject: Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
Date: Mon, 23 Aug 2010 16:27:01 -0500
Message-ID: <4C72E7A5.5090302@codemonkey.ws>
References: <1281534738-8310-1-git-send-email-andre.przywara@amd.com> <1281534738-8310-5-git-send-email-andre.przywara@amd.com> <20100823185958.GC32690@amt.cnet> <4C72CBA5.1020805@codemonkey.ws> <4C72E548.4030701@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	"avi@redhat.com" <avi@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
To: Andre Przywara <andre.przywara@amd.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-iw0-f174.google.com ([209.85.214.174]:49685 "EHLO
	mail-iw0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1754114Ab0HWV1E (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 23 Aug 2010 17:27:04 -0400
Received: by iwn5 with SMTP id 5so3850454iwn.19
        for <kvm@vger.kernel.org>; Mon, 23 Aug 2010 14:27:03 -0700 (PDT)
In-Reply-To: <4C72E548.4030701@amd.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 08/23/2010 04:16 PM, Andre Przywara wrote:
> Anthony Liguori wrote:
>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>> According to the user-provided assignment, bind the respective part
>>>> of the guest's memory to the given host node. This uses Linux'
>>>> mbind syscall (which is wrapped only in libnuma) to realize the
>>>> pinning right after the allocation.
>>>> Failures are not fatal, but produce a warning.
>>>>
>>>> Signed-off-by: Andre Przywara<andre.przywara@amd.com>
>>> ...
>>> Why is it not possible (or perhaps not desired) to change the binding
>>> after the guest is started?
>>>
>>> Sounds inflexible.
> The solution is to introduce a monitor interface to later adjust the 
> pinning, allowing both changing the affinity only (only valid for 
> future fault-ins) and actually copying the memory (more costly).

This is just duplicating numactl.

> Actually this is the next item on my list, but I wanted to bring up 
> the basics first to avoid recoding parts afterwards. Also I am not 
> (yet) familiar with the QMP protocol.
>>
>> We really need a solution that lets a user use a tool like numactl 
>> outside of the QEMU instance.
> I fear that is not how it's meant to work with the Linux NUMA API. In 
> contrast to the VCPU threads, which are externally visible entities 
> (PIDs), the memory should be private to the QEMU process. While you 
> can change the NUMA allocation policy of the _whole_ process, there is 
> no way to externally distinguish parts of the process' memory. 
> Although you could later (and externally) migrate already faulted 
> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you 
> would let an external tool interfere with QEMU's internal memory 
> management. Take for instance the change of the allocation policy 
> regarding the 1MB and 3.5-4GB holes. An external tool would have to 
> either track such changes or you simply could not change such things 
> in QEMU.

It's extremely likely that if you're doing NUMA pinning, you're also 
using large pages via hugetlbfs.  numactl can already set policies for 
files in hugetlbfs, so all you need is a separate hugetlbfs file for 
each NUMA node.

Then you have all the flexibility of numactl and you can implement node 
migration external to QEMU if you so desire.
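
As a rough sketch of what the external side could look like, the 
following builds one numactl invocation per guest node.  It assumes QEMU 
backed each guest node's RAM with its own hugetlbfs file, which it does 
not do today.  The mount point, the file naming scheme and the helper 
name are all hypothetical; only the numactl options (--length, 
--membind, --file) are existing ones.

```python
from typing import Dict, List


def per_node_numactl_commands(
    node_sizes_mb: Dict[int, int],
    mount: str = "/hugepages",  # hypothetical hugetlbfs mount point
) -> List[List[str]]:
    """Build one numactl argv per guest node.

    Each command binds a per-node backing file to the host node with
    the same number. The "guest-node<N>" file name is an assumption
    made for illustration, not something QEMU creates.
    """
    commands = []
    for node, size_mb in sorted(node_sizes_mb.items()):
        if size_mb <= 0:
            raise ValueError(f"node {node}: size must be positive")
        path = f"{mount}/guest-node{node}"
        commands.append(
            [
                "numactl",
                f"--length={size_mb}m",  # size of the region the policy covers
                f"--membind={node}",     # allocate only from this host node
                "--file",
                path,
            ]
        )
    return commands


if __name__ == "__main__":
    # Example: a guest with 1 GB on node 0 and 2 GB on node 1.
    for argv in per_node_numactl_commands({0: 1024, 1: 2048}):
        print(" ".join(argv))
```

The commands are only built here, not executed, so a management tool 
could log or adjust them before running them.  Rebinding a file to a 
different node later would be another numactl call from the same tool, 
with no new QEMU interface involved.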

> So what is wrong with keeping that code in QEMU, which knows best 
> about the internals and already has flexible and mighty ways (command 
> line and QMP) of manipulating its behavior?

NUMA is a last-mile optimization.  For the audience that cares about 
this level of optimization, only providing an interface that allows a 
small set of those optimizations to be used is unacceptable.

There's a very simple way to do this right, and that's by adding 
interfaces to QEMU that let us work with existing tooling instead of 
inventing new interfaces.

Regards,

Anthony Liguori

> Regards,
> Andre.
>