From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
Date: Thu, 9 Sep 2010 22:00:17 +0200
Message-ID: <4C893CD1.3080608@amd.com>
References: <1281534738-8310-1-git-send-email-andre.przywara@amd.com>
 <1281534738-8310-5-git-send-email-andre.przywara@amd.com>
 <20100823185958.GC32690@amt.cnet> <4C72CBA5.1020805@codemonkey.ws>
 <4C72E548.4030701@amd.com> <4C72E7A5.5090302@codemonkey.ws>
 <1283288095.4439.9.camel@localhost.localdomain>
 <4C7D7C2A.7000205@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "habanero@linux.vnet.ibm.com", Marcelo Tosatti, "avi@redhat.com",
 "kvm@vger.kernel.org"
To: Anthony Liguori
Return-path:
Received: from tx2ehsobe003.messaging.microsoft.com ([65.55.88.13]:11018
 "EHLO TX2EHSOBE006.bigfish.com" rhost-flags-OK-OK-OK-OK) by
 vger.kernel.org with ESMTP id S1753652Ab0IIUAl (ORCPT);
 Thu, 9 Sep 2010 16:00:41 -0400
In-Reply-To: <4C7D7C2A.7000205@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID:

Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
>> On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
>>>
>>> On 08/23/2010 04:16 PM, Andre Przywara wrote:
>>>> Anthony Liguori wrote:
>>>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>>>> Sorry for the delay in this discussion, I was busy with other
>>>>>> things.
...
>>>>
>>> It's extremely likely that if you're doing NUMA pinning, you're also
>>> doing large pages via hugetlbfs. numactl can already set policies for
>>> files in hugetlbfs, so all you need to do is have a separate hugetlbfs
>>> file for each NUMA node.
>>>
>> Why would we resort to hugetlbfs when we have transparent hugepages?
>>
>
> If you care about NUMA pinning, I can't believe you don't want
> guaranteed large page allocation, which THP does not provide.
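(For reference, spelled out, the per-node hugetlbfs scheme you describe would look roughly like the sketch below. The mount point, file names, sizes and node numbers are made up for illustration, and the per-node -mem-path variant does not exist yet:)

```shell
# hypothetical per-node hugetlbfs backing, one file per guest NUMA node
mount -t hugetlbfs none /hugepages

# numactl places a memory policy on each backing file
# (sizes and host nodes are illustrative only)
numactl --membind=0 --length=2g --file /hugepages/guest-node0
numactl --membind=1 --length=2g --file /hugepages/guest-node1

# QEMU would then need a per-node variant of -mem-path to pick these up
```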
I doubt that anyone _wants_ to care about NUMA. You only _have_ to care
about it sometimes, mostly when your performance drops as you scale up.
So I wouldn't consider NUMA a special HPC-only scenario, since virtually
every recent server has NUMA. I don't want to tell people that their
shiny new 48-core, 96 GB RAM box can only run VMs with at most 8 GB of
RAM and no more than 6 cores. So I don't want to see NUMA tied to the
(IMHO clumsy) hugetlbfs interface when everyone else uses THP and is
happy with that.

> The general point though is that we should find a way to partition
> memory in qemu such that an external process can control the actual
> NUMA placement. This gives us maximum flexibility.

Why not do this from the management application and use the QEMU
monitor protocol?

> Otherwise, what do we implement in QEMU? Direct pinning of memory to
> nodes? Can we migrate memory between nodes? Should we support
> interleaving memory between two virtual nodes? Why pick and choose
> when we can have it all.

We use either libnuma or the interface the kernel provides. I don't see
how this restricts us. In general I don't see why you care so much
about avoiding duplicating numactl. numactl is only a small wrapper
around libnuma, which is itself a wrapper around the kernel interface
(mostly mbind and set_mempolicy). numactl itself mostly provides
command line parsing, which is no rocket science and is even partly
provided by the library itself. I only couldn't use it because the
comma is used in QEMU's own command line parsing.

>> FWIW, large apps like databases have set a precedent for managing
>> their own NUMA policies.
>
> Of course, because they know what their NUMA policy should be. They
> live in a simple world where they assume they're the only application
> on the system: they read the distance tables, figure they'll use XX%
> of all physical memory, and then pin how they see fit.
>
> But an individual QEMU process lives in a complex world.
> It's almost never the only thing on the system, and it's only allowed
> to use a subset of resources. It's not sure what set of resources it
> can and can't use, and that's often changing. The topology chosen for
> a guest is static, but its host topology may be dynamic due to things
> like live migration.
>
> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.
> Instead, it needs to let something with a larger view of the system
> determine a NUMA policy that makes sense overall.

That's why I want to provide an interface to an external management
application (which you probably have anyway).

> There are two ways we can do this. We can implement monitor commands
> that attempt to expose every single NUMA tunable possible. Or, we can
> tie into the existing commands, which guarantees that we support every
> possible tunable and that, as NUMA support in Linux evolves, we get
> all the new features for free.

I don't think there are so many tuning parameters that QEMU cannot
sanely implement them. And your approach basically means that the virt
management app calls numactl, which in turn tunes QEMU. I consider the
direct way cleaner.

> And, since numactl already supports setting policies on files in
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path
> to work per-node instead of globally. And it's useful to implement
> other types of things like having one node be guaranteed large pages
> and another node THP or some other fanciness.
>
> Sounds awfully appealing to me.

I see your point, it would be rather easy to just do so. But on the
other hand I also see the deficiencies of this approach:

1. We tie ourselves to hugetlbfs, which I consider broken and on its
way out. Once THP is upstream (I hope it's a question of when, not if),
I don't want to be tied to the old way just because I have a NUMA
machine and need larger guests.

2. We need to expose QEMU's internal guest memory layout and stick to
that "interface".
Currently we allocate it all in one large chunk, but it wasn't always
so and may change again in the future. What about the dynamic memory
approaches, would these still work with visibly mmap'ed files?

3. We inherit the shortcomings of hugetlbfs: not being able to swap
out, the need to allocate all memory early, the need to reserve it
beforehand, and the missing possibility to scatter it again.

I don't want NUMA to become a second-class citizen, in the sense that
using it restricts the possibilities.

>> I don't see why qemu should be any different.
>> Numactl is great for small apps that need to be pinned to one node,
>> or spread evenly over all nodes. Having to get hugetlbfs involved
>> just to work around a shortcoming of numactl just seems like a bad
>> idea.
>
> You seem to be asserting that we should implement a full NUMA policy
> in QEMU. What should it be when we don't (in QEMU) know what else is
> running on the system?

?? Nobody wants QEMU to do the assignment itself, that is all the job
of a management application. We just provide an interface (which QEMU
owns!) to allow control.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12