public inbox for kvm@vger.kernel.org
From: Anthony Liguori <anthony@codemonkey.ws>
To: habanero@linux.vnet.ibm.com
Cc: Andre Przywara <andre.przywara@amd.com>,
	Marcelo Tosatti <mtosatti@redhat.com>,
	"avi@redhat.com" <avi@redhat.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
Date: Tue, 31 Aug 2010 17:03:22 -0500
Message-ID: <4C7D7C2A.7000205@codemonkey.ws>
In-Reply-To: <1283288095.4439.9.camel@localhost.localdomain>

On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
>    
>> On 08/23/2010 04:16 PM, Andre Przywara wrote:
>>      
>>> Anthony Liguori wrote:
>>>        
>>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
>>>>          
>>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
>>>>>            
>>>>>> According to the user-provided assignment bind the respective part
>>>>>> of the guest's memory to the given host node. This uses Linux'
>>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
>>>>>> pinning right after the allocation.
>>>>>> Failures are not fatal, but produce a warning.
>>>>>>
>>>>>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
>>>>>> ...
>>>>>>              
>>>>> Why is it not possible (or perhaps not desired) to change the binding
>>>>> after the guest is started?
>>>>>
>>>>> Sounds unflexible.
>>>>>            
>>> The solution is to introduce a monitor interface to later adjust the
>>> pinning, allowing both changing the affinity only (only valid for
>>> future fault-ins) and actually copying the memory (more costly).
>>>        
>> This is just duplicating numactl.
>>
>>      
>>> Actually this is the next item on my list, but I wanted to bring up
>>> the basics first to avoid recoding parts afterwards. Also I am not
>>> (yet) familiar with the QMP protocol.
>>>        
>>>> We really need a solution that lets a user use a tool like numactl
>>>> outside of the QEMU instance.
>>>>          
>>> I fear that is not how it's meant to work with Linux's NUMA API. In
>>> contrast to the VCPU threads, which are externally visible entities
>>> (PIDs), the memory should be private to the QEMU process. While you
>>> can change the NUMA allocation policy of the _whole_ process, there is
>>> no way to externally distinguish parts of the process' memory.
>>> Although you could later (and externally) migrate already faulted
>>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
>>> would let an external tool interfere with QEMU's internal memory
>>> management. Take for instance the change of the allocation policy
>>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
>>> either track such changes or you simply could not change such things
>>> in QEMU.
>>>        
>> It's extremely likely that if you're doing NUMA pinning, you're also
>> doing large pages via hugetlbfs.  numactl can already set policies for
>> files in hugetlbfs so all you need to do is have a separate hugetlbfs
>> file for each numa node.
>>      
> Why would we resort to hugetlbfs when we have transparent hugepages?
>    

If you care about NUMA pinning, I can't believe you don't want 
guaranteed large page allocation which THP does not provide.

The general point though is that we should find a way to partition 
memory in qemu such that an external process can control the actual NUMA 
placement.  This gives us maximum flexibility.

Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
nodes?  Can we migrate memory between nodes?  Should we support 
interleaving memory between two virtual nodes?  Why pick and choose when 
we can have it all?

> FWIW, large apps like databases have set a precedent for managing their
> own NUMA policies.

Of course, because they know what their NUMA policy should be.  They live 
in a simple world where they assume they're the only application in the 
system, they read the distance tables, figure they'll use XX% of all 
physical memory, and then pin how they see fit.

But an individual QEMU process lives in a complex world.  It's almost 
never the only thing on the system and it's only allowed to use a subset 
of resources.  It's not sure what set of resources it can and can't use, 
and that set is often changing.  The topology chosen for a guest is 
static but its host topology may be dynamic due to things like live 
migration.

In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.  
Instead, it needs to let something with a larger view of the system 
determine a NUMA policy that makes sense overall.

There are two ways we can do this.  We can implement monitor commands 
that attempt to expose every single NUMA tunable possible.  Or, we can 
tie into the existing commands, which guarantees that we support every 
possible tunable and that as NUMA support in Linux evolves, we get all 
the new features for free.

And, since numactl already supports setting policies on files in 
hugetlbfs, all we need is a simple change to qemu to allow -mem-path to 
work per-node instead of globally.  That is also useful for implementing 
other things, like having one node be guaranteed large pages and 
another node THP or some other fanciness.

Sounds awfully appealing to me.
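
As a rough sketch of the external side of this, assuming qemu backed each 
guest node with its own hugetlbfs file (it does not today; the file paths, 
the NodeBacking type and the numactl_bind_cmds helper below are all 
hypothetical), a management tool could build one numactl call per guest 
node.  The option order follows the synopsis in the numactl man page 
(--length, then --file, then the policy); check it against the numactl 
version you actually have.

```python
from typing import List, NamedTuple


class NodeBacking(NamedTuple):
    """Hypothetical description of one guest node's backing file."""
    path: str       # assumed per-guest-node file on a hugetlbfs mount
    length: str     # region size in numactl syntax, e.g. "2g"
    host_node: int  # host NUMA node the region should be bound to


def numactl_bind_cmds(backings: List[NodeBacking]) -> List[List[str]]:
    """Build one numactl argv per guest node; nothing is executed here."""
    cmds = []
    for b in backings:
        cmds.append([
            "numactl",
            "--length", b.length,        # size of the region to set policy on
            "--file", b.path,            # the file the policy applies to
            f"--membind={b.host_node}",  # policy: bind to a single host node
        ])
    return cmds


if __name__ == "__main__":
    # Example layout: two guest nodes, each bound to a different host node.
    layout = [
        NodeBacking("/hugepages/guest0-node0", "2g", 0),
        NodeBacking("/hugepages/guest0-node1", "2g", 1),
    ]
    for argv in numactl_bind_cmds(layout):
        print(" ".join(argv))
```

The point of keeping this outside qemu is that the tool building these 
commands can see the whole host, and any other numactl policy (interleave, 
preferred, ...) can be swapped in without touching qemu.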

>    I don't see why qemu should be any different.
> Numactl is great for small apps that need to be pinned in one node, or
> spread evenly on all nodes.  Having to get hugetlbfs involved just to
> work around a shortcoming of numactl just seems like a bad idea.
>    

You seem to be asserting that we should implement a full NUMA policy in 
QEMU.  What should it be when we don't (in QEMU) know what else is 
running on the system?

Regards,

Anthony Liguori



Thread overview: 13+ messages
2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
2010-08-11 13:52 ` [PATCH 2/4] NUMA: add Linux libnuma detection Andre Przywara
2010-08-11 13:52 ` [PATCH 3/4] NUMA: parse new host dependent command line options Andre Przywara
2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
2010-08-23 18:59   ` Marcelo Tosatti
2010-08-23 19:27     ` Anthony Liguori
2010-08-23 21:16       ` Andre Przywara
2010-08-23 21:27         ` Anthony Liguori
2010-08-31 20:54           ` Andrew Theurer
2010-08-31 22:03             ` Anthony Liguori [this message]
2010-09-01  3:38               ` Andrew Theurer
2010-09-09 20:00               ` Andre Przywara
