From: Andrew Theurer <habanero@linux.vnet.ibm.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Andre Przywara <andre.przywara@amd.com>,
Marcelo Tosatti <mtosatti@redhat.com>,
"avi@redhat.com" <avi@redhat.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
Date: Tue, 31 Aug 2010 22:38:45 -0500
Message-ID: <1283312325.3936.20.camel@localhost.localdomain>
In-Reply-To: <4C7D7C2A.7000205@codemonkey.ws>
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>
> >>> Anthony Liguori wrote:
> >>>
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>
> >>>>>> According to the user-provided assignment bind the respective part
> >>>>>> of the guest's memory to the given host node. This uses Linux'
> >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>>>> pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara <andre.przywara@amd.com>
> >>>>>> ...
> >>>>>>
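(For context, the mechanism the patch describes boils down to roughly the
sketch below. This is illustrative only, not the patch's actual code: the
function name and the single-region, single-node shape are my assumptions.
Build against libnuma, i.e. link with -lnuma:

    #include <stdio.h>
    #include <stddef.h>
    #include <numaif.h>   /* mbind(), MPOL_BIND -- libnuma's syscall wrappers */

    /* Pin a freshly allocated guest RAM region to one host node.
     * As in the patch description, failure is a warning, not fatal. */
    static void pin_guest_region(void *addr, size_t len, int host_node)
    {
        unsigned long nodemask = 1UL << host_node;

        /* Future page faults in [addr, addr+len) allocate on host_node. */
        if (mbind(addr, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0) {
            perror("warning: mbind failed, memory not pinned");
        }
    }
)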
> >>>>> Why is it not possible (or perhaps not desired) to change the binding
> >>>>> after the guest is started?
> >>>>>
> >>>>> Sounds inflexible.
> >>>>>
> >>> The solution is to introduce a monitor interface to adjust the
> >>> pinning later, allowing both changing the affinity alone (which only
> >>> takes effect for future fault-ins) and actually copying the memory
> >>> (more costly).
> >>>
> >> This is just duplicating numactl.
> >>
> >>
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>
> >>> I fear that is not how it's meant to work with the Linux NUMA API. In
> >>> contrast to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process's memory.
> >>> Although you could later (and externally) migrate already-faulted
> >>> pages (via move_pages(2) and by looking at /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMU's internal memory
> >>> management. Take for instance a change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
> >>> either track such changes, or you simply could not change such things
> >>> in QEMU.
> >>>
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs. numactl can already set policies for
> >> files in hugetlbfs, so all you need to do is have a separate hugetlbfs
> >> file for each NUMA node.
> >>
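(Concretely, Anthony's suggestion would look something like this from
outside qemu; the hugetlbfs mount point, file names, and sizes here are
made up for illustration:

    # one hugetlbfs-backed file per guest node, policy set externally
    numactl --membind=0 --offset=0 --length=2G --file /hugepages/guest-node0
    numactl --membind=1 --offset=0 --length=2G --file /hugepages/guest-node1
)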
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >
>
> If you care about NUMA pinning, I can't believe you don't want
> guaranteed large page allocation, which THP does not provide.
I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place CPUs and memory, especially for large
VMs that need to be defined as multi-node. For user-defined pinning,
libhugetlbfs will probably be fine, but for most VMs I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.
> The general point though is that we should find a way to partition
> memory in qemu such that an external process can control the actual NUMA
> placement. This gives us maximum flexibility.
>
> Otherwise, what do we implement in QEMU? Direct pinning of memory to
> nodes? Can we migrate memory between nodes? Should we support
> interleaving memory between two virtual nodes? Why pick and choose when
> we can have it all?
If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from it. Is there another way to change NUMA
policies on mappings from a user tool? We can already inspect them
with /proc/<pid>/numa_maps. Is this something that could be added to
numactl?
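The closest thing I'm aware of is move_pages(2), which does take a pid,
though it migrates pages once rather than installing a policy for future
allocations. A rough sketch (link with -lnuma); the page list here is
assumed to have been parsed out of numa_maps beforehand:

    #include <stdio.h>
    #include <sys/types.h>
    #include <numaif.h>   /* move_pages(), MPOL_MF_MOVE via libnuma */

    /* Migrate already-faulted pages of another process to target_node.
     * 'pages' holds page-aligned addresses in the target's address space,
     * e.g. gathered from /proc/<pid>/numa_maps. */
    static int migrate_to_node(pid_t pid, void **pages,
                               unsigned long count, int target_node)
    {
        int nodes[count];     /* desired node for each page */
        int status[count];    /* per-page result, or negative errno */
        unsigned long i;

        for (i = 0; i < count; i++)
            nodes[i] = target_node;

        if (move_pages(pid, count, pages, nodes, status,
                       MPOL_MF_MOVE) != 0) {
            perror("move_pages");
            return -1;
        }
        return 0;
    }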
>
> > FWIW, large apps like databases have set a precedent for managing their
> > own NUMA policies.
>
> Of course, because they know what their NUMA policy should be. They live
> in a simple world where they assume they're the only application on the
> system; they read the distance tables, figure they'll use XX% of all
> physical memory, and then pin how they see fit.
>
> But an individual QEMU process lives in a complex world. It's almost
> never the only thing on the system, and it's only allowed to use a subset
> of resources. It's not sure what set of resources it can and can't use,
> and that often changes. The topology chosen for a guest is
> static, but its host topology may be dynamic due to things like live
> migration.
True, that's why this would require support for changing it via the monitor.
> In short, QEMU absolutely cannot implement a NUMA policy in a vacuum.
> Instead, it needs to let something with a larger view of the system
> determine a NUMA policy that makes sense overall.
I agree.
> There are two ways we can do this. We can implement monitor commands
> that attempt to expose every single NUMA tunable possible. Or we can
> tie into the existing commands, which guarantees that we support every
> possible tunable and that, as NUMA support in Linux evolves, we get all
> the new features for free.
Assuming there's nothing new that needs to be exposed in qemu to work with
whatever new features numactl/libnuma gain. But perhaps that's a lot
less likely.
> And, since numactl already supports setting policies on files in
> hugetlbfs, all we need is a simple change to qemu to allow -mem-path to
> work per-node instead of globally. And it's useful for implementing other
> kinds of things, like having one node with guaranteed large pages and
> another with THP, or some other fanciness.
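(Hypothetically -- this per-node syntax does not exist today and is purely
an illustration of the proposal:

    qemu -numa node,nodeid=0,mem=2048,mem-path=/hugepages/node0 \
         -numa node,nodeid=1,mem=2048,mem-path=/hugepages/node1

with numactl then setting a different policy on each backing file.)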
If it were not dependent on hugetlbfs, then I don't think I would have
an issue.
> Sounds awfully appealing to me.
>
> > I don't see why qemu should be any different.
> > Numactl is great for small apps that need to be pinned to one node, or
> > spread evenly on all nodes. Having to get hugetlbfs involved just to
> > work around a shortcoming of numactl seems like a bad idea.
> >
>
> You seem to be asserting that we should implement a full NUMA policy in
> QEMU. What should it be when we don't (in QEMU) know what else is
> running on the system?
I don't think qemu itself should decide where to "be" on the system.
I would like to have -something- else make those decisions, either a
user or some management daemon that looks at the whole picture. Or <gulp>
get the scheduler involved (with new algorithms).
I am still quite curious whether numactl/libnuma could be extended to set
some policies on individual mappings. Then we would not even need to
have multiple -mem-path's.
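For what it's worth, libnuma can already set per-mapping policy, but only
from inside the owning process. A minimal sketch (link with -lnuma), with
addr/len/node standing in for a real mapping:

    #include <stddef.h>
    #include <numa.h>

    /* In-process only: there is no pid-taking equivalent today,
     * which is exactly the gap discussed above. */
    static void bind_mapping(void *addr, size_t len, int node)
    {
        numa_tonode_memory(addr, len, node);
    }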
-Andrew
>
> Regards,
>
> Anthony Liguori
>
Thread overview: 13+ messages
2010-08-11 13:52 [PATCH 0/4]: NUMA: add host binding Andre Przywara
2010-08-11 13:52 ` [PATCH 1/4] NUMA: change existing NUMA guest code to use new bitmap implementation Andre Przywara
2010-08-11 13:52 ` [PATCH 2/4] NUMA: add Linux libnuma detection Andre Przywara
2010-08-11 13:52 ` [PATCH 3/4] NUMA: parse new host dependent command line options Andre Przywara
2010-08-11 13:52 ` [PATCH 4/4] NUMA: realize NUMA memory pinning Andre Przywara
2010-08-23 18:59 ` Marcelo Tosatti
2010-08-23 19:27 ` Anthony Liguori
2010-08-23 21:16 ` Andre Przywara
2010-08-23 21:27 ` Anthony Liguori
2010-08-31 20:54 ` Andrew Theurer
2010-08-31 22:03 ` Anthony Liguori
2010-09-01 3:38 ` Andrew Theurer [this message]
2010-09-09 20:00 ` Andre Przywara