From: Andi Kleen <andi@firstfloor.org>
To: Avi Kivity <avi@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>,
Andre Przywara <andre.przywara@amd.com>,
kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Sun, 30 Nov 2008 18:42:51 +0100
Message-ID: <20081130174250.GY6703@one.firstfloor.org>
In-Reply-To: <4932C94C.1090504@redhat.com>
On Sun, Nov 30, 2008 at 07:11:40PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >On Sun, Nov 30, 2008 at 06:38:14PM +0200, Avi Kivity wrote:
> >
> >>The guest allocates when it touches the page for the first time. This
> >>means very little since all of memory may be touched during guest bootup
> >>or shortly afterwards. Even if not, it is still a one-time operation,
> >>and any choices we make based on it will last the lifetime of the guest.
> >>
> >
> >I was more thinking about some heuristics that checks when a page
> >is first mapped into user space. The only problem is that it is zeroed
> >through the direct mapping before, but perhaps there is a way around it.
> >That's one of the rare cases when 32bit highmem actually makes things
> >easier.
> >It might be also easier on some other OS than Linux who don't use
> >direct mapping that aggressively.
> >
>
> In the context of kvm, the mmap() calls happen before the guest ever
The mmap() call doesn't matter at all; what matters is when the
page is allocated.
> executes. First access happens somewhat later, but still we cannot
> count on the majority of accesses to come from the same cpu as the first
> access.
It is a reasonable heuristic. It's just like the rather
successful default local allocation heuristic the native kernel uses.
> >
> >The alternative is to keep your own pools and allocate from the
> >correct pool, but then you either need pinning or getcpu()
> >
>
> This is meaningless in kvm context. Other than small bits of memory
> needed for I/O and shadow page tables, the bulk of memory is allocated
> once.
Mapped once. Anyway, that could be changed too if there were a need.
>
> >>We need to mimic real hardware.
> >>
> >
> >The underlying allocation is in pages, so the NUMA affinity can
> >be as well handled by this.
> >
> >Basic algorithm:
> >- If guest touches virtual node that is the same as the local node
> >of the current vcpu assume it's a local allocation.
> >
>
> The guest is not making the same assumption; lying to the guest is
Huh? Pretty much any NUMA-aware OS should. Linux definitely will.
> (1) with npt/ept we have no clue as to guest mappings
Yes, that is tricky. In theory it could be made to work with EPT
using accessed (A) bits, but EPT has none, and it would still
not work very well.
> (2) even without npt/ept, we have no idea how often mappings are used
> and by which cpu. finding out is expensive.
You see a fault on the first mapping. That fault is on the CPU that
did the access. Therefore you know which one it was.
> (3) for many workloads, there are no unused pages. the guest
> application allocates all memory and manages memory by itself.
First, a common case of a guest using all memory is file cache,
but for NUMA purposes file cache locality typically doesn't
matter: it isn't accessed frequently enough for non-locality
to be a problem. Locality really only matters for mappings
that the CPU uses often.
When a single application allocates everything and keeps it, that is fine
too, because you'll give it approximately local memory on the initial
setup (assuming the application itself has reasonable NUMA behaviour
under a first-touch local allocation policy).
When there is a lot of remapping or process turnover, one would probably
need some heuristic to detect reallocations, like the mapping heuristic
I described earlier, or PV help.
> Right. The situation I'm trying to avoid is process A with memory on
> node X running on node Y, and process B with memory on node Y running on
> node X. The scheduler arrives at a local optimum, caused by some
> spurious load, and won't move to the global optimum because migrating
> processes across cpus is considered expensive.
>
> I don't know, perhaps the current scheduler is clever enough to do this
> already.
It tries to, but there are always extreme cases where it doesn't work.
Also, once a process has been migrated it won't find its way back to its
memory. Still, for an approximate dynamic solution, trusting it is not
the worst you can do.
-Andi
--
ak@linux.intel.com