public inbox for kvm@vger.kernel.org
From: Avi Kivity <avi@redhat.com>
To: Andi Kleen <andi@firstfloor.org>
Cc: Andre Przywara <andre.przywara@amd.com>, kvm@vger.kernel.org
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Sun, 30 Nov 2008 22:07:01 +0200	[thread overview]
Message-ID: <4932F265.8000204@redhat.com> (raw)
In-Reply-To: <20081130185558.GZ6703@one.firstfloor.org>

Andi Kleen wrote:
>> The page is allocated at an uninteresting point in time.  For example, 
>> the boot loader allocates a bunch of pages.
>>     
>
> The far majority of pages are allocated when a process wants them
> or the kernel uses them for file cache.
>   

Right.  Allocated from the guest kernel's perspective.  This may be 
different from the host kernel's perspective.

Linux will delay touching memory until the last moment; Windows will not 
(likely it zeros pages on their own nodes, but who knows?).

The bigger problem is lifetime.  Inside a guest, 'allocation' happens 
when a page is used for pagecache, or when a process is created and 
starts using memory.  From the host perspective, it happens just once.

>> It's very different.  The kernel expects an application that touched 
>> page X on node Y to continue using page X on node Y.  Because 
>> applications know this, they are written to this assumption.  However, 
>>     
>
> The far majority of applications do not actually know where memory is. 
>   

In our case, the application is the guest kernel, which does know.

> What matters is that you get local accesses most of the time for the memory
> that is touched on a specific CPU. Even the applications who
> know won't break if it's somewhere else, because it's only
> an optimization. As long as you're faster on average (or in the worst
> case not significantly worse) than not having it you're fine.
>
> Also the Linux first touch is a heuristic that can be wrong
> later, and I don't see too much difference in having another
> heuristic level on top of it.
>   

The difference is, Linux (as a guest) will try to reuse freed pages from 
an application or pagecache, knowing which node they belong to.

I agree that if all you do is HPC style computation (boot a kernel and 
one app with one process per cpu), then the heuristics work well.

> The scheme I described is an approximate heuristic to get local
> memory access in many cases without pinning anything to CPUs.
> It is certainly not perfect and has holes (like any heuristics),
> but it has the advantage of being fully dynamic. 
>   

It also has the advantage of being already implemented (apart from fake 
SRAT tables; and that isn't necessary for HPC apps).

>> in a virtualization context, the guest kernel expects that page X 
>> belongs to whatever node the SRAT table points at, without regard to the 
>> first access.
>>
>> Guest kernels behave differently from applications, because real 
>> hardware doesn't allocate pages dynamically like the kernel can for 
>> applications.
>>     
>
> Again the kernel just wants local memory access most of the time
> for the allocations where it matters.
>
>   

It does.  But the kernel doesn't allocate memory (except for the first 
time); it recycles memory.

> Also NUMA is always an optimization, it's not a big issue if you're
> wrong occasionally because that doesn't affect correctness.
>   

Agreed.

 

>> Mapped once and allocated once (not at the same time, but fairly close).
>>     
>
> That seems dubious to me.
>
>   

That's how it works.

Qemu will mmap() the guest's memory at initialization time.  When the 
guest touches memory, kvm will call get_user_pages_fast() (here's the 
allocation) and instantiate a pte in the qemu address space, as well as 
a shadow pte (using either ept/npt two-level maps or direct shadow maps).

With ept/npt, in the absence of swapping, the story ends here.  Without 
ept/npt, the guest will continue to fault, but now get_user_pages_fast() 
will return the already allocated page from the pte in the qemu address 
space.

>> No.  Linux will assume a page belongs to the node the SRAT table says it 
>> belongs to.  Whether first access will be from the local node depends on 
>> the workload.  If the first application running accesses all memory from 
>> a single cpu, we will allocate all memory from one node, but this is wrong.
>>     
>
> Sorry I don't get your point. 

Yeah, we're talking a bit past each other.  I'll try to expand.

> Wrong doesn't make sense in this context.
>
> You seem to be saying that an allocation that is not local on a native
> kernel wouldn't be local in the approximate heuristic either. 
> But that's a triviality that is of course true and likely not what 
> you meant anyways.
>
>   

I'm saying, that sometimes the guest kernel allocates memory from 
virtual node A but uses it on virtual node B, due to its memory policy 
or perhaps due to resource scarcity.  Unlike a normal application, the 
guest kernel still tracks the page as belonging to node A (even though 
it is used on node B).  Because of this, when the page is recycled, the 
guest kernel will try to assign it to processes running on node A.  But 
the host has allocated it from node B.

When we export a virtual SRAT, we promise to the guest something about 
memory access latency.  The guest will try to optimize according to this 
SRAT, and if we don't fulfil the promise, it will make incorrect decisions.

So long as a page has a single use in the lifetime of the guest, it 
doesn't matter.  But general purpose applications don't behave like that.

>> It's meaningless information.  First access means nothing.  And again, 
>>     
>
> At least in Linux the first access to the majority of memory is 
> either through a process page allocation or through a file cache
> page cache allocation. Yes there are a few boot loader
> and temporary kernel pages for which this is not true, 
> but they are a small insignificant fraction of the total memory
> in a reasonably sized guest. I'm just ignoring them.
>
> This can be often observed in that if you have a broken DIMM
> you only get problems after using some program that uses
> most of your memory.
>
>   

Right.  If you have a single allocate-once workload, the heuristic works.

Windows will zero all memory in the background, btw.

>> Sure, for the simple cases it works.  But consider your first example 
>> followed by the second (you can even reboot the guest in the middle, but 
>> the bad assignment sticks).
>>     
>
> If you have a heuristic to detect remapping you'll recover on each remapping.
>   

We do, for non-ept/npt.

>> We should try to be predictable,
>>     
>
> NUMA is unfortunately somewhat unpredictable even on native kernels.
> There are always situations where a hot page can end up on the wrong node.
> That tends to make benchmarkers unhappy. But so far no good general
> way is known to avoid it.
>   

Doesn't the cache insulate workloads against small numbers of mislocated 
pages?

>> not depend on behavior the guest has no 
>> real reason to follow, if it follows hardware specs.
>>     
>
> Sorry, Avi, I suspect you have a somewhat unrealistic mental model of 
> NUMA knowledge in applications and OS. 
>
>   

That may well be.  I haven't programmed large NUMA apps.

> At least Linux's behaviour (and I assume most NUMA optimized
> OS) will be handled reasonably well by this scheme. I think.
> It was just one proposal.
>   

Well, if it is, we'll find out easily, as that's what we implement right 
now (less migrating-on-remap and providing a virtual SRAT, which isn't 
even really needed).

> Anyways it might still not work well in practice -- the only way
> to find out would be to implement and try -- but I think
> it should not be dismissed out of hand.
>   

I can't dismiss it even if I want to -- it's how kvm works now (well 
except when a device is assigned).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


