From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Sun, 30 Nov 2008 20:07:25 +0200
Message-ID: <4932D65D.8000509@redhat.com>
References: <87fxlcxo62.fsf@basil.nowhere.org> <49318D57.6040601@redhat.com> <20081129201032.GT6703@one.firstfloor.org> <4931A77F.6050505@redhat.com> <20081130154145.GU6703@one.firstfloor.org> <4932B37A.3070305@redhat.com> <20081130160539.GV6703@one.firstfloor.org> <4932C176.7020102@redhat.com> <20081130170414.GX6703@one.firstfloor.org> <4932C94C.1090504@redhat.com> <20081130174250.GY6703@one.firstfloor.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andre Przywara , kvm@vger.kernel.org
To: Andi Kleen
Return-path: 
Received: from mx2.redhat.com ([66.187.237.31]:54284 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751471AbYK3SHd (ORCPT ); Sun, 30 Nov 2008 13:07:33 -0500
In-Reply-To: <20081130174250.GY6703@one.firstfloor.org>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Andi Kleen wrote:
>>> I was more thinking about some heuristics that check when a page
>>> is first mapped into user space. The only problem is that it is zeroed
>>> through the direct mapping before, but perhaps there is a way around it.
>>> That's one of the rare cases when 32bit highmem actually makes things
>>> easier.
>>> It might also be easier on some OS other than Linux which doesn't use
>>> the direct mapping that aggressively.
>>>
>> In the context of kvm, the mmap() calls happen before the guest ever
>
> The mmap call doesn't matter at all, what matters is when the
> page is allocated.
>

The page is allocated at an uninteresting point in time. For example, the boot loader allocates a bunch of pages.

>> executes. First access happens somewhat later, but still we cannot
>> count on the majority of accesses to come from the same cpu as the first
>> access.
>
> It is a reasonable heuristic. It's just like the rather
> successful default local allocation heuristic the native kernel uses.
>

It's very different. The kernel expects an application that touched page X on node Y to continue using page X on node Y. Because applications know this, they are written to this assumption.

However, in a virtualization context, the guest kernel expects page X to belong to whatever node the SRAT table points at, regardless of first access. Guest kernels behave differently from applications, because real hardware doesn't allocate pages dynamically the way the kernel can for applications.

(btw, what do you do with cpu-less nodes? I think some SGI hardware has them)

>>> The alternative is to keep your own pools and allocate from the
>>> correct pool, but then you either need pinning or getcpu()
>>>
>> This is meaningless in kvm context. Other than small bits of memory
>> needed for I/O and shadow page tables, the bulk of memory is allocated
>> once.
>
> Mapped once. Anyways that could be changed too if there was need.
>

Mapped once and allocated once (not at the same time, but fairly close). We can't change it without changing the guest.

>>> Basic algorithm:
>>> - If guest touches a virtual node that is the same as the local node
>>>   of the current vcpu, assume it's a local allocation.
>>>
>> The guest is not making the same assumption; lying to the guest is
>
> Huh? Pretty much all NUMA aware OSes should. Linux will definitely.
>

No. Linux will assume a page belongs to the node the SRAT table says it belongs to. Whether the first access comes from the local node depends on the workload. If the first application to run accesses all memory from a single cpu, we will allocate all memory from one node, which is wrong.

>> (2) even without npt/ept, we have no idea how often mappings are used
>> and by which cpu. Finding out is expensive.
>
> You see a fault on the first mapping.
> That fault is on the CPU that
> did the access. Therefore you know which one it was.
>

It's meaningless information. First access means nothing, and again, the guest doesn't expect the page to move to the node from which it was first touched.

(we also see the first access with ept)

>> (3) for many workloads, there are no unused pages. The guest
>> application allocates all memory and manages it by itself.
>
> First, a common case of a guest using all memory is file cache,
> but for NUMA purposes file cache locality typically doesn't
> matter, because it's not accessed frequently enough for
> non-locality to be a problem. It really only matters for mappings
> that are used often by the CPU.
>
> When a single application allocates everything and keeps it, that is fine
> too, because you'll give it approximately local memory on the initial
> setup (assuming the application has reasonable NUMA behaviour by itself
> under a first-touch local allocation policy).
>

Sure, for the simple cases it works. But consider your first example followed by the second (you can even reboot the guest in the middle, but the bad assignment sticks). And if the vcpu moves for some reason, things get screwed up permanently.

We should try to be predictable, not depend on behavior the guest has no real reason to follow if it follows hardware specs.

-- 
error compiling committee.c: too many arguments to function