From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: [PATCH 0/3] KVM-userspace: add NUMA support for guests
Date: Sat, 29 Nov 2008 22:35:11 +0200
Message-ID: <4931A77F.6050505@redhat.com>
References: <492F1DD9.8030901@amd.com> <87fxlcxo62.fsf@basil.nowhere.org> <49318D57.6040601@redhat.com> <20081129201032.GT6703@one.firstfloor.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Andre Przywara , kvm@vger.kernel.org
To: Andi Kleen
Return-path: 
Received: from mx2.redhat.com ([66.187.237.31]:39788 "EHLO mx2.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752103AbYK2UfH (ORCPT ); Sat, 29 Nov 2008 15:35:07 -0500
In-Reply-To: <20081129201032.GT6703@one.firstfloor.org>
Sender: kvm-owner@vger.kernel.org
List-ID: 

Andi Kleen wrote:
> On Sat, Nov 29, 2008 at 08:43:35PM +0200, Avi Kivity wrote:
>
>> Andi Kleen wrote:
>>
>>> It depends -- it's not necessarily an improvement. e.g. if it leads to
>>> some CPUs being idle while others are oversubscribed because of the
>>> pinning, you typically lose more than you win. In general, default
>>> pinning is a bad idea in my experience.
>>>
>>> Alternative, more flexible strategies:
>>>
>>> - Do a mapping from CPU to node at runtime by using getcpu()
>>> - Migrate to complete nodes using migrate_pages when qemu detects
>>>   node migration on the host.
>>>
>> Wouldn't that cause lots of migrations?  Migrating a 1GB guest can take
>
> I assume you mean the second one (the two points were orthogonal).
> The first one is an approximate method; it also has advantages
> and disadvantages.

I don't think the first one works without the second.  Calling getcpu()
on startup is meaningless since the initial placement doesn't take the
current workload into account.

>> a huge amount of cpu time (tens or even hundreds of milliseconds?)
>> compared to very high frequency activity like the scheduler.
>
> Yes, migration is expensive, although you can do it on demand of course,
> but the scheduler typically has pretty strong cpu affinity so it shouldn't
> happen too often. Also it's only a temporary cost compared to the
> endless overhead of running forever non-local, or running forever with
> some cores idle.
>
> Another strategy would be to tune the load balancer in the scheduler
> for this case and make it only migrate in extreme situations.
>
> Anyway, it's not ideal either, but in my mind it would all be preferable
> to default CPU pinning.

I agree we need something dynamic, and that we need to tie cpu affinity
and memory affinity together.  This could happen completely in the
kernel (not an easy task), or by having a second-level scheduler in
userspace polling for cpu usage and rebalancing processes across numa
nodes.  Given that with virtualization you have a few long-lived
processes, this does not seem too difficult.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
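[Editor's illustration, not part of the original mail.]  A minimal sketch of
the two strategies Andi names above -- mapping the current CPU to a node at
runtime via getcpu() (here the glibc sched_getcpu() wrapper), and following
the guest's memory with migrate_pages() -- using the libnuma v2 API.  The
names guest_mem_node and follow_cpu_node() are invented for the example, the
bookkeeping is deliberately naive, and the code assumes a single process
whose pages all live on one node; link with -lnuma.

#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical bookkeeping: the node the guest's memory currently sits on. */
static int guest_mem_node;

/* Strategy 1: ask where this vcpu thread runs right now (getcpu);
 * strategy 2: if that is a different node, follow it with migrate_pages. */
static void follow_cpu_node(void)
{
    int cpu  = sched_getcpu();          /* glibc wrapper around getcpu() */
    int node = numa_node_of_cpu(cpu);   /* libnuma: cpu -> node */

    if (cpu < 0 || node < 0 || node == guest_mem_node)
        return;

    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();

    numa_bitmask_setbit(from, guest_mem_node);
    numa_bitmask_setbit(to, node);

    /* The expensive step discussed in the thread: move this process's pages. */
    if (numa_migrate_pages(getpid(), from, to) < 0)
        perror("migrate_pages");
    else
        guest_mem_node = node;

    numa_free_nodemask(from);
    numa_free_nodemask(to);
}

int main(void)
{
    if (numa_available() < 0)
        return 1;                       /* no NUMA support on this kernel */
    guest_mem_node = numa_node_of_cpu(sched_getcpu());
    follow_cpu_node();
    return 0;
}

Avi's objection applies directly: calling this once at startup tells you
nothing useful; it only pays off when invoked periodically, which brings
back the migration-cost question.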
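[Editor's illustration, not part of the original mail.]  A similarly hedged
sketch of the per-guest step a second-level userspace scheduler could take
once its polling pass has picked a target node for a long-lived guest
process: pin the process to the node's CPUs, then migrate its pages there,
tying cpu affinity and memory affinity together as suggested above.  The
function rebalance_guest() and the choice of pid/target_node are invented
for the example; the polling and node-selection policy is left out, and the
libnuma v2 bitmask API is assumed.

#define _GNU_SOURCE
#include <numa.h>
#include <numaif.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin one guest process to target_node's cpus, then move its memory there. */
static int rebalance_guest(pid_t pid, int target_node)
{
    struct bitmask *cpus = numa_allocate_cpumask();
    struct bitmask *from = numa_allocate_nodemask();
    struct bitmask *to   = numa_allocate_nodemask();
    cpu_set_t set;
    unsigned int cpu;
    int ret = -1;

    /* 1. cpu affinity: restrict the guest to the target node's cpus. */
    CPU_ZERO(&set);
    if (numa_node_to_cpus(target_node, cpus) < 0)
        goto out;
    for (cpu = 0; cpu < cpus->size; cpu++)
        if (numa_bitmask_isbitset(cpus, cpu))
            CPU_SET(cpu, &set);
    if (sched_setaffinity(pid, sizeof(set), &set) < 0) {
        perror("sched_setaffinity");
        goto out;
    }

    /* 2. memory affinity: migrate the guest's pages to the same node.
     * This is the expensive part the thread worries about, so a real
     * rebalancer would rate-limit how often any one guest is moved. */
    numa_bitmask_setall(from);              /* from wherever they are now */
    numa_bitmask_setbit(to, target_node);
    if (numa_migrate_pages(pid, from, to) < 0)
        perror("migrate_pages");
    else
        ret = 0;
out:
    numa_free_nodemask(from);
    numa_free_nodemask(to);
    numa_free_cpumask(cpus);
    return ret;
}

A rebalancer built this way would call rebalance_guest(qemu_pid,
least_loaded_node) from its polling loop only when the imbalance persists,
which is feasible precisely because virtualization hosts run a few
long-lived processes.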