From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexander Graf <agraf@suse.de>
Subject: Re: [PATCH 0/3][RFC] NUMA: add host side pinning
Date: Mon, 28 Jun 2010 18:26:01 +0200
Message-ID: <4C28CD19.9000001@suse.de>
References: <1277327377-29629-1-git-send-email-andre.przywara@amd.com> <4C2288DD.3020207@codemonkey.ws> <865764AB-4E51-4ED4-8832-AED6A237A9D3@suse.de> <4C233A6D.7030805@amd.com> <4C233DAB.60106@redhat.com> <4C2342D1.4090103@amd.com> <4C234493.2050408@redhat.com> <4C28CBC5.80109@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Cc: Avi Kivity <avi@redhat.com>,
	Andre Przywara <andre.przywara@amd.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
To: Anthony Liguori <anthony@codemonkey.ws>
Return-path: <kvm-owner@vger.kernel.org>
Received: from cantor2.suse.de ([195.135.220.15]:55595 "EHLO mx2.suse.de"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751029Ab0F1Q0D (ORCPT <rfc822;kvm@vger.kernel.org>);
	Mon, 28 Jun 2010 12:26:03 -0400
In-Reply-To: <4C28CBC5.80109@codemonkey.ws>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

Anthony Liguori wrote:
> On 06/24/2010 06:42 AM, Avi Kivity wrote:
>> On 06/24/2010 02:34 PM, Andre Przywara wrote:
>>>> Non-anonymous memory doesn't work well with ksm and transparent
>>>> hugepages.  Is it possible to use anonymous memory rather than file
>>>> backed?
>>>
>>> I'd prefer non-file backed, too. But that is how the current huge
>>> pages implementation is done. We could use MAP_HUGETLB and declare
>>> NUMA _and_ huge pages as 2.6.32+ only. Unfortunately I didn't find
>>> an easy way to detect the presence of the MAP_HUGETLB flag. If the
>>> kernel does not support it, it seems that mmap silently ignores it
>>> and uses 4KB pages instead.
>>
>> That sucks, unfortunately it is normal practice.  However it is a
>> soft failure, everything works just a bit slower.  So it's probably
>> acceptable.
>>
>>>>> To avoid this I'd like to see the pinning done from within QEMU. I
>>>>> am not sure whether calling numactl via system() and friends is
>>>>> OK, I'd prefer to run the syscalls directly (like in patch 3/3)
>>>>> and pull the necessary options into the -numa pin,... command
>>>>> line. We could mimic numactl's syntax here.
>>>>
>>>> Definitely not use system(), but IIRC numactl has a library interface?
>>> Right, that is what I include in patch 3/3 and use. I got the
>>> impression Anthony wanted to avoid reimplementing parts of numactl,
>>> especially enabling the full flexibility of the command line
>>> interface (like specifying nodes, policies and interleaving).
>>> I want QEMU to use the library and pull the necessary options into
>>> the -numa pin,... parsing, even if this means duplicating numactl
>>> functionality.
>>>
>>
>> I agree with that.  It's a lot easier to use a single tool than to
>> try to integrate things yourself, the unix tradition of grep | sort |
>> uniq -c | sort -n notwithstanding.  Especially when one of the tools
>> is qemu.
>
> I couldn't disagree more here.  This is why we don't support CPU pinning
> and instead provide PID information for each VCPU thread.
>
> The folks that want to use pinning are not novice users.  They are not
> going to be happy unless you can make full use of existing tools. 
> That means replicating all of numactl's functionality (which is not
> what the current patches do) or enabling numactl to be used with a guest.

So how about some QMP plumbing that would allow numactl to create the
VMs at defined ranges? So you'd basically get numactl --run-qemu --
qemu-kvm -blah -foo
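
As a rough sketch of the "numactl wraps qemu" direction, assuming
nothing beyond numactl's existing --cpunodebind and --membind options:
the helper name numactl_wrap and the qemu-kvm arguments are made up for
illustration. It binds the whole QEMU process only; placing individual
guest memory ranges is the part that would still need extra plumbing.

```python
import shlex


def numactl_wrap(qemu_argv, cpu_nodes, mem_nodes):
    """Prefix a QEMU command line with a numactl binding.

    Hypothetical helper: it only builds the argv list, it does not run it.
    """
    if not qemu_argv:
        raise ValueError("empty QEMU command line")
    cpus = ",".join(str(n) for n in cpu_nodes)
    mems = ",".join(str(n) for n in mem_nodes)
    # "--" ends numactl's own option parsing so QEMU's options pass through.
    return ["numactl", f"--cpunodebind={cpus}", f"--membind={mems}", "--"] + list(qemu_argv)


if __name__ == "__main__":
    # Example guest options are placeholders, not a recommended configuration.
    cmd = numactl_wrap(["qemu-kvm", "-m", "4096", "-smp", "4"],
                       cpu_nodes=[0], mem_nodes=[0])
    print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run() unchanged, so the
admin keeps numactl's own syntax rather than a QEMU reimplementation of it.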

Alex