From mboxrd@z Thu Jan  1 00:00:00 1970
From: Paul Brook <paul@codesourcery.com>
Subject: Re: [Qemu-devel] [PATCH QEMU] transparent hugepage support
Date: Thu, 11 Mar 2010 17:55:10 +0000
References: <20100311151427.GE5677@random.random>
	<201003111628.04566.paul@codesourcery.com>
	<20100311164642.GI5677@random.random>
In-Reply-To: <20100311164642.GI5677@random.random>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201003111755.10914.paul@codesourcery.com>
List-Id: qemu-devel.nongnu.org
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: qemu-devel@nongnu.org, Avi Kivity <avi@redhat.com>

> On Thu, Mar 11, 2010 at 04:28:04PM +0000, Paul Brook wrote:
> > > > +		/*
> > > > +		 * Align on HPAGE_SIZE so "(gfn ^ pfn)&
> > > > +		 * (HPAGE_SIZE-1) == 0" to allow KVM to take advantage
> > > > +		 * of hugepages with NPT/EPT.
> > > > +		 */
> > > > +		new_block->host = qemu_memalign(1<<  TARGET_HPAGE_BITS, size);
> >
> > This should not be target dependent. i.e. it should be the host page
> > size.
> 
> Yep I noticed. I'm not aware of an official way to get that
> information out of the kernel (hugepagesize in /proc/meminfo is
> dependent on hugetlbfs which in turn is not a dependency for
> transparent hugepage support) but hey I can add it myself to
> /sys/kernel/mm/transparent_hugepage/hugepage_size !

sysconf(_SC_HUGEPAGESIZE); would seem to be the obvious answer.
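
A rough sketch of what such a query could look like, under stated
assumptions. _SC_HUGEPAGESIZE is used only if the C library happens to
define it (the sketch does not assume that it does). Otherwise it falls
back to the Hugepagesize line in /proc/meminfo, which as noted above
depends on hugetlbfs, and finally to the base page size. The names
parse_hugepagesize and host_hpage_size are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Parse a "Hugepagesize:   2048 kB" entry from meminfo-style text.
 * Returns the size in bytes, or 0 if the entry is not present. */
static unsigned long parse_hugepagesize(const char *meminfo)
{
    const char *p = strstr(meminfo, "Hugepagesize:");
    unsigned long kb;

    if (p && sscanf(p, "Hugepagesize: %lu kB", &kb) == 1)
        return kb * 1024;
    return 0;
}

/* Hypothetical helper: best-effort host huge page size in bytes.
 * Falls back to the base page size when nothing better is known. */
static unsigned long host_hpage_size(void)
{
    unsigned long size = 0;
    FILE *f;

#ifdef _SC_HUGEPAGESIZE /* assumption: only if the libc provides it */
    long v = sysconf(_SC_HUGEPAGESIZE);
    if (v > 0)
        return (unsigned long)v;
#endif

    f = fopen("/proc/meminfo", "r");
    if (f) {
        char line[256];
        while (size == 0 && fgets(line, sizeof line, f))
            size = parse_hugepagesize(line);
        fclose(f);
    }
    return size ? size : (unsigned long)sysconf(_SC_PAGESIZE);
}
```

The parser is kept separate from the file access so that it can be
exercised on a plain string.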
 
> > > That is a little wasteful.  How about a hint to mmap() requesting
> > > proper alignment (MAP_HPAGE_ALIGN)?
> >
> > I'd kinda hope that we wouldn't need to. i.e. the host kernel is smart
> > enough to automatically align large allocations anyway.
> 
> Kernel won't do that, and the main reason is to avoid creating more
> vmas, it's more efficient to waste virtual space and have userland
> allocate more than needed, than ask the kernel alignment and force it
> to create more vmas because of holes generated out of it. virtual
> memory costs nothing.

Huh. That seems unfortunate :-(

> Also khugepaged can later zero out the pte_none regions to create a
> full segment all backed by hugepages, however if we do that khugepaged
> will eat into the free memory space. At the moment I kept khugepaged a
> zero-memory-footprint thing. But I'm currently adding an option called
> collapse_unmapped to allow khugepaged to collapse unmapped pages too
> so if there are only 2/3 pages in the region before the memalign, they
> also can be mapped by a large tlb to allow qemu run faster.

I don't really understand what you're getting at here. Surely a naturally 
aligned block is always going to be easier to defragment than a misaligned 
block.

If the allocation size is not a multiple of the preferred alignment, then you 
probably lose either way, and we shouldn't be requesting increased alignment.
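
One possible reading of that rule as code, as a sketch only: ask for
hugepage alignment when the size is a whole number of hugepages, and
otherwise stay with ordinary page alignment. choose_alignment and
alloc_ram are hypothetical names, and the page sizes are passed in
rather than queried.

```c
#define _POSIX_C_SOURCE 200112L
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical policy: request hugepage alignment only when the
 * allocation is a non-zero whole number of hugepages. */
static size_t choose_alignment(size_t size, size_t page_size,
                               size_t hpage_size)
{
    if (hpage_size > page_size && size >= hpage_size &&
        size % hpage_size == 0)
        return hpage_size;
    return page_size;
}

/* Allocate size bytes with the alignment chosen above.
 * Returns NULL on failure; release with free(). */
static void *alloc_ram(size_t size, size_t page_size, size_t hpage_size)
{
    void *ptr = NULL;
    size_t align = choose_alignment(size, page_size, hpage_size);

    if (posix_memalign(&ptr, align, size) != 0)
        return NULL;
    return ptr;
}
```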

> > This is probably a useful optimization regardless of KVM.
> 
> HPAGE alignment is only useful with KVM because it can only payoff
> with EPT/NPT, transparent hugepage already works fine without that
> (but ok it'd be a microoptimization for the first and last few pages
> in the whole vma). This is why I made it conditional to
> kvm_enabled(). I can remove the kvm_enabled() check if you worry about
> the first and last pages in the huge anon vma.

I wouldn't be surprised if putting the start of guest RAM on a large TLB entry 
was a win. Your guest kernel often lives there!
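
For reference, the alignment condition from the quoted patch comment can
be written as a small predicate. This is only a sketch: hpage_congruent
is a hypothetical name, and it assumes the hugepage size is a power of
two. Note the extra parentheses, since in C == binds more tightly than &.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a guest physical address and the host address backing it
 * have the same offset within a hugepage, i.e. their low bits agree.
 * Assumes hpage_size is a power of two. */
static bool hpage_congruent(uint64_t guest_phys, uint64_t host_addr,
                            uint64_t hpage_size)
{
    return ((guest_phys ^ host_addr) & (hpage_size - 1)) == 0;
}
```

With guest RAM starting at guest physical address 0, the condition holds
exactly when the host block itself starts on a hugepage boundary.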

> OTOH the madvise(MADV_HUGEPAGE) is surely good idea for qemu too. KVM
> normally runs on 64bit hosts, so it's no big deal if we waste 1M of
> virtual memory here and there but I thought on qemu you preferred not
> to have alignment and have the first few and last few pages in a vma
> not backed by large tlb. Ideally we should also align on hpage size if
> sizeof(long) = 8. Not sure what's the recommended way to code that
> though and it'll make it a bit more complex for little good.

Assuming we're allocating in large chunks, I doubt an extra hugepage's worth of 
VMA is a big issue.
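
A minimal sketch of the over-allocation approach discussed above, i.e.
userland reserving one extra hugepage of virtual space and rounding the
start address up. map_aligned and align_up are hypothetical names, the
alignment is assumed to be a power of two, and the madvise() hint is
compiled in only if the headers define MADV_HUGEPAGE.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Map size + align bytes of anonymous memory and return the first
 * align-aligned address inside the mapping. The unused slack stays
 * mapped but is never touched, so it costs virtual space only.
 * Returns NULL on failure. */
static void *map_aligned(size_t size, size_t align)
{
    void *raw = mmap(NULL, size + align, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    void *aligned;

    if (raw == MAP_FAILED)
        return NULL;

    aligned = (void *)align_up((uintptr_t)raw, align);
#ifdef MADV_HUGEPAGE
    /* Hint only; a failure here is deliberately ignored. */
    madvise(aligned, size, MADV_HUGEPAGE);
#endif
    return aligned;
}
```

The sketch keeps the slack mapped instead of trimming it with munmap(),
which avoids splitting the region into additional VMAs.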

Either way I'd argue that this isn't something qemu should have to care about, 
and is actually a bug in posix_memalign.

Paul