From mboxrd@z Thu Jan  1 00:00:00 1970
From: Marcelo Tosatti <marcelo@kvack.org>
Subject: Re: [PATCH] QEMU/KVM: large page support
Date: Sat, 23 Feb 2008 19:54:07 -0300
Message-ID: <20080223225407.GA8465@dmt>
References: <20080223144812.GB1040@dmt> <47C06607.80200@codemonkey.ws>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: Marcelo Tosatti <marcelo@kvack.org>,
	kvm-devel <kvm-devel@lists.sourceforge.net>,
	Avi Kivity <avi@qumranet.com>
To: Anthony Liguori <anthony@codemonkey.ws>
Return-path: <kvm-devel-bounces@lists.sourceforge.net>
Content-Disposition: inline
In-Reply-To: <47C06607.80200@codemonkey.ws>
List-Unsubscribe: <https://lists.sourceforge.net/lists/listinfo/kvm-devel>,
	<mailto:kvm-devel-request@lists.sourceforge.net?subject=unsubscribe>
List-Archive: <http://sourceforge.net/mailarchive/forum.php?forum_name=kvm-devel>
List-Post: <mailto:kvm-devel@lists.sourceforge.net>
List-Help: <mailto:kvm-devel-request@lists.sourceforge.net?subject=help>
List-Subscribe: <https://lists.sourceforge.net/lists/listinfo/kvm-devel>,
	<mailto:kvm-devel-request@lists.sourceforge.net?subject=subscribe>
Sender: kvm-devel-bounces@lists.sourceforge.net
Errors-To: kvm-devel-bounces@lists.sourceforge.net
List-Id: kvm.vger.kernel.org

Hi Anthony, 

Thanks for your comments.

On Sat, Feb 23, 2008 at 12:29:27PM -0600, Anthony Liguori wrote:

> In general, I don't think it causes any real harm if we always align the 
> ram address to a large page boundary.  If we aren't on Linux (and can't 
> determine what the large page size is), we can just set hpagesize to 
> getpagesize().  I think there's a good reason for this that I'll explain 
> below.

I thought about doing that (gets rid of the 4GB+ special casing) but we
lose the ability to compact smaller allocations in a single largepage.

Right now the VGA BIOS and the BIOS fit in the same largepage, for
example.

> > +
> > +void *alloc_huge_area(unsigned long memory, const char *path)
> > +{
> > +	void *area;
> > +	int fd;
> > +	char *filename;
> > +	char *tmpfile = "/kvm.XXXXXX";
> > +
> > +	filename = qemu_malloc(4096);
> > +	if (!filename) 
> > +		return NULL;
> > +
> > +	memset(filename, 0, 4096);
> > +	strncpy(filename, path, 4096 - strlen(tmpfile) - 1);
> > +	strcat(filename, tmpfile);
> > +	
> > +	hpagesize = gethugepagesize() * 1024;
> > +	if (!hpagesize)
> > +		return NULL;
> > +
> > +	mkstemp(filename);
> >   
> 
> mkstemp returns a file descriptor so the following open is not required.

Right, will fix.

> > +	fd = open(filename, O_RDWR);
> > +	if (fd < 0) {
> > +		perror("open");
> > +		hpagesize = 0;
> > +		exit(0);
> > +	}
> > +	memory = (memory+hpagesize-1) & ~(hpagesize-1);
> >   
> 
> I'm a little surprised that hugetlbfs doesn't require an ftruncate() 
> before the mmap().  Does an ftruncate() do any harm?  If so, it would be 
> better to have one.

Its not needed because hugetlbfs automatically adjusts the inode size on
demand.

Don't think it will do any harm and its compliant with normal
filesystems.

> > +	area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
> >   
> 
> I think a MAP_SHARED may have some advantages in that it then becomes 
> possible to pass this file descriptor around to other processes so they 
> can mmap() the same memory region.  I don't know if that works with 
> hugetlbfs but it certainly does with tmpfs.  My thinking is that this 
> code can be made generic so it works with either hugetlbfs or tmpfs.

Yes, makes sense.

> Furthermore, I think it would be interesting if we defaulted to trying 
> to create the memory file in something like /dev/kvm-mem or something 
> more appropriately named.  An administrator can then either mount a 
> hugetlbfs or tmpfs mount there.  We still probably want to provide an 
> option to override where to create the file and we want to be able to 
> fall back to a normal malloc() of course but at least this makes it 
> possible for the distros to set things up so that it Just Work without 
> the user having to understand things like hugetlbfs and tmpfs.

I like the idea. I'll change "-hugetlb-path" to "-mem-path" for now (and
test it with tmpfs). Not so fond of hardcoding a path in QEMU though.

> Maybe we'll even see something that merges the two filesystems in the 
> future so that if a huge page allocation fails, it falls back to 
> creating a normal tmpfs file.  Perhaps that's a reasonable mount option 
> to add to hugetlbfs.

> Instead of registering an atexit() handler, I think it would be better 
> to unlink immediately after doing the mkstemp().  This reduces the 
> possibility of leaking the file in the event of catastrophe (like a kill 
> -SIGKILL).

Right, will fix.

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/