public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
From: Anthony Liguori <anthony@codemonkey.ws>
To: john cooper <john.cooper@third-harmonic.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
	kvm@vger.kernel.org, john.cooper@redhat.com
Subject: Re: patch: qemu + hugetlbfs..
Date: Thu, 10 Jul 2008 12:58:07 -0500	[thread overview]
Message-ID: <48764DAF.6060502@codemonkey.ws> (raw)
In-Reply-To: <48763B86.6060402@third-harmonic.com>

john cooper wrote:
> Anthony Liguori wrote:
>> Marcelo Tosatti wrote:
>>> This is Linux's behaviour for all filesystems. There is no error
>>> checking on MAP_POPULATE's attempt to prefault pages.
>>>   
>>
>> I thought that mmap() will fail with hugetlbfs if enough large pages 
>> can't be reserved?
>
> This appears to be the case for MAP_SHARED.  For MAP_PRIVATE
> the accounting is side-stepped, the result of which was the
> best-effort prealloc/populate behavior we had see seen.
>
> Marcelo and I batted this around some yesterday.  And there
> doesn't appear to be any concern using MAP_SHARED for the
> phys_ram map as potential access to the backing file is
> still regulated by filesystem permissions.
>
> Doing so dispenses with the syscall/useraddr hack to fault
> in pages since mmap() fails as expected in the case
> sufficient pages are unavailable immediately to back
> MAP_SHARED pages.  We also decided to defer the map-fallback
> logic.  This is essentially behavior we'd like to see
> provided by kvm, ideally on a dynamic (guest fault) basis
> and was technically beyond the issue we were trying to
> target.
>
> Attached is the resulting patch which also addresses the
> other concerns which arose.
>
> -john
>
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -45,6 +45,7 @@
>  #include <signal.h>
>  #include <time.h>
>  #include <errno.h>
> +#include <asm/param.h>
>   

I don't think this is necessary anymore.  Depending on a Linux headers 
breaks the QEMU build on other unices so it's a bad thing.

>  #include <sys/time.h>
>  #include <zlib.h>
>  
> @@ -234,6 +235,7 @@ int autostart = 1;
>  int time_drift_fix = 0;
>  unsigned int kvm_shadow_memory = 0;
>  const char *mem_path = NULL;
> +int mem_prealloc = 1;	/* force preallocation of physical target memory */
>  int hpagesize = 0;
>  const char *cpu_vendor_string;
>  #ifdef TARGET_ARM
> @@ -7809,7 +7811,10 @@ static void help(int exitcode)
>  #endif
>             "-tdf            inject timer interrupts that got lost\n"
>             "-kvm-shadow-memory megs set the amount of shadow pages to be allocated\n"
> -           "-mem-path       set the path to hugetlbfs/tmpfs mounted directory, also enables allocation of guest memory with huge pages\n"
> +           "-mem-path       set the path to hugetlbfs/tmpfs mounted directory, also\n"
> +           "                enables allocation of guest memory with huge pages\n"
> +           "-mem-prealloc   toggles preallocation of huge page physical memory at\n"
> +           "                startup when -mem-path is specified.  Default is enabled.\n"
>   

This really isn't huge page specific.  It should work equally well with 
tmpfs, no?

>  	   "-option-rom rom load a file, rom, into the option ROM space\n"
>  #ifdef TARGET_SPARC
>             "-prom-env variable=value  set OpenBIOS nvram variables\n"
> @@ -7932,6 +7937,7 @@ enum {
>      QEMU_OPTION_tdf,
>      QEMU_OPTION_kvm_shadow_memory,
>      QEMU_OPTION_mempath,
> +    QEMU_OPTION_mem_prealloc
>  };
>  
>  typedef struct QEMUOption {
> @@ -8059,6 +8065,7 @@ const QEMUOption qemu_options[] = {
>      { "startdate", HAS_ARG, QEMU_OPTION_startdate },
>      { "tb-size", HAS_ARG, QEMU_OPTION_tb_size },
>      { "mem-path", HAS_ARG, QEMU_OPTION_mempath },
> +    { "mem-prealloc", 0, QEMU_OPTION_mem_prealloc },
>      { NULL },
>  };
>  
> @@ -8276,11 +8283,13 @@ static int gethugepagesize(void)
>      return hugepagesize;
>  }
>  
> -void *alloc_mem_area(unsigned long memory, const char *path)
> +/* attempt to allocate memory mmap'ed to mem_path
> + */
> +void *alloc_hpage_mem(unsigned long memory, const char *path)
>  {
>      char *filename;
>      void *area;
> -    int fd;
> +    int fd, flags;
>  
>      if (asprintf(&filename, "%s/kvm.XXXXXX", path) == -1)
>  	return NULL;
> @@ -8308,26 +8317,24 @@ void *alloc_mem_area(unsigned long memor
>       */
>      ftruncate(fd, memory);
>  
> -    area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
> -    if (area == MAP_FAILED) {
> -	perror("mmap");
> -	close(fd);
> -	return NULL;
> -    }
> -
> -    return area;
> +    /* NB: MAP_POPULATE won't exhaustively alloc all phys pages in the case
> +     * MAP_PRIVATE is requested.  For mem_prealloc we mmap as MAP_SHARED
> +     * to sidestep this quirk.
> +     */
> +    flags = mem_prealloc ? MAP_POPULATE|MAP_SHARED : MAP_PRIVATE;
> +    area = mmap(0, memory, PROT_READ|PROT_WRITE, flags, fd, 0);
> +    if (area != MAP_FAILED)
> +	return (area);
> +    perror("alloc_hpage_mem: can't mmap hugetlbfs pages");
> +    close(fd);
> +    return (NULL);
>  }
>  
> -void *qemu_alloc_physram(unsigned long memory)
> +/* allocate guest memory as requested
> + */
> +void *qemu_alloc_physram(unsigned long size)
>  {
> -    void *area = NULL;
> -
> -    if (mem_path)
> -	area = alloc_mem_area(memory, mem_path);
> -    if (!area)
> -	area = qemu_vmalloc(memory);
> -
> -    return area;
> +    return (mem_path ? alloc_hpage_mem(size, mem_path) : qemu_vmalloc(size));
>  }
>   

I'd prefer if you didn't convert from a if/else to conditional.  hpage 
is a misnomer too as we aren't actually dependent on huge pages (this 
code should work equally well for tmpfs).

Regards,

Anthony Liguori

>  int main(int argc, char **argv)
> @@ -8962,6 +8969,9 @@ int main(int argc, char **argv)
>              case QEMU_OPTION_mempath:
>  		mem_path = optarg;
>  		break;
> +            case QEMU_OPTION_mem_prealloc:
> +		mem_prealloc = !mem_prealloc;
> +		break;
>              case QEMU_OPTION_name:
>                  qemu_name = optarg;
>                  break;
>   


  reply	other threads:[~2008-07-10 17:59 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-07-08 22:02 patch: qemu + hugetlbfs john cooper
2008-07-08 23:09 ` Anthony Liguori
2008-07-09  0:23   ` john cooper
2008-07-09  1:08     ` Anthony Liguori
2008-07-09 17:03       ` Marcelo Tosatti
2008-07-09 17:11         ` Anthony Liguori
2008-07-10 16:40           ` john cooper
2008-07-10 17:58             ` Anthony Liguori [this message]
2008-07-10 20:16               ` john cooper
2008-07-10 20:47                 ` Anthony Liguori
2008-07-10 21:12                   ` john cooper
2008-07-10 21:38                     ` Anthony Liguori
2008-08-25 23:05                       ` Resend: " john cooper
2008-08-26  8:11                         ` Avi Kivity
2008-08-27  4:13                           ` john cooper
2009-01-16  2:19                         ` john cooper
2009-01-20 10:29                           ` Avi Kivity
2009-01-23 21:21                             ` john cooper
2009-02-05 15:42                               ` Avi Kivity
2009-02-05 16:12                                 ` Marcelo Tosatti
2009-02-05 16:15                                   ` Avi Kivity

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=48764DAF.6060502@codemonkey.ws \
    --to=anthony@codemonkey.ws \
    --cc=john.cooper@redhat.com \
    --cc=john.cooper@third-harmonic.com \
    --cc=kvm@vger.kernel.org \
    --cc=mtosatti@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox