From: Anthony Liguori <anthony@codemonkey.ws>
To: john cooper <john.cooper@third-harmonic.com>
Cc: kvm@vger.kernel.org, mtosatti@redhat.com, john.cooper@redhat.com
Subject: Re: patch: qemu + hugetlbfs..
Date: Tue, 08 Jul 2008 18:09:09 -0500 [thread overview]
Message-ID: <4873F395.6030209@codemonkey.ws> (raw)
In-Reply-To: <4873E400.4000409@third-harmonic.com>
john cooper wrote:
> Attached is a patch to qemu which allows preallocation of huge
> pages at startup time. Also in cases where sufficient huge
> pages are not available, the user may request to selectively fall
> back to 4K pages for those portions of phys_mem which couldn't be
> backed by huge pages. This patch is relative to kvm-70.
>
> The motivation for this arose from odd behavior seen in qemu
> when access to huge page backed phys_mem failed during startup
> (eg: loading the bios), and during runtime where a guest will
> terminate via signal if a free hpage isn't available to satisfy
> a guest page fault.
>
> Two flags are introduced:
>
> -mem-prealloc [0|1]
>
> Attempt to fault-in all hpages and if all can be allocated
> proceed as normal. If all hpages can't be resolved and
> -mem-fallback behavior wasn't is requested, qemu will error
> exit. This behavior is enabled by default when -mem-path
> is specified.
>
> -mem-fallback [0|1]
>
> In the case some hpages can't be allocated, mapping a
> series of 4K pages will be attempted in place of them.
> In the case zero hpages were allocated a diagnostic to
> this effect will be printed (flagging scenarios such
> as /proc/sys/vm/nr_hugepages failing to be setup) and
> the normal non-hpage 4K allocation will be attempted.
> This behavior by default is disabled. Recognition of
> -mem-fallback will only occur when -mem-path is
> specified and -mem-prealloc behavior is enabled.
>
> This is believed to be an interim bandaid until such support
> is available from within the kernel+kvm, ideally allowing
> fallback to 4K pages on a dynamic basis and ultimately allowing
> huge pages faults to reclaim huge pages from other users in the
> system.
>
> -john
>
>
> Signed-off-by: john cooper <john.cooper <at> redhat.com>
>
>
> --- a/qemu/vl.c
> +++ b/qemu/vl.c
> @@ -45,6 +45,7 @@
> #include <signal.h>
> #include <time.h>
> #include <errno.h>
> +#include <asm/param.h>
> #include <sys/time.h>
> #include <zlib.h>
>
> @@ -160,6 +161,15 @@ int inet_aton(const char *cp, struct in_
> /* XXX: use a two level table to limit memory usage */
> #define MAX_IOPORTS 65536
>
> +/* assertion checking
> + */
> +#define ASSERT(c) if (!(c)) _assert(#c, __FILE__, __LINE__)
> +
> +static inline void _assert(char *text, char *file, int line) {
> + fprintf(stderr, "ASSERTION FAILED: [%s] %s:%d\n", text, file, line);
> + kill(getpid(), 12); /* dump core */
> +}
> +
>
Is it really necessary to add this as part of this patch?
> const char *bios_dir = CONFIG_QEMU_SHAREDIR;
> const char *bios_name = NULL;
> void *ioport_opaque[MAX_IOPORTS];
> @@ -234,6 +244,8 @@ int autostart = 1;
> int time_drift_fix = 0;
> unsigned int kvm_shadow_memory = 0;
> const char *mem_path = NULL;
> +int mem_prealloc = {1}; /* force preallocation of physical target memory */
> +int mem_fallback = {0}; /* allow alloc from alternate page size */
>
This is a bizarre way to initialize an integer. I had no idea you could
even do this for a scalar. Just initialize to 1 and 0.
> int hpagesize = 0;
> const char *cpu_vendor_string;
> #ifdef TARGET_ARM
> @@ -7809,7 +7821,12 @@ static void help(int exitcode)
> #endif
> "-tdf inject timer interrupts that got lost\n"
> "-kvm-shadow-memory megs set the amount of shadow pages to be allocated\n"
> - "-mem-path set the path to hugetlbfs/tmpfs mounted directory, also enables allocation of guest memory with huge pages\n"
> + "-mem-path set the path to hugetlbfs/tmpfs mounted directory, also\n"
> + " enables allocation of guest memory with huge pages\n"
> + "-mem-prealloc forces preallocation of huge page physical memory\n"
> + " at startup when -mem-path is specified. Default 1 (on)\n"
> + "-mem-fallback enables fallback to conventional pages as needed\n"
> + " when -mem-prealloc is enabled. Default: 0 (off)\n"
> "-option-rom rom load a file, rom, into the option ROM space\n"
> #ifdef TARGET_SPARC
> "-prom-env variable=value set OpenBIOS nvram variables\n"
> @@ -7932,6 +7949,8 @@ enum {
> QEMU_OPTION_tdf,
> QEMU_OPTION_kvm_shadow_memory,
> QEMU_OPTION_mempath,
> + QEMU_OPTION_mem_prealloc,
> + QEMU_OPTION_mem_fallback,
> };
>
> typedef struct QEMUOption {
> @@ -8059,6 +8078,8 @@ const QEMUOption qemu_options[] = {
> { "startdate", HAS_ARG, QEMU_OPTION_startdate },
> { "tb-size", HAS_ARG, QEMU_OPTION_tb_size },
> { "mem-path", HAS_ARG, QEMU_OPTION_mempath },
> + { "mem-prealloc", HAS_ARG, QEMU_OPTION_mem_prealloc },
> + { "mem-fallback", HAS_ARG, QEMU_OPTION_mem_fallback },
> { NULL },
> };
>
> @@ -8276,11 +8297,36 @@ static int gethugepagesize(void)
> return hugepagesize;
> }
>
> -void *alloc_mem_area(unsigned long memory, const char *path)
> +/* fault in associated page, fail syscall when free page is unavailable
> + */
> +static inline int fault_in(void *a)
> +{
> + return (gettimeofday(a, NULL) < 0 ? errno != EFAULT : 1);
> +}
>
I don't like this very much. Why not just MAP_POPULATE?
> +/* we failed to fault in hpage *a, fall back to conventional page mapping
> + */
> +int remap_hpage(void *a, int sz)
> +{
> + ASSERT(!(sz & (EXEC_PAGESIZE - 1)));
> + if (munmap(a, sz) < 0)
> + perror("remap_hpage: munmap");
> + else if (mmap(a, sz, PROT_READ|PROT_WRITE,
> + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0) == MAP_FAILED)
> + perror("remap_hpage: mmap");
> + else
> + return (1);
> + return (0);
> +}
>
I think this would be simplified with MAP_POPULATE since you can fail in
large chunks of memory instead of potentially having a highly fragmented
set of VMAs.
> +/* attempt to allocate memory mmap'ed to mem_path
> + */
> +void *alloc_hpage_mem(unsigned long memory, const char *path)
> {
> char *filename;
> - void *area;
> - int fd;
> + void *area, *a;
> + int fd, fail;
> + long i;
>
> if (asprintf(&filename, "%s/kvm.XXXXXX", path) == -1)
> return NULL;
> @@ -8308,27 +8354,80 @@ void *alloc_mem_area(unsigned long memor
> */
> ftruncate(fd, memory);
>
> + fail = 0;
> area = mmap(0, memory, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
> - if (area == MAP_FAILED) {
> - perror("mmap");
> - close(fd);
> - return NULL;
> - }
> + if (area == MAP_FAILED)
> + ++fail, perror("alloc_hpage_mem: hugetlbfs mmap");
>
> - return area;
> + /* success of mmap() doesn't guarantee phys pages are present in the
> + * map. if requested we attempt to fault in all associated huge pages * and where unable if requested, selectively revert back to a mapping
> + * of conventional sized pages
> + */
> + ASSERT(!((unsigned long)area & (hpagesize - 1)));
> + if (mem_prealloc)
> + for (a = area, i = memory / hpagesize; !fail && 0 < i--; a += hpagesize)
> + if (fault_in(a))
> + ; /* hpage is mapped */
> + else if (!mem_fallback)
> + ++fail;
> + else if (!remap_hpage(a, hpagesize))
> + ++fail;
> + if (!fail)
> + return (area);
> + if (area != MAP_FAILED)
> + munmap(area, memory);
> + close(fd);
> + return (NULL);
> }
>
> +/* allocate from hpage mem if requested and to the extent possible, else
> + * revert to conventional page allocation.
> + */
> void *qemu_alloc_physram(unsigned long memory)
> {
> void *area = NULL;
>
> - if (mem_path)
> - area = alloc_mem_area(memory, mem_path);
> - if (!area)
> - area = qemu_vmalloc(memory);
> + if (mem_path && !(area = alloc_hpage_mem(memory, mem_path)))
> + fprintf(stderr, "warning: couldn't alloc memory as requested from %s\n",
> + mem_path);
> + return (!area && mem_fallback ? qemu_vmalloc(memory) : area);
> +}
>
> - return area;
> +/* match string *s in pattern *p. *p is of the form: [<str> | [<str> '|']*]
> + * and <str> is a string to be matched which may be empty.
> + * Return pointer to tail of matched *p for success, NULL otherwise.
> + */
> +char const *altmatch(const char *p, const char *s)
> +{
> + const char *bp, *q;
> +
> + for (bp = p, q = s; ; )
> + if ((!*p || *p == '|') && !*q)
> + return (q == s && bp != p ? NULL : p);
> + else if (!*p)
> + return (0);
> + else if (*p == *q)
> + ++p, ++q;
> + else { /* failed, try next field */
> + for (q = s; *p && *p != '|'; ++p)
> + ;
> + if (*p == '|')
> + bp = ++p;
> + }
> +}
> +
> +int bool_arg(const char *optarg, const char *flag_name)
> +{
> + if (!optarg)
> + ;
> + else if (altmatch("y|yes|1", optarg))
> + return (1);
> + else if (altmatch("n|no|0", optarg))
> + return (0);
> + fprintf(stderr, "usage: %s [0|1]\n", flag_name);
> + exit(1);
> }
> +
>
This is introducing too much infrastructure. Just follow the
conventions in the rest of the code. No need to make the options take a
boolean argument. The options themselves are boolean arguments.
Regards,
Anthony Liguori
> int main(int argc, char **argv)
> {
> @@ -8962,6 +9061,12 @@ int main(int argc, char **argv)
> case QEMU_OPTION_mempath:
> mem_path = optarg;
> break;
> + case QEMU_OPTION_mem_prealloc:
> + mem_prealloc = bool_arg(optarg, "mem_prealloc");
> + break;
> + case QEMU_OPTION_mem_fallback:
> + mem_fallback = bool_arg(optarg, "mem_fallback");
> + break;
> case QEMU_OPTION_name:
> qemu_name = optarg;
> break;
>
next prev parent reply other threads:[~2008-07-08 23:09 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-07-08 22:02 patch: qemu + hugetlbfs john cooper
2008-07-08 23:09 ` Anthony Liguori [this message]
2008-07-09 0:23 ` john cooper
2008-07-09 1:08 ` Anthony Liguori
2008-07-09 17:03 ` Marcelo Tosatti
2008-07-09 17:11 ` Anthony Liguori
2008-07-10 16:40 ` john cooper
2008-07-10 17:58 ` Anthony Liguori
2008-07-10 20:16 ` john cooper
2008-07-10 20:47 ` Anthony Liguori
2008-07-10 21:12 ` john cooper
2008-07-10 21:38 ` Anthony Liguori
2008-08-25 23:05 ` Resend: " john cooper
2008-08-26 8:11 ` Avi Kivity
2008-08-27 4:13 ` john cooper
2009-01-16 2:19 ` john cooper
2009-01-20 10:29 ` Avi Kivity
2009-01-23 21:21 ` john cooper
2009-02-05 15:42 ` Avi Kivity
2009-02-05 16:12 ` Marcelo Tosatti
2009-02-05 16:15 ` Avi Kivity
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4873F395.6030209@codemonkey.ws \
--to=anthony@codemonkey.ws \
--cc=john.cooper@redhat.com \
--cc=john.cooper@third-harmonic.com \
--cc=kvm@vger.kernel.org \
--cc=mtosatti@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox