From: Wanlong Gao <gaowanlong@cn.fujitsu.com>
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, ehabkost@redhat.com, bsd@redhat.com,
pbonzini@redhat.com, y-goto@jp.fujitsu.com, afaerber@suse.de,
gaowanlong@cn.fujitsu.com
Subject: [Qemu-devel] [PATCH V2 5/9] NUMA: set guest numa nodes memory policy
Date: Fri, 21 Jun 2013 14:25:56 +0800
Message-ID: <1371795960-10478-6-git-send-email-gaowanlong@cn.fujitsu.com>
In-Reply-To: <1371795960-10478-1-git-send-email-gaowanlong@cn.fujitsu.com>
Set the guest NUMA nodes' memory policies using the mbind(2)
system call, node by node.

After this patch, we are able to set the guest nodes' memory
policies through QEMU options; this aims to solve the performance
issue of cross-node memory access inside the guest.
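
For reference, the per-node policy setup boils down to the following
minimal user-space sketch (illustrative only: node_len, the nodemask
and the mmap'ed range are placeholders rather than the QEMU symbols
used in the patch below; build with -lnuma):

    #include <numaif.h>        /* mbind(2), MPOL_* constants */
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        /* Placeholder: the RAM range that backs one guest node. */
        size_t node_len = 256UL * 1024 * 1024;
        unsigned long nodemask = 1UL << 0;    /* host node 0 */

        void *ram = mmap(NULL, node_len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (ram == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Apply the policy before the range is first touched, so
         * that the initial fault already allocates pages on the
         * requested host node. */
        if (mbind(ram, node_len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0)) {
            perror("mbind");
            return 1;
        }
        return 0;
    }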
As you all know, if PCI passthrough is used, the directly attached
device does DMA transfers between the device and the qemu process,
and all pages of the guest are pinned by get_user_pages():

KVM_ASSIGN_PCI_DEVICE ioctl
  kvm_vm_ioctl_assign_device()
    => kvm_assign_device()
      => kvm_iommu_map_memslots()
        => kvm_iommu_map_pages()
          => kvm_pin_pages()

So, with a directly attached device, every guest page's reference
count is raised by one, page migration will not work on those pages,
and AutoNUMA won't work either. Therefore we should set the guest
nodes' memory allocation policies before the pages are actually
mapped.
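
Once the policies are applied, the actual placement can be
double-checked with move_pages(2) in query mode. The helper below is
purely illustrative (page_node() is a made-up name and not part of
this series; build with -lnuma):

    #include <numaif.h>    /* move_pages(2) */
    #include <stdio.h>

    /* Report which host node currently backs the given page. With a
     * NULL "nodes" argument, move_pages(2) only queries placement
     * and migrates nothing. */
    static int page_node(void *page)
    {
        void *pages[1] = { page };
        int status = -1;

        if (move_pages(0 /* self */, 1, pages, NULL, &status, 0) < 0) {
            perror("move_pages");
        }
        return status;    /* node id, or a negative errno value */
    }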
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
---
cpus.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 87 insertions(+)
diff --git a/cpus.c b/cpus.c
index e123d3f..677ee15 100644
--- a/cpus.c
+++ b/cpus.c
@@ -60,6 +60,15 @@
#endif /* CONFIG_LINUX */
+#ifdef CONFIG_NUMA
+#include <numa.h>
+#include <numaif.h>
+#ifndef MPOL_F_RELATIVE_NODES
+#define MPOL_F_RELATIVE_NODES (1 << 14)
+#define MPOL_F_STATIC_NODES (1 << 15)
+#endif
+#endif
+
static CPUArchState *next_cpu;
static bool cpu_thread_is_idle(CPUArchState *env)
@@ -1186,6 +1195,75 @@ static void tcg_exec_all(void)
exit_request = 0;
}
+#ifdef CONFIG_NUMA
+static int node_parse_bind_mode(unsigned int nodeid)
+{
+    int bind_mode;
+
+    switch (numa_info[nodeid].flags & NODE_HOST_POLICY_MASK) {
+    case NODE_HOST_BIND:
+        bind_mode = MPOL_BIND;
+        break;
+    case NODE_HOST_INTERLEAVE:
+        bind_mode = MPOL_INTERLEAVE;
+        break;
+    case NODE_HOST_PREFERRED:
+        bind_mode = MPOL_PREFERRED;
+        break;
+    default:
+        bind_mode = MPOL_DEFAULT;
+        return bind_mode;
+    }
+
+    bind_mode |= (numa_info[nodeid].flags & NODE_HOST_RELATIVE) ?
+                 MPOL_F_RELATIVE_NODES : MPOL_F_STATIC_NODES;
+
+    return bind_mode;
+}
+#endif
+
+static int set_node_mpol(unsigned int nodeid)
+{
+#ifdef CONFIG_NUMA
+    void *ram_ptr;
+    RAMBlock *block;
+    ram_addr_t len, ram_offset = 0;
+    int bind_mode;
+    int i;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        if (!strcmp(block->mr->name, "pc.ram")) {
+            break;
+        }
+    }
+
+    if (!block || block->host == NULL)
+        return -1;
+
+    ram_ptr = block->host;
+    for (i = 0; i < nodeid; i++) {
+        len = numa_info[i].node_mem;
+        ram_offset += len;
+    }
+
+    len = numa_info[nodeid].node_mem;
+    bind_mode = node_parse_bind_mode(nodeid);
+
+    /* This is a workaround for a long-standing bug in Linux's
+     * mbind implementation, which cuts off the last specified
+     * node. To stay compatible should this bug be fixed, we
+     * specify one more node and zero this one out.
+     */
+    clear_bit(numa_num_configured_nodes(), numa_info[nodeid].host_mem);
+    if (mbind(ram_ptr + ram_offset, len, bind_mode,
+              numa_info[nodeid].host_mem, numa_num_configured_nodes() + 1, 0)) {
+        perror("mbind");
+        return -1;
+    }
+#endif
+    return 0;
+}
+
void set_numa_modes(void)
{
CPUArchState *env;
@@ -1200,6 +1278,15 @@ void set_numa_modes(void)
}
}
}
+
+#ifdef CONFIG_NUMA
+    for (i = 0; i < nb_numa_nodes; i++) {
+        if (set_node_mpol(i) == -1) {
+            fprintf(stderr,
+                    "qemu: can't set host memory policy for node%d\n", i);
+        }
+    }
+#endif
}
void list_cpus(FILE *f, fprintf_function cpu_fprintf, const char *optarg)
--
1.8.3.1.448.gfb7dfaa
Thread overview: 11+ messages
2013-06-21 6:25 [Qemu-devel] [PATCH V2 0/9] Add support for binding guest numa nodes to host numa nodes Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 1/9] NUMA: Support multiple CPU ranges on -numa option Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 2/9] NUMA: Add numa_info structure to contain numa nodes info Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 3/9] NUMA: Add Linux libnuma detection Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 4/9] NUMA: parse guest numa nodes memory policy Wanlong Gao
2013-06-21 15:34 ` Bandan Das
2013-06-21 6:25 ` Wanlong Gao [this message]
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 6/9] NUMA: handle Error in mpol and hostnode parser Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 7/9] NUMA: add qmp command set-mpol to set memory policy for NUMA node Wanlong Gao
2013-06-21 6:25 ` [Qemu-devel] [PATCH V2 8/9] NUMA: add hmp command set-mpol Wanlong Gao
2013-06-21 6:26 ` [Qemu-devel] [PATCH V2 9/9] NUMA: show host memory policy info in info numa command Wanlong Gao