From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xen.org
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
Keir Fraser <keir@xen.org>,
Ian Campbell <Ian.Campbell@citrix.com>,
Li Yechen <lccycc123@gmail.com>,
George Dunlap <george.dunlap@eu.citrix.com>,
Andrew Cooper <Andrew.Cooper3@citrix.com>,
Juergen Gross <juergen.gross@ts.fujitsu.com>,
Ian Jackson <Ian.Jackson@eu.citrix.com>,
Jan Beulich <JBeulich@suse.com>,
Justin Weaver <jtweaver@hawaii.edu>,
Daniel De Graaf <dgdegra@tycho.nsa.gov>,
Matt Wilson <msw@amazon.com>,
Elena Ufimtseva <ufimtseva@gmail.com>
Subject: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Date: Tue, 05 Nov 2013 15:36:17 +0100 [thread overview]
Message-ID: <20131105143617.30446.53267.stgit@Solace> (raw)
In-Reply-To: <20131105142844.30446.78671.stgit@Solace>
in a similar way to how it is possible to specify vcpu-affinity.
Manual page is updated accordingly.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
docs/man/xl.cfg.pod.5 | 70 +++++++++++++++++++++++++++++++++------
tools/libxl/libxl_dom.c | 18 ++++++----
tools/libxl/libxl_numa.c | 14 +-------
tools/libxl/libxl_utils.h | 12 ++++++-
tools/libxl/xl_cmdimpl.c | 80 ++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 159 insertions(+), 35 deletions(-)
diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 1c98cb4..1212426 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -144,18 +144,64 @@ run on cpu #3 of the host.
=back
If this option is not specified, no vcpu to cpu pinning is established,
-and the vcpus of the guest can run on all the cpus of the host.
-
-If we are on a NUMA machine (i.e., if the host has more than one NUMA
-node) and this option is not specified, libxl automatically tries to
-place the guest on the least possible number of nodes. That, however,
-will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus, it will just prefer the ones from the node it has been
-placed on. A heuristic approach is used for choosing the best node (or
-set of nodes), with the goals of maximizing performance for the guest
-and, at the same time, achieving efficient utilization of host cpus
-and memory. See F<docs/misc/xl-numa-placement.markdown> for more
-details.
+and the vcpus of the guest can run on all the cpus of the host. If this
+option is specified, and no B<nodes=> option present the vcpu pinning
+mask of each vcpu is utilized to compute its vcpu node-affinity, and the
+union of all the vcpus node-affinities is what constitutes the domain
+node-affinity (which drives memory allocations).
+
+=back
+
+=item B<nodes="NODE-LIST">
+
+List of on which NUMA nodes the memory for the guest is allocated. This
+also means (starting from Xen 4.3 and if the credit scheduler is used)
+the vcpus of the domain prefers to run on the those NUMA nodes. Default is
+xl (via libxl) guesses. A C<NODE-LIST> may be specified as follows:
+
+=item "all"
+
+To specify no particular preference and avoid xl to automatically pick
+a (set of) NUMA ndoe(s). In practice, using "all", this domain will have
+no NUMA node-affinity and it's memory will be spread on all the host's
+NUMA nodes.
+
+=item "0-3,5,^1"
+
+To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5.
+Combining this with "all" is possible, meaning "all,^7" results in the
+memory being allocated on all the host NUMA nodes except node 7, as well
+as trying to avoid running the domain's vcpu on the pcpus that belong to
+node 7.
+
+=item ["1", "4"] (or [1, 4])
+
+To ask for specific vcpu to NUMA node mapping. That means (in this example),
+memory will be allocated on host NUMA nodes 1 and 4 but, at the same time,
+vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while
+vcpu #1 on the pcpus of host NUMA node 4.
+
+=back
+
+If this option is not specified, xl picks up a NUMA node (or a set of NUMA
+nodes), according to some heuristics, and use that as the NUMA node-affinity
+for the guest.
+
+If we are on a NUMA machine (i.e., if the host has more than one NUMA node)
+and this option is not specified, libxl automatically tries to place the
+guest on the least possible number of nodes. A heuristic approach is used
+for choosing the best node (or set of nodes), with the goals of maximizing
+performance for the guest and, at the same time, achieving efficient
+utilization of host cpus and memory. In this case, all the vcpus of the
+guest will have the same vcpu node-affinity.
+
+Notice that, independently from whether the node-affinity is specified
+via this parameter, or automatically decided by libxl, that does not affect
+vcpu pinning, so the guest will still be able to run on all the cpus to
+which its vcpus are pinned, or all the cpus, if no B<cpus=> option is
+provided.
+
+See F<docs/misc/xl-numa-placement.markdown> for more details.
=back
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 1812bdc..bc4cf9a 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -215,19 +215,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
}
/*
- * Check if the domain has any CPU affinity. If not, try to build
- * up one. In case numa_place_domain() find at least a suitable
- * candidate, it will affect info->nodemap accordingly; if it
- * does not, it just leaves it as it is. This means (unless
- * some weird error manifests) the subsequent call to
- * libxl_domain_set_nodeaffinity() will do the actual placement,
+ * Check if the domain has any pinning or node-affinity and, if not, try
+ * to build up one.
+ *
+ * In case numa_place_domain() find at least a suitable candidate, it will
+ * affect info->nodemap accordingly; if it does not, it just leaves it as
+ * it is. This means (unless some weird error manifests) the subsequent
+ * call to libxl_domain_set_nodeaffinity() will do the actual placement,
* whatever that turns out to be.
*/
if (libxl_defbool_val(info->numa_placement)) {
- if (!libxl_bitmap_is_full(&info->cpumap)) {
+ if (!libxl_bitmap_is_full(&info->cpumap) ||
+ !libxl_bitmap_is_full(&info->nodemap)) {
LOG(ERROR, "Can run NUMA placement only if no vcpu "
- "affinity is specified");
+ "pinning or node-affinity is specified");
return ERROR_INVAL;
}
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 20c99ac..1026579 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
int vcpus_on_node[])
{
libxl_dominfo *dinfo = NULL;
- libxl_bitmap dom_nodemap, nodes_counted;
+ libxl_bitmap nodes_counted;
int nr_doms, nr_cpus;
int i, j, k;
@@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
return ERROR_FAIL;
}
- if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
- libxl_bitmap_dispose(&nodes_counted);
- libxl_dominfo_list_free(dinfo, nr_doms);
- return ERROR_FAIL;
- }
-
for (i = 0; i < nr_doms; i++) {
libxl_vcpuinfo *vinfo;
int nr_dom_vcpus;
@@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
if (vinfo == NULL)
continue;
- /* Retrieve the domain's node-affinity map */
- libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
-
for (j = 0; j < nr_dom_vcpus; j++) {
/*
* For each vcpu of each domain, it must have both vcpu-affinity
@@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
int node = tinfo[k].node;
if (libxl_bitmap_test(suitable_cpumap, k) &&
- libxl_bitmap_test(&dom_nodemap, node) &&
+ libxl_bitmap_test(&vinfo[j].nodemap, node) &&
!libxl_bitmap_test(&nodes_counted, node)) {
libxl_bitmap_set(&nodes_counted, node);
vcpus_on_node[node]++;
@@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
}
- libxl_bitmap_dispose(&dom_nodemap);
libxl_bitmap_dispose(&nodes_counted);
libxl_dominfo_list_free(dinfo, nr_doms);
return 0;
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index 7b84e6a..cac057c 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap)
{
memset(bitmap->map, 0, bitmap->size);
}
-static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit)
+static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit)
{
return bit >= 0 && bit < (bitmap->size * 8);
}
@@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
}
+static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu)
+{
+ return libxl_bitmap_valid(cpumap, cpu);
+}
+
+static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node)
+{
+ return libxl_bitmap_valid(nodemap, node);
+}
+
/* Populate cpumap with the cpus spanned by the nodes in nodemap */
int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
const libxl_bitmap *nodemap,
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 1659259..a035162 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -76,8 +76,9 @@ xlchild children[child_max];
static const char *common_domname;
static int fd_lock = -1;
-/* Stash for specific vcpu to pcpu mappping */
+/* Stash for specific vcpu to pcpu and vcpu to node mappping */
static int *vcpu_to_pcpu;
+static int *vcpu_to_node;
static const char savefileheader_magic[32]=
"Xen saved domain, xl format\n \0 \r";
@@ -670,7 +671,7 @@ static void parse_config_data(const char *config_source,
const char *buf;
long l;
XLU_Config *config;
- XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
+ XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
XLU_ConfigList *ioports, *irqs, *iomem;
int num_ioports, num_irqs, num_iomem;
int pci_power_mgmt = 0;
@@ -846,6 +847,53 @@ static void parse_config_data(const char *config_source,
libxl_defbool_set(&b_info->numa_placement, false);
}
+ if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
+ int n_cpus = 0;
+
+ if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+ fprintf(stderr, "Unable to allocate nodemap\n");
+ exit(1);
+ }
+
+ /*
+ * As above, use a temporary storage for the single vcpus'
+ * node-affinities.
+ */
+ vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus);
+ memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus);
+
+ libxl_bitmap_set_none(&b_info->nodemap);
+ while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) {
+ i = atoi(buf);
+ if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) {
+ fprintf(stderr, "node %d illegal\n", i);
+ exit(1);
+ }
+ libxl_bitmap_set(&b_info->nodemap, i);
+ if (n_cpus < b_info->max_vcpus)
+ vcpu_to_node[n_cpus] = i;
+ n_cpus++;
+ }
+
+ /* We have a nodemap, disable automatic placement */
+ libxl_defbool_set(&b_info->numa_placement, false);
+ }
+ else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) {
+ char *buf2 = strdup(buf);
+
+ if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+ fprintf(stderr, "Unable to allocate nodemap\n");
+ exit(1);
+ }
+
+ libxl_bitmap_set_none(&b_info->nodemap);
+ if (parse_bitmap_range(buf2, &b_info->nodemap))
+ exit(1);
+ free(buf2);
+
+ libxl_defbool_set(&b_info->numa_placement, false);
+ }
+
if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
b_info->max_memkb = l * 1024;
b_info->target_memkb = b_info->max_memkb;
@@ -2205,6 +2253,34 @@ start:
free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
}
+ /* And do the same for single vcpu to node-affinity mapping */
+ if (vcpu_to_node) {
+ libxl_bitmap vcpu_nodemap;
+
+ ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0);
+ if (ret)
+ goto error_out;
+ for (i = 0; i < d_config.b_info.max_vcpus; i++) {
+
+ if (vcpu_to_node[i] != -1) {
+ libxl_bitmap_set_none(&vcpu_nodemap);
+ libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]);
+ } else {
+ libxl_bitmap_set_any(&vcpu_nodemap);
+ }
+ if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) {
+ fprintf(stderr, "setting node-affinity failed"
+ " on vcpu `%d'.\n", i);
+ libxl_bitmap_dispose(&vcpu_nodemap);
+ free(vcpu_to_node);
+ ret = ERROR_FAIL;
+ goto error_out;
+ }
+ }
+ libxl_bitmap_dispose(&vcpu_nodemap);
+ free(vcpu_to_node); vcpu_to_node = NULL;
+ }
+
ret = libxl_userdata_store(ctx, domid, "xl",
config_data, config_len);
if (ret) {
next prev parent reply other threads:[~2013-11-05 14:36 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-11-05 14:33 [PATCH RESEND 00/12] Implement per-vcpu NUMA node-affinity for credit1 Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode Dario Faggioli
2013-11-05 14:43 ` George Dunlap
2013-11-05 14:34 ` [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning Dario Faggioli
2013-11-05 14:50 ` George Dunlap
2013-11-06 8:48 ` Dario Faggioli
2013-11-07 18:17 ` Ian Jackson
2013-11-08 9:24 ` Dario Faggioli
2013-11-08 15:20 ` Ian Jackson
2013-11-05 14:34 ` [PATCH RESEND 03/12] xl: implement and enable dryrun mode for `xl vcpu-pin' Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 04/12] xl: test script for the cpumap parser (for vCPU pinning) Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity Dario Faggioli
2013-11-05 14:52 ` Jan Beulich
2013-11-05 15:03 ` George Dunlap
2013-11-05 15:11 ` Jan Beulich
2013-11-05 15:24 ` George Dunlap
2013-11-05 22:15 ` Dario Faggioli
2013-11-05 15:11 ` George Dunlap
2013-11-05 15:23 ` Jan Beulich
2013-11-05 15:39 ` George Dunlap
2013-11-05 16:56 ` George Dunlap
2013-11-05 17:16 ` George Dunlap
2013-11-05 17:30 ` Jan Beulich
2013-11-05 23:12 ` Dario Faggioli
2013-11-05 23:01 ` Dario Faggioli
2013-11-06 9:39 ` Dario Faggioli
2013-11-06 9:46 ` Jan Beulich
2013-11-06 10:00 ` Dario Faggioli
2013-11-06 11:44 ` George Dunlap
2013-11-06 14:26 ` Dario Faggioli
2013-11-06 14:56 ` George Dunlap
2013-11-06 15:14 ` Jan Beulich
2013-11-06 16:12 ` George Dunlap
2013-11-06 16:22 ` Jan Beulich
2013-11-06 16:48 ` Dario Faggioli
2013-11-06 16:20 ` Dario Faggioli
2013-11-06 16:23 ` Dario Faggioli
2013-11-05 17:24 ` Jan Beulich
2013-11-05 17:31 ` George Dunlap
2013-11-05 23:08 ` Dario Faggioli
2013-11-05 22:54 ` Dario Faggioli
2013-11-05 22:22 ` Dario Faggioli
2013-11-06 11:41 ` Dario Faggioli
2013-11-06 14:47 ` George Dunlap
2013-11-06 16:53 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 06/12] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling Dario Faggioli
2013-11-05 16:20 ` George Dunlap
2013-11-06 9:15 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 08/12] xen: numa-sched: enable getting/specifying per-vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 09/12] libxc: " Dario Faggioli
2013-11-07 18:27 ` Ian Jackson
2013-11-12 16:01 ` Konrad Rzeszutek Wilk
2013-11-12 16:43 ` George Dunlap
2013-11-12 16:55 ` Konrad Rzeszutek Wilk
2013-11-12 18:40 ` Dario Faggioli
2013-11-12 19:13 ` Konrad Rzeszutek Wilk
2013-11-12 21:36 ` Dario Faggioli
2013-11-13 10:57 ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 10/12] libxl: " Dario Faggioli
2013-11-07 18:29 ` Ian Jackson
2013-11-08 9:18 ` Dario Faggioli
2013-11-08 15:07 ` Ian Jackson
2013-11-05 14:36 ` [PATCH RESEND 11/12] xl: " Dario Faggioli
2013-11-07 18:33 ` Ian Jackson
2013-11-08 9:33 ` Dario Faggioli
2013-11-08 15:18 ` Ian Jackson
2013-11-05 14:36 ` Dario Faggioli [this message]
2013-11-07 18:35 ` [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file Ian Jackson
2013-11-08 9:49 ` Dario Faggioli
2013-11-08 15:22 ` Ian Jackson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20131105143617.30446.53267.stgit@Solace \
--to=dario.faggioli@citrix.com \
--cc=Andrew.Cooper3@citrix.com \
--cc=Ian.Campbell@citrix.com \
--cc=Ian.Jackson@eu.citrix.com \
--cc=JBeulich@suse.com \
--cc=Marcus.Granado@eu.citrix.com \
--cc=dgdegra@tycho.nsa.gov \
--cc=george.dunlap@eu.citrix.com \
--cc=jtweaver@hawaii.edu \
--cc=juergen.gross@ts.fujitsu.com \
--cc=keir@xen.org \
--cc=lccycc123@gmail.com \
--cc=msw@amazon.com \
--cc=ufimtseva@gmail.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).