[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file

xen-devel.lists.xenproject.org archive mirror

From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xen.org
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Keir Fraser <keir@xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Li Yechen <lccycc123@gmail.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	Justin Weaver <jtweaver@hawaii.edu>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>,
	Elena Ufimtseva <ufimtseva@gmail.com>
Subject: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Date: Tue, 05 Nov 2013 15:36:17 +0100
Message-ID: <20131105143617.30446.53267.stgit@Solace>
In-Reply-To: <20131105142844.30446.78671.stgit@Solace>

Enable specifying node-affinity in the VM config file, in a similar
way to how vcpu-affinity can already be specified.

Manual page is updated accordingly.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/man/xl.cfg.pod.5     |   70 +++++++++++++++++++++++++++++++++------
 tools/libxl/libxl_dom.c   |   18 ++++++----
 tools/libxl/libxl_numa.c  |   14 +-------
 tools/libxl/libxl_utils.h |   12 ++++++-
 tools/libxl/xl_cmdimpl.c  |   80 ++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 159 insertions(+), 35 deletions(-)
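
As an illustration of the string form of NODE-LIST documented in the
xl.cfg.pod.5 hunk (for example "0-3,5,^1" or "all,^7"), here is a minimal
standalone sketch in C. It is not xl's parser: sketch_parse_nodes() is a
hypothetical helper, it uses a plain 64-bit mask rather than a libxl_bitmap,
and it assumes that entries are applied left to right.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of xl or libxl): parse a comma separated
 * node list made of "all", "N", "A-B" and "^"-prefixed exclusions into a
 * mask covering nodes 0..nr_nodes-1. Entries are applied left to right
 * (an assumption of this sketch). Returns 0 on success, -1 on bad input.
 */
static int sketch_parse_nodes(const char *spec, int nr_nodes, uint64_t *mask)
{
    char *copy, *tok, *save = NULL;
    uint64_t full;
    int rc = 0;

    if (nr_nodes < 1 || nr_nodes > 64)
        return -1;
    full = nr_nodes == 64 ? UINT64_MAX : (UINT64_C(1) << nr_nodes) - 1;

    copy = strdup(spec);
    if (!copy)
        return -1;

    *mask = 0;
    for (tok = strtok_r(copy, ",", &save); tok;
         tok = strtok_r(NULL, ",", &save)) {
        int exclude = (tok[0] == '^');
        const char *p = tok + exclude;
        uint64_t bits = 0;

        if (!strcmp(p, "all")) {
            bits = full;
        } else {
            char *end;
            long lo, hi;

            lo = hi = strtol(p, &end, 10);
            if (end == p) { rc = -1; break; }
            if (*end == '-') {
                const char *q = end + 1;

                hi = strtol(q, &end, 10);
                if (end == q) { rc = -1; break; }
            }
            /* Reject trailing junk and out-of-range or reversed ranges */
            if (*end != '\0' || lo < 0 || hi < lo || hi >= nr_nodes) {
                rc = -1;
                break;
            }
            for (long n = lo; n <= hi; n++)
                bits |= UINT64_C(1) << n;
        }

        if (exclude)
            *mask &= ~bits;   /* "^N": drop the node(s) from the set */
        else
            *mask |= bits;    /* otherwise add them */
    }

    free(copy);
    return rc;
}
```

Under these assumptions, on a host with 8 nodes, "0-3,5,^1" selects nodes
0, 2, 3 and 5 (mask 0x2d), and "all,^7" selects every node except node 7
(mask 0x7f).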

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 1c98cb4..1212426 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -144,18 +144,64 @@ run on cpu #3 of the host.
 =back
 
 If this option is not specified, no vcpu to cpu pinning is established,
-and the vcpus of the guest can run on all the cpus of the host.
-
-If we are on a NUMA machine (i.e., if the host has more than one NUMA
-node) and this option is not specified, libxl automatically tries to
-place the guest on the least possible number of nodes. That, however,
-will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus, it will just prefer the ones from the node it has been
-placed on. A heuristic approach is used for choosing the best node (or
-set of nodes), with the goals of maximizing performance for the guest
-and, at the same time, achieving efficient utilization of host cpus
-and memory. See F<docs/misc/xl-numa-placement.markdown> for more
-details.
+and the vcpus of the guest can run on all the cpus of the host. If this
+option is specified and no B<nodes=> option is present, the vcpu pinning
+mask of each vcpu is used to compute its vcpu node-affinity, and the
+union of all the vcpus' node-affinities constitutes the domain
+node-affinity (which drives memory allocations).
+
+=item B<nodes="NODE-LIST">
+
+List of NUMA nodes on which the memory for the guest is allocated. This
+also means (starting from Xen 4.3 and if the credit scheduler is used)
+that the vcpus of the domain prefer to run on those NUMA nodes. By default,
+xl (via libxl) guesses. A C<NODE-LIST> may be specified as follows:
+
+=over 4
+
+=item "all"
+
+To specify no particular preference and prevent xl from automatically
+picking a (set of) NUMA node(s). In practice, using "all", this domain will
+have no NUMA node-affinity and its memory will be spread on all the host's
+NUMA nodes.
+
+=item "0-3,5,^1"
+
+To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5.
+Combining this with "all" is possible, meaning "all,^7" results in the
+memory being allocated on all the host NUMA nodes except node 7, as well
+as trying to avoid running the domain's vcpus on the pcpus that belong to
+node 7.
+
+=item ["1", "4"] (or [1, 4])
+
+To ask for a specific vcpu to NUMA node mapping. That means (in this example),
+memory will be allocated on host NUMA nodes 1 and 4 but, at the same time,
+vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while
+vcpu #1 on the pcpus of host NUMA node 4.
+
+=back
+
+If this option is not specified, xl picks a NUMA node (or a set of NUMA
+nodes), according to some heuristics, and uses that as the NUMA node-affinity
+for the guest.
+
+If we are on a NUMA machine (i.e., if the host has more than one NUMA node)
+and this option is not specified, libxl automatically tries to place the
+guest on the least possible number of nodes. A heuristic approach is used
+for choosing the best node (or set of nodes), with the goals of maximizing
+performance for the guest and, at the same time, achieving efficient
+utilization of host cpus and memory. In this case, all the vcpus of the
+guest will have the same vcpu node-affinity.
+
+Notice that, regardless of whether the node-affinity is specified via this
+parameter or automatically decided by libxl, vcpu pinning is not affected:
+the guest will still be able to run on all the cpus to which its vcpus are
+pinned, or on all the cpus of the host, if no B<cpus=> option is
+provided.
+
+See F<docs/misc/xl-numa-placement.markdown> for more details.
 
 =back
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 1812bdc..bc4cf9a 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -215,19 +215,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
     }
 
     /*
-     * Check if the domain has any CPU affinity. If not, try to build
-     * up one. In case numa_place_domain() find at least a suitable
-     * candidate, it will affect info->nodemap accordingly; if it
-     * does not, it just leaves it as it is. This means (unless
-     * some weird error manifests) the subsequent call to
-     * libxl_domain_set_nodeaffinity() will do the actual placement,
+     * Check if the domain has any pinning or node-affinity and, if not, try
+     * to build up one.
+     *
+     * In case numa_place_domain() finds at least one suitable candidate, it
+     * will affect info->nodemap accordingly; if it does not, it just leaves it
+     * as it is. This means (unless some weird error manifests) the subsequent
+     * call to libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
     if (libxl_defbool_val(info->numa_placement)) {
 
-        if (!libxl_bitmap_is_full(&info->cpumap)) {
+        if (!libxl_bitmap_is_full(&info->cpumap) ||
+            !libxl_bitmap_is_full(&info->nodemap)) {
             LOG(ERROR, "Can run NUMA placement only if no vcpu "
-                       "affinity is specified");
+                       "pinning or node-affinity is specified");
             return ERROR_INVAL;
         }
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 20c99ac..1026579 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
                              int vcpus_on_node[])
 {
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap dom_nodemap, nodes_counted;
+    libxl_bitmap nodes_counted;
     int nr_doms, nr_cpus;
     int i, j, k;
 
@@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         return ERROR_FAIL;
     }
 
-    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
-        libxl_bitmap_dispose(&nodes_counted);
-        libxl_dominfo_list_free(dinfo, nr_doms);
-        return ERROR_FAIL;
-    }
-
     for (i = 0; i < nr_doms; i++) {
         libxl_vcpuinfo *vinfo;
         int nr_dom_vcpus;
@@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         if (vinfo == NULL)
             continue;
 
-        /* Retrieve the domain's node-affinity map */
-        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
-
         for (j = 0; j < nr_dom_vcpus; j++) {
             /*
              * For each vcpu of each domain, it must have both vcpu-affinity
@@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
                 int node = tinfo[k].node;
 
                 if (libxl_bitmap_test(suitable_cpumap, k) &&
-                    libxl_bitmap_test(&dom_nodemap, node) &&
+                    libxl_bitmap_test(&vinfo[j].nodemap, node) &&
                     !libxl_bitmap_test(&nodes_counted, node)) {
                     libxl_bitmap_set(&nodes_counted, node);
                     vcpus_on_node[node]++;
@@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
-    libxl_bitmap_dispose(&dom_nodemap);
     libxl_bitmap_dispose(&nodes_counted);
     libxl_dominfo_list_free(dinfo, nr_doms);
     return 0;
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index 7b84e6a..cac057c 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap)
 {
     memset(bitmap->map, 0, bitmap->size);
 }
-static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit)
+static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit)
 {
     return bit >= 0 && bit < (bitmap->size * 8);
 }
@@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
     return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
 }
 
+static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu)
+{
+    return libxl_bitmap_valid(cpumap, cpu);
+}
+
+static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node)
+{
+    return libxl_bitmap_valid(nodemap, node);
+}
+
 /* Populate cpumap with the cpus spanned by the nodes in nodemap */
 int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
                             const libxl_bitmap *nodemap,
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 1659259..a035162 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -76,8 +76,9 @@ xlchild children[child_max];
 static const char *common_domname;
 static int fd_lock = -1;
 
-/* Stash for specific vcpu to pcpu mappping */
+/* Stash for specific vcpu to pcpu and vcpu to node mapping */
 static int *vcpu_to_pcpu;
+static int *vcpu_to_node;
 
 static const char savefileheader_magic[32]=
     "Xen saved domain, xl format\n \0 \r";
@@ -670,7 +671,7 @@ static void parse_config_data(const char *config_source,
     const char *buf;
     long l;
     XLU_Config *config;
-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
+    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
     XLU_ConfigList *ioports, *irqs, *iomem;
     int num_ioports, num_irqs, num_iomem;
     int pci_power_mgmt = 0;
@@ -846,6 +847,53 @@ static void parse_config_data(const char *config_source,
         libxl_defbool_set(&b_info->numa_placement, false);
     }
 
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
+        int n_cpus = 0;
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /*
+         * As above, use a temporary storage for the single vcpus'
+         * node-affinities.
+         */
+        vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus);
+        memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus);
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_bitmap_set(&b_info->nodemap, i);
+            if (n_cpus < b_info->max_vcpus)
+                vcpu_to_node[n_cpus] = i;
+            n_cpus++;
+        }
+
+        /* We have a nodemap, disable automatic placement */
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) {
+        char *buf2 = strdup(buf);
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        if (parse_bitmap_range(buf2, &b_info->nodemap))
+            exit(1);
+        free(buf2);
+
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -2205,6 +2253,34 @@ start:
         free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
     }
 
+    /* And do the same for single vcpu to node-affinity mapping */
+    if (vcpu_to_node) {
+        libxl_bitmap vcpu_nodemap;
+
+        ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0);
+        if (ret)
+            goto error_out;
+        for (i = 0; i < d_config.b_info.max_vcpus; i++) {
+
+            if (vcpu_to_node[i] != -1) {
+                libxl_bitmap_set_none(&vcpu_nodemap);
+                libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]);
+            } else {
+                libxl_bitmap_set_any(&vcpu_nodemap);
+            }
+            if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) {
+                fprintf(stderr, "setting node-affinity failed"
+                        " on vcpu `%d'.\n", i);
+                libxl_bitmap_dispose(&vcpu_nodemap);
+                free(vcpu_to_node);
+                ret = ERROR_FAIL;
+                goto error_out;
+            }
+        }
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        free(vcpu_to_node); vcpu_to_node = NULL;
+    }
+
     ret = libxl_userdata_store(ctx, domid, "xl",
                                     config_data, config_len);
     if (ret) {
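
The list form of nodes= (for instance nodes = [1, 4]) can be pictured with
the standalone sketch below. It loosely mirrors what the xl_cmdimpl.c hunks
do with vcpu_to_node and b_info->nodemap, but it is only an illustration:
the sketch_* names and SKETCH_NODE_ANY are hypothetical, and plain 64-bit
masks stand in for libxl_bitmap.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NODE_ANY (-1)   /* vcpu has no entry in the list */

/*
 * Hypothetical helper: the i-th list entry becomes the preferred node of
 * vcpu i, and every listed node (even past the last vcpu) is added to the
 * domain-wide mask. Returns 0 on success, -1 on an invalid node.
 */
static int sketch_map_vcpus_to_nodes(const int *entries, size_t nr_entries,
                                     int nr_nodes,
                                     int *vcpu_node, size_t nr_vcpus,
                                     uint64_t *dom_mask)
{
    size_t i;

    if (nr_nodes < 1 || nr_nodes > 64)
        return -1;

    for (i = 0; i < nr_vcpus; i++)
        vcpu_node[i] = SKETCH_NODE_ANY;

    *dom_mask = 0;
    for (i = 0; i < nr_entries; i++) {
        if (entries[i] < 0 || entries[i] >= nr_nodes)
            return -1;
        *dom_mask |= UINT64_C(1) << entries[i];
        if (i < nr_vcpus)
            vcpu_node[i] = entries[i];
    }
    return 0;
}

/*
 * Per-vcpu mask: a single node when one was listed for the vcpu, every
 * node otherwise (the patch uses libxl_bitmap_set_any() for that case).
 */
static uint64_t sketch_vcpu_mask(int node, int nr_nodes)
{
    uint64_t full = nr_nodes >= 64 ? UINT64_MAX
                                   : (UINT64_C(1) << nr_nodes) - 1;

    return node == SKETCH_NODE_ANY ? full : UINT64_C(1) << node;
}
```

With entries {1, 4}, 8 host nodes and a 4 vcpu guest, vcpu #0 prefers node 1,
vcpu #1 prefers node 4, vcpus #2 and #3 get all nodes, and the domain-wide
mask covers nodes 1 and 4 (0x12).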
