[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dario Faggioli <dario.faggioli@citrix.com>
To: xen-devel@lists.xen.org
Cc: Marcus Granado <Marcus.Granado@eu.citrix.com>,
	Keir Fraser <keir@xen.org>,
	Ian Campbell <Ian.Campbell@citrix.com>,
	Li Yechen <lccycc123@gmail.com>,
	George Dunlap <george.dunlap@eu.citrix.com>,
	Andrew Cooper <Andrew.Cooper3@citrix.com>,
	Juergen Gross <juergen.gross@ts.fujitsu.com>,
	Ian Jackson <Ian.Jackson@eu.citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	Justin Weaver <jtweaver@hawaii.edu>,
	Daniel De Graaf <dgdegra@tycho.nsa.gov>,
	Matt Wilson <msw@amazon.com>,
	Elena Ufimtseva <ufimtseva@gmail.com>
Subject: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Date: Tue, 05 Nov 2013 15:36:17 +0100	[thread overview]
Message-ID: <20131105143617.30446.53267.stgit@Solace> (raw)
In-Reply-To: <20131105142844.30446.78671.stgit@Solace>

in a similar way to how it is possible to specify vcpu-affinity.

Manual page is updated accordingly.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/man/xl.cfg.pod.5     |   70 +++++++++++++++++++++++++++++++++------
 tools/libxl/libxl_dom.c   |   18 ++++++----
 tools/libxl/libxl_numa.c  |   14 +-------
 tools/libxl/libxl_utils.h |   12 ++++++-
 tools/libxl/xl_cmdimpl.c  |   80 ++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 159 insertions(+), 35 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 1c98cb4..1212426 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -144,18 +144,64 @@ run on cpu #3 of the host.
 =back
 
 If this option is not specified, no vcpu to cpu pinning is established,
-and the vcpus of the guest can run on all the cpus of the host.
-
-If we are on a NUMA machine (i.e., if the host has more than one NUMA
-node) and this option is not specified, libxl automatically tries to
-place the guest on the least possible number of nodes. That, however,
-will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus, it will just prefer the ones from the node it has been
-placed on. A heuristic approach is used for choosing the best node (or
-set of nodes), with the goals of maximizing performance for the guest
-and, at the same time, achieving efficient utilization of host cpus
-and memory. See F<docs/misc/xl-numa-placement.markdown> for more
-details.
+and the vcpus of the guest can run on all the cpus of the host. If this
+option is specified, and no B<nodes=> option present the vcpu pinning
+mask of each vcpu is utilized to compute its vcpu node-affinity, and the
+union of all the vcpus node-affinities is what constitutes the domain
+node-affinity (which drives memory allocations).
+
+=back
+
+=item B<nodes="NODE-LIST">
+
+List of on which NUMA nodes the memory for the guest is allocated. This
+also means (starting from Xen 4.3 and if the credit scheduler is used)
+the vcpus of the domain prefers to run on the those NUMA nodes. Default is
+xl (via libxl) guesses. A C<NODE-LIST> may be specified as follows:
+
+=item "all"
+
+To specify no particular preference and avoid xl to automatically pick
+a (set of) NUMA ndoe(s). In practice, using "all", this domain will have
+no NUMA node-affinity and it's memory will be spread on all the host's
+NUMA nodes.
+
+=item "0-3,5,^1"
+
+To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5.
+Combining this with "all" is possible, meaning "all,^7" results in the
+memory being allocated on all the host NUMA nodes except node 7, as well
+as trying to avoid running the domain's vcpu on the pcpus that belong to
+node 7.
+
+=item ["1", "4"] (or [1, 4])
+
+To ask for specific vcpu to NUMA node mapping. That means (in this example),
+memory will be allocated on host NUMA nodes 1 and 4 but, at the same time,
+vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while
+vcpu #1 on the pcpus of host NUMA node 4.
+
+=back
+
+If this option is not specified, xl picks up a NUMA node (or a set of NUMA
+nodes), according to some heuristics, and use that as the NUMA node-affinity
+for the guest.
+
+If we are on a NUMA machine (i.e., if the host has more than one NUMA node)
+and this option is not specified, libxl automatically tries to place the
+guest on the least possible number of nodes. A heuristic approach is used
+for choosing the best node (or set of nodes), with the goals of maximizing
+performance for the guest and, at the same time, achieving efficient
+utilization of host cpus and memory. In this case, all the vcpus of the
+guest will have the same vcpu node-affinity.
+
+Notice that, independently from whether the node-affinity is specified
+via this parameter, or automatically decided by libxl, that does not affect
+vcpu pinning, so the guest will still be able to run on all the cpus to
+which its vcpus are pinned, or all the cpus, if no B<cpus=> option is
+provided.
+
+See F<docs/misc/xl-numa-placement.markdown> for more details.
 
 =back
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 1812bdc..bc4cf9a 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -215,19 +215,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
     }
 
     /*
-     * Check if the domain has any CPU affinity. If not, try to build
-     * up one. In case numa_place_domain() find at least a suitable
-     * candidate, it will affect info->nodemap accordingly; if it
-     * does not, it just leaves it as it is. This means (unless
-     * some weird error manifests) the subsequent call to
-     * libxl_domain_set_nodeaffinity() will do the actual placement,
+     * Check if the domain has any pinning or node-affinity and, if not, try
+     * to build up one.
+     *
+     * In case numa_place_domain() find at least a suitable candidate, it will
+     * affect info->nodemap accordingly; if it does not, it just leaves it as
+     * it is. This means (unless some weird error manifests) the subsequent
+     * call to libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
     if (libxl_defbool_val(info->numa_placement)) {
 
-        if (!libxl_bitmap_is_full(&info->cpumap)) {
+        if (!libxl_bitmap_is_full(&info->cpumap) ||
+            !libxl_bitmap_is_full(&info->nodemap)) {
             LOG(ERROR, "Can run NUMA placement only if no vcpu "
-                       "affinity is specified");
+                       "pinning or node-affinity is specified");
             return ERROR_INVAL;
         }
 
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
index 20c99ac..1026579 100644
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
                              int vcpus_on_node[])
 {
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap dom_nodemap, nodes_counted;
+    libxl_bitmap nodes_counted;
     int nr_doms, nr_cpus;
     int i, j, k;
 
@@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         return ERROR_FAIL;
     }
 
-    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
-        libxl_bitmap_dispose(&nodes_counted);
-        libxl_dominfo_list_free(dinfo, nr_doms);
-        return ERROR_FAIL;
-    }
-
     for (i = 0; i < nr_doms; i++) {
         libxl_vcpuinfo *vinfo;
         int nr_dom_vcpus;
@@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         if (vinfo == NULL)
             continue;
 
-        /* Retrieve the domain's node-affinity map */
-        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
-
         for (j = 0; j < nr_dom_vcpus; j++) {
             /*
              * For each vcpu of each domain, it must have both vcpu-affinity
@@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
                 int node = tinfo[k].node;
 
                 if (libxl_bitmap_test(suitable_cpumap, k) &&
-                    libxl_bitmap_test(&dom_nodemap, node) &&
+                    libxl_bitmap_test(&vinfo[j].nodemap, node) &&
                     !libxl_bitmap_test(&nodes_counted, node)) {
                     libxl_bitmap_set(&nodes_counted, node);
                     vcpus_on_node[node]++;
@@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
-    libxl_bitmap_dispose(&dom_nodemap);
     libxl_bitmap_dispose(&nodes_counted);
     libxl_dominfo_list_free(dinfo, nr_doms);
     return 0;
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index 7b84e6a..cac057c 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap)
 {
     memset(bitmap->map, 0, bitmap->size);
 }
-static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit)
+static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit)
 {
     return bit >= 0 && bit < (bitmap->size * 8);
 }
@@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
     return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
 }
 
+static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu)
+{
+    return libxl_bitmap_valid(cpumap, cpu);
+}
+
+static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node)
+{
+    return libxl_bitmap_valid(nodemap, node);
+}
+
 /* Populate cpumap with the cpus spanned by the nodes in nodemap */
 int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
                             const libxl_bitmap *nodemap,
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 1659259..a035162 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -76,8 +76,9 @@ xlchild children[child_max];
 static const char *common_domname;
 static int fd_lock = -1;
 
-/* Stash for specific vcpu to pcpu mappping */
+/* Stash for specific vcpu to pcpu and vcpu to node mappping */
 static int *vcpu_to_pcpu;
+static int *vcpu_to_node;
 
 static const char savefileheader_magic[32]=
     "Xen saved domain, xl format\n \0 \r";
@@ -670,7 +671,7 @@ static void parse_config_data(const char *config_source,
     const char *buf;
     long l;
     XLU_Config *config;
-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
+    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms;
     XLU_ConfigList *ioports, *irqs, *iomem;
     int num_ioports, num_irqs, num_iomem;
     int pci_power_mgmt = 0;
@@ -846,6 +847,53 @@ static void parse_config_data(const char *config_source,
         libxl_defbool_set(&b_info->numa_placement, false);
     }
 
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
+        int n_cpus = 0;
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /*
+         * As above, use a temporary storage for the single vcpus'
+         * node-affinities.
+         */
+        vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus);
+        memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus);
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_bitmap_set(&b_info->nodemap, i);
+            if (n_cpus < b_info->max_vcpus)
+                vcpu_to_node[n_cpus] = i;
+            n_cpus++;
+        }
+
+        /* We have a nodemap, disable automatic placement */
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) {
+        char *buf2 = strdup(buf);
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        if (parse_bitmap_range(buf2, &b_info->nodemap))
+            exit(1);
+        free(buf2);
+
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -2205,6 +2253,34 @@ start:
         free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
     }
 
+    /* And do the same for single vcpu to node-affinity mapping */
+    if (vcpu_to_node) {
+        libxl_bitmap vcpu_nodemap;
+
+        ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0);
+        if (ret)
+            goto error_out;
+        for (i = 0; i < d_config.b_info.max_vcpus; i++) {
+
+            if (vcpu_to_node[i] != -1) {
+                libxl_bitmap_set_none(&vcpu_nodemap);
+                libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]);
+            } else {
+                libxl_bitmap_set_any(&vcpu_nodemap);
+            }
+            if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) {
+                fprintf(stderr, "setting node-affinity failed"
+                        " on vcpu `%d'.\n", i);
+                libxl_bitmap_dispose(&vcpu_nodemap);
+                free(vcpu_to_node);
+                ret = ERROR_FAIL;
+                goto error_out;
+            }
+        }
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        free(vcpu_to_node); vcpu_to_node = NULL;
+    }
+
     ret = libxl_userdata_store(ctx, domid, "xl",
                                     config_data, config_len);
     if (ret) {

next prev parent reply	other threads:[~2013-11-05 14:36 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-11-05 14:33 [PATCH RESEND 00/12] Implement per-vcpu NUMA node-affinity for credit1 Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode Dario Faggioli
2013-11-05 14:43   ` George Dunlap
2013-11-05 14:34 ` [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning Dario Faggioli
2013-11-05 14:50   ` George Dunlap
2013-11-06  8:48     ` Dario Faggioli
2013-11-07 18:17   ` Ian Jackson
2013-11-08  9:24     ` Dario Faggioli
2013-11-08 15:20       ` Ian Jackson
2013-11-05 14:34 ` [PATCH RESEND 03/12] xl: implement and enable dryrun mode for `xl vcpu-pin' Dario Faggioli
2013-11-05 14:34 ` [PATCH RESEND 04/12] xl: test script for the cpumap parser (for vCPU pinning) Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity Dario Faggioli
2013-11-05 14:52   ` Jan Beulich
2013-11-05 15:03     ` George Dunlap
2013-11-05 15:11       ` Jan Beulich
2013-11-05 15:24         ` George Dunlap
2013-11-05 22:15         ` Dario Faggioli
2013-11-05 15:11       ` George Dunlap
2013-11-05 15:23         ` Jan Beulich
2013-11-05 15:39           ` George Dunlap
2013-11-05 16:56             ` George Dunlap
2013-11-05 17:16               ` George Dunlap
2013-11-05 17:30                 ` Jan Beulich
2013-11-05 23:12                   ` Dario Faggioli
2013-11-05 23:01                 ` Dario Faggioli
2013-11-06  9:39                 ` Dario Faggioli
2013-11-06  9:46                   ` Jan Beulich
2013-11-06 10:00                     ` Dario Faggioli
2013-11-06 11:44                       ` George Dunlap
2013-11-06 14:26                         ` Dario Faggioli
2013-11-06 14:56                           ` George Dunlap
2013-11-06 15:14                             ` Jan Beulich
2013-11-06 16:12                               ` George Dunlap
2013-11-06 16:22                                 ` Jan Beulich
2013-11-06 16:48                                 ` Dario Faggioli
2013-11-06 16:20                               ` Dario Faggioli
2013-11-06 16:23                             ` Dario Faggioli
2013-11-05 17:24               ` Jan Beulich
2013-11-05 17:31                 ` George Dunlap
2013-11-05 23:08               ` Dario Faggioli
2013-11-05 22:54             ` Dario Faggioli
2013-11-05 22:22         ` Dario Faggioli
2013-11-06 11:41         ` Dario Faggioli
2013-11-06 14:47           ` George Dunlap
2013-11-06 16:53             ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 06/12] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling Dario Faggioli
2013-11-05 16:20   ` George Dunlap
2013-11-06  9:15     ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 08/12] xen: numa-sched: enable getting/specifying per-vcpu node-affinity Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 09/12] libxc: " Dario Faggioli
2013-11-07 18:27   ` Ian Jackson
2013-11-12 16:01   ` Konrad Rzeszutek Wilk
2013-11-12 16:43     ` George Dunlap
2013-11-12 16:55       ` Konrad Rzeszutek Wilk
2013-11-12 18:40     ` Dario Faggioli
2013-11-12 19:13       ` Konrad Rzeszutek Wilk
2013-11-12 21:36         ` Dario Faggioli
2013-11-13 10:57         ` Dario Faggioli
2013-11-05 14:35 ` [PATCH RESEND 10/12] libxl: " Dario Faggioli
2013-11-07 18:29   ` Ian Jackson
2013-11-08  9:18     ` Dario Faggioli
2013-11-08 15:07       ` Ian Jackson
2013-11-05 14:36 ` [PATCH RESEND 11/12] xl: " Dario Faggioli
2013-11-07 18:33   ` Ian Jackson
2013-11-08  9:33     ` Dario Faggioli
2013-11-08 15:18       ` Ian Jackson
2013-11-05 14:36 ` Dario Faggioli [this message]
2013-11-07 18:35   ` [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file Ian Jackson
2013-11-08  9:49     ` Dario Faggioli
2013-11-08 15:22       ` Ian Jackson

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1c98cb4 dfblob:1212426 dfblob:1812bdc dfblob:bc4cf9a
dfblob:20c99ac dfblob:1026579 dfblob:7b84e6a dfblob:cac057c
dfblob:1659259 dfblob:a035162 )
 OR (
bs:"[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131105143617.30446.53267.stgit@Solace \
    --to=dario.faggioli@citrix.com \
    --cc=Andrew.Cooper3@citrix.com \
    --cc=Ian.Campbell@citrix.com \
    --cc=Ian.Jackson@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=Marcus.Granado@eu.citrix.com \
    --cc=dgdegra@tycho.nsa.gov \
    --cc=george.dunlap@eu.citrix.com \
    --cc=jtweaver@hawaii.edu \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=keir@xen.org \
    --cc=lccycc123@gmail.com \
    --cc=msw@amazon.com \
    --cc=ufimtseva@gmail.com \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.