[PATCH 00 of 10 v2] NUMA aware credit scheduling

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 00 of 10 v2] NUMA aware credit scheduling
@ 2012-12-19 19:07 Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
                   ` (11 more replies)
  0 siblings, 12 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Hello Everyone,

Here it is the take 2 of the NUMA aware credit scheduling series. Sorry it took
a bit, but I had to take care of those nasty bugs causing scheduling anomalies,
as they were getting in the way and messing up the numbers when trying to
evaluate performances of this! :-)

I also rewrote most of the core of the two step vcpu and node affinity
balancing algorithm, as per George's suggestion during last round, to try to
squeeze a little bit more of performances improvement.

As already and repeatedly said, what the series does is providing the (credit)
scheduler with the knowledge of a domain's node-affinity. It will then always
try to run domain's vCPUs on one of those nodes first. Only if that turns out
to be impossible, it falls back to the old behaviour. (BTW, for any update on
the status of my "quest" about improving NUMA support in Xen, look
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap.)

I rerun my usual benchmark, SpecJBB2005, plus some others, i.e., some
configurations of sysbench and lmbench. A little bit more about them follows:

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.

 * Sysbench-memory is the time it takes for writing a fixed amount of memory
   (and then it is the throughput that is measured). What we expect is
   locality to be important, but at the same time the potential imbalances
   due to pinning could have a say in it.

 * Lmbench-proc is the time it takes for a process to fork a fixed amount of
   children. This is much more about latency than throughput, with locality
   of memory accesses playing a smaller role and, again, imbalances due to
   pinning being a potential issue.

On a 2 nodes, 16 cores system, where I can have 2 to 10 VMs (2 vCPUs each)
executing the benchmarks concurrently, here's what I get:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |   43451.853 | 49876.750 |    49693.653    |
 |    6 |   29368.589 | 33782.132 |    33692.936    |
 |   10 |   19138.934 | 21950.696 |    21413.311    |
 ----------------------------------------------------
 | Sysbench memory, throughput (the higher the better)
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  484.42167  | 552.32667 |    552.86167    |
 |    6 |  404.43667  | 440.00056 |    449.42611    |
 |   10 |  296.45600  | 315.51733 |    331.49067    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  824.00437  | 749.51892 |    741.42952    |
 |    6 |  942.39442  | 985.02761 |    974.94700    |
 |   10 |  1254.3121  | 1363.0792 |    1301.2917    |
 ----------------------------------------------------

Which, reasoning in terms of %-performances increase/decrease, means NUMA aware
scheduling does as follows, as compared to no affinity at all and to pinning:

     ----------------------------------
     | SpecJBB2005 (throughput)       |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +14.36%   |   -0.36%  |
     |    6 |   +14.72%   |   -0.26%  |
     |   10 |   +11.88%   |   -2.44%  |
     ----------------------------------
     | Sysbench memory (throughput)   |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     |    2 |   +14.12%   |   +0.09%  |
     |    6 |   +11.12%   |   +2.14%  |
     |   10 |   +11.81%   |   +5.06%  |
     ----------------------------------
     | LMBench proc (latency)         |
     ----------------------------------
     | #VMs | No affinity |  Pinning  |
     ----------------------------------
     |    2 |   +10.02%   |   +1.07%  |
     |    6 |    +3.45%   |   +1.02%  |
     |   10 |    +2.94%   |   +4.53%  |
     ----------------------------------

Numbers seem to tell we're being successful in taking advantage of both the
improved locality (when compared to no affinity) and the greater flexibility
the NUMA aware scheduling approach gives us (when compared to pinning).  In
fact, when throughput only is concerned (SpecJBB case), it behaves almost on
par with pinning, and a lot better than no affinity at all. Moreover, we're
even able to do better than them both, when latency comes a little bit more
into the game and the imbalances caused by pinning would make things worse than
not having any affinity, like in the sysbench and, especially, in the LMBench
case.

Here are the patches included in the series. I '*'-ed ones already received one
or more acks during v1.  However, there are patches that were significantly
reworked since then. In that case, I just ignored that, and left them with my
SOB only, as I think they definitely need to be re-reviewd. :-)

 * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/10] xen, libxc: introduce node maps and masks
   [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
   [ 4/10] xen: allow for explicitly specifying node-affinity
 * [ 5/10] libxc: allow for explicitly specifying node-affinity
 * [ 6/10] libxl: allow for explicitly specifying node-affinity
   [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 8/10] libxl: automatic placement deals with node-affinity
 * [ 9/10] xl: add node-affinity to the output of `xl list`
   [10/10] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  9:17   ` Jan Beulich
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

More specifically:
 1. replaces xenctl_cpumap with xenctl_bitmap
 2. provides bitmap_to_xenctl_bitmap and the reverse;
 3. re-implement cpumask_to_xenctl_bitmap with
    bitmap_to_xenctl_bitmap and the reverse;

Other than #3, no functional changes. Interface only slightly
afected.

This is in preparation of introducing NUMA node-affinity maps.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
--- a/tools/libxc/xc_cpupool.c
+++ b/tools/libxc/xc_cpupool.c
@@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
     sysctl.u.cpupool_op.cpupool_id = poolid;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
@@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
     sysctl.cmd = XEN_SYSCTL_cpupool_op;
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
     domctl.u.vcpuaffinity.vcpu = vcpu;
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
--- a/tools/libxc/xc_tbuf.c
+++ b/tools/libxc/xc_tbuf.c
@@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x
     bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8);
 
     set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap);
-    sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8;
+    sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8;
 
     ret = do_sysctl(xch, &sysctl);
 
diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c
--- a/xen/arch/x86/cpu/mcheck/mce.c
+++ b/xen/arch/x86/cpu/mcheck/mce.c
@@ -1457,8 +1457,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_m
             cpumap = &cpu_online_map;
         else
         {
-            ret = xenctl_cpumap_to_cpumask(&cmv,
-                                           &op->u.mc_inject_v2.cpumap);
+            ret = xenctl_bitmap_to_cpumask(&cmv, &op->u.mc_inject_v2.cpumap);
             if ( ret )
                 break;
             cpumap = cmv;
diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -375,7 +375,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
     {
         uint32_t cpu;
         uint64_t idletime, now = NOW();
-        struct xenctl_cpumap ctlmap;
+        struct xenctl_bitmap ctlmap;
         cpumask_var_t cpumap;
         XEN_GUEST_HANDLE(uint8) cpumap_bitmap;
         XEN_GUEST_HANDLE(uint64) idletimes;
@@ -388,11 +388,11 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
         if ( cpufreq_controller != FREQCTL_dom0_kernel )
             break;
 
-        ctlmap.nr_cpus  = op->u.getidletime.cpumap_nr_cpus;
+        ctlmap.nr_elems  = op->u.getidletime.cpumap_nr_cpus;
         guest_from_compat_handle(cpumap_bitmap,
                                  op->u.getidletime.cpumap_bitmap);
         ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion */
-        if ( (ret = xenctl_cpumap_to_cpumask(&cpumap, &ctlmap)) != 0 )
+        if ( (ret = xenctl_bitmap_to_cpumask(&cpumap, &ctlmap)) != 0 )
             goto out;
         guest_from_compat_handle(idletimes, op->u.getidletime.idletime);
 
@@ -411,7 +411,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA
 
         op->u.getidletime.now = now;
         if ( ret == 0 )
-            ret = cpumask_to_xenctl_cpumap(&ctlmap, cpumap);
+            ret = cpumask_to_xenctl_bitmap(&ctlmap, cpumap);
         free_cpumask_var(cpumap);
 
         if ( ret == 0 && __copy_field_to_guest(u_xenpf_op, op, u.getidletime) )
diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c
--- a/xen/common/cpupool.c
+++ b/xen/common/cpupool.c
@@ -493,7 +493,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
         op->cpupool_id = c->cpupool_id;
         op->sched_id = c->sched->sched_id;
         op->n_dom = c->n_dom;
-        ret = cpumask_to_xenctl_cpumap(&op->cpumap, c->cpu_valid);
+        ret = cpumask_to_xenctl_bitmap(&op->cpumap, c->cpu_valid);
         cpupool_put(c);
     }
     break;
@@ -588,7 +588,7 @@ int cpupool_do_sysctl(struct xen_sysctl_
 
     case XEN_SYSCTL_CPUPOOL_OP_FREEINFO:
     {
-        ret = cpumask_to_xenctl_cpumap(
+        ret = cpumask_to_xenctl_bitmap(
             &op->cpumap, &cpupool_free_cpus);
     }
     break;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -32,28 +32,29 @@
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
 
-int cpumask_to_xenctl_cpumap(
-    struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask)
+int bitmap_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_bitmap,
+                            const unsigned long *bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes, i;
     uint8_t zero = 0;
     int err = 0;
-    uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
-    bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids);
+    bitmap_long_to_byte(bytemap, bitmap, nbits);
 
     if ( copy_bytes != 0 )
-        if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) )
+        if ( copy_to_guest(xenctl_bitmap->bitmap, bytemap, copy_bytes) )
             err = -EFAULT;
 
     for ( i = copy_bytes; !err && i < guest_bytes; i++ )
-        if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) )
+        if ( copy_to_guest_offset(xenctl_bitmap->bitmap, i, &zero, 1) )
             err = -EFAULT;
 
     xfree(bytemap);
@@ -61,36 +62,58 @@ int cpumask_to_xenctl_cpumap(
     return err;
 }
 
-int xenctl_cpumap_to_cpumask(
-    cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap)
+int xenctl_bitmap_to_bitmap(unsigned long *bitmap,
+                            const struct xenctl_bitmap *xenctl_bitmap,
+                            unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes;
     int err = 0;
-    uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
     if ( copy_bytes != 0 )
     {
-        if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) )
+        if ( copy_from_guest(bytemap, xenctl_bitmap->bitmap, copy_bytes) )
             err = -EFAULT;
-        if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes == copy_bytes) )
-            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_cpumap->nr_cpus & 7));
+        if ( (xenctl_bitmap->nr_elems & 7) && (guest_bytes == copy_bytes) )
+            bytemap[guest_bytes-1] &= ~(0xff << (xenctl_bitmap->nr_elems & 7));
     }
 
-    if ( err )
-        /* nothing */;
-    else if ( alloc_cpumask_var(cpumask) )
-        bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids);
+    if ( !err )
+        bitmap_byte_to_long(bitmap, bytemap, nbits);
+
+    xfree(bytemap);
+
+    return err;
+}
+
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_cpumap,
+                             const cpumask_t *cpumask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_cpumap, cpumask_bits(cpumask),
+                                   nr_cpu_ids);
+}
+
+int xenctl_bitmap_to_cpumask(cpumask_var_t *cpumask,
+                             const struct xenctl_bitmap *xenctl_cpumap)
+{
+    int err = 0;
+
+    if ( alloc_cpumask_var(cpumask) ) {
+        err = xenctl_bitmap_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap,
+                                      nr_cpu_ids);
+        /* In case of error, cleanup is up to us, as the caller won't care! */
+        if ( err )
+            free_cpumask_var(*cpumask);
+    }
     else
         err = -ENOMEM;
 
-    xfree(bytemap);
-
     return err;
 }
 
@@ -583,7 +606,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         {
             cpumask_var_t new_affinity;
 
-            ret = xenctl_cpumap_to_cpumask(
+            ret = xenctl_bitmap_to_cpumask(
                 &new_affinity, &op->u.vcpuaffinity.cpumap);
             if ( !ret )
             {
@@ -593,7 +616,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
         }
         else
         {
-            ret = cpumask_to_xenctl_cpumap(
+            ret = cpumask_to_xenctl_bitmap(
                 &op->u.vcpuaffinity.cpumap, v->cpu_affinity);
         }
     }
diff --git a/xen/common/trace.c b/xen/common/trace.c
--- a/xen/common/trace.c
+++ b/xen/common/trace.c
@@ -384,7 +384,7 @@ int tb_control(xen_sysctl_tbuf_op_t *tbc
     {
         cpumask_var_t mask;
 
-        rc = xenctl_cpumap_to_cpumask(&mask, &tbc->cpu_mask);
+        rc = xenctl_bitmap_to_cpumask(&mask, &tbc->cpu_mask);
         if ( !rc )
         {
             cpumask_copy(&tb_cpu_mask, mask);
diff --git a/xen/include/public/arch-x86/xen-mca.h b/xen/include/public/arch-x86/xen-mca.h
--- a/xen/include/public/arch-x86/xen-mca.h
+++ b/xen/include/public/arch-x86/xen-mca.h
@@ -414,7 +414,7 @@ struct xen_mc_mceinject {
 
 struct xen_mc_inject_v2 {
 	uint32_t flags;
-	struct xenctl_cpumap cpumap;
+	struct xenctl_bitmap cpumap;
 };
 #endif
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -284,7 +284,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc
 /* XEN_DOMCTL_getvcpuaffinity */
 struct xen_domctl_vcpuaffinity {
     uint32_t  vcpu;              /* IN */
-    struct xenctl_cpumap cpumap; /* IN/OUT */
+    struct xenctl_bitmap cpumap; /* IN/OUT */
 };
 typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -71,7 +71,7 @@ struct xen_sysctl_tbuf_op {
 #define XEN_SYSCTL_TBUFOP_disable      5
     uint32_t cmd;
     /* IN/OUT variables */
-    struct xenctl_cpumap cpu_mask;
+    struct xenctl_bitmap cpu_mask;
     uint32_t             evt_mask;
     /* OUT variables */
     uint64_aligned_t buffer_mfn;
@@ -532,7 +532,7 @@ struct xen_sysctl_cpupool_op {
     uint32_t domid;       /* IN: M              */
     uint32_t cpu;         /* IN: AR             */
     uint32_t n_dom;       /*            OUT: I  */
-    struct xenctl_cpumap cpumap; /*     OUT: IF */
+    struct xenctl_bitmap cpumap; /*     OUT: IF */
 };
 typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t);
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16];
 #endif
 
 #ifndef __ASSEMBLY__
-struct xenctl_cpumap {
+struct xenctl_bitmap {
     XEN_GUEST_HANDLE_64(uint8) bitmap;
-    uint32_t nr_cpus;
+    uint32_t nr_elems;
 };
 #endif
 
diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
--- a/xen/include/xen/cpumask.h
+++ b/xen/include/xen/cpumask.h
@@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
 #define for_each_present_cpu(cpu)  for_each_cpu(cpu, &cpu_present_map)
 
 /* Copy to/from cpumap provided by control tools. */
-struct xenctl_cpumap;
-int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
-int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *);
+struct xenctl_bitmap;
+int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *);
+int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *);
 
 #endif /* __XEN_CPUMASK_H */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -2,7 +2,7 @@
 # ! - needs translation
 # ? - needs checking
 ?	dom0_vga_console_info		xen.h
-?	xenctl_cpumap			xen.h
+?	xenctl_bitmap			xen.h
 ?	mmu_update			xen.h
 !	mmuext_op			xen.h
 !	start_info			xen.h

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2012-12-20  9:17   ` Jan Beulich
  2012-12-20  9:35     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> More specifically:
>  1. replaces xenctl_cpumap with xenctl_bitmap
>  2. provides bitmap_to_xenctl_bitmap and the reverse;
>  3. re-implement cpumask_to_xenctl_bitmap with
>     bitmap_to_xenctl_bitmap and the reverse;
> 
> Other than #3, no functional changes. Interface only slightly
> afected.
> 
> This is in preparation of introducing NUMA node-affinity maps.

This (at least) lacks an adjustment to
tools/tests/mce-test/tools/xen-mceinj.c afaict.

Jan.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2012-12-20  9:17   ` Jan Beulich
@ 2012-12-20  9:35     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 1044 bytes --]

On Thu, 2012-12-20 at 09:17 +0000, Jan Beulich wrote: 
> >>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > More specifically:
> >  1. replaces xenctl_cpumap with xenctl_bitmap
> >  2. provides bitmap_to_xenctl_bitmap and the reverse;
> >  3. re-implement cpumask_to_xenctl_bitmap with
> >     bitmap_to_xenctl_bitmap and the reverse;
> > 
> > Other than #3, no functional changes. Interface only slightly
> > afected.
> > 
> > This is in preparation of introducing NUMA node-affinity maps.
> 
> This (at least) lacks an adjustment to
> tools/tests/mce-test/tools/xen-mceinj.c afaict.
> 
Indeed. I really think I was building those, but it looks like I wasn't.
Sorry for that and thanks for pointing out, will fix.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  9:18   ` Jan Beulich
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Following suit from cpumap and cpumask implementations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch
     return (xc_get_max_cpus(xch) + 7) / 8;
 }
 
+int xc_get_nodemap_size(xc_interface *xch)
+{
+    return (xc_get_max_nodes(xch) + 7) / 8;
+}
+
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
 {
     int sz;
@@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface
     return calloc(1, sz);
 }
 
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
+{
+    int sz;
+
+    sz = xc_get_nodemap_size(xch);
+    if (sz == 0)
+        return NULL;
+    return calloc(1, sz);
+}
+
 int xc_readconsolering(xc_interface *xch,
                        char *buffer,
                        unsigned int *pnr_chars,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
 /* allocate a cpumap */
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
- /*
+/*
  * NODEMAP handling
  */
+typedef uint8_t *xc_nodemap_t;
+
 /* return maximum number of NUMA nodes the hypervisor supports */
 int xc_get_max_nodes(xc_interface *xch);
 
+/* return array size for nodemap */
+int xc_get_nodemap_size(xc_interface *xch);
+
+/* allocate a nodemap */
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
+
 /*
  * DOMAIN DEBUGGING FUNCTIONS
  */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -117,6 +117,30 @@ int xenctl_bitmap_to_cpumask(cpumask_var
     return err;
 }
 
+int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap,
+                              const nodemask_t *nodemask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask),
+                                   MAX_NUMNODES);
+}
+
+int xenctl_bitmap_to_nodemask(nodemask_t *nodemask,
+                              const struct xenctl_bitmap *xenctl_nodemap)
+{
+    int err = 0;
+
+    if ( alloc_nodemask_var(nodemask) ) {
+        err = xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap,
+                                      MAX_NUMNODES);
+        if ( err )
+            free_nodemask_var(*nodemask);
+    }
+    else
+        err = -ENOMEM;
+
+    return err;
+}
+
 static inline int is_free_domid(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
 }
 #endif
 
+/*
+ * nodemask_var_t: struct nodemask for stack usage.
+ *
+ * See definition of cpumask_var_t in include/xen//cpumask.h.
+ */
+#if MAX_NUMNODES > 2 * BITS_PER_LONG
+#include <xen/xmalloc.h>
+
+typedef nodemask_t *nodemask_var_t;
+
+#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG)
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+	*(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long));
+	return *mask != NULL;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+	*(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long));
+	return *mask != NULL;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+	xfree(mask);
+}
+#else
+typedef nodemask_t nodemask_var_t;
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+	return 1;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+	nodes_clear(*mask);
+	return 1;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+}
+#endif
+
 #if MAX_NUMNODES > 1
 #define for_each_node_mask(node, mask)			\
 	for ((node) = first_node(mask);			\

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
@ 2012-12-20  9:18   ` Jan Beulich
  2012-12-20  9:55     ` Dario Faggioli
  2012-12-20 14:33     ` George Dunlap
  0 siblings, 2 replies; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:18 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf

>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>  }
>  #endif
>  
> +/*
> + * nodemask_var_t: struct nodemask for stack usage.
> + *
> + * See definition of cpumask_var_t in include/xen//cpumask.h.
> + */
> +#if MAX_NUMNODES > 2 * BITS_PER_LONG

Is that case reasonable to expect?

Jan

> +#include <xen/xmalloc.h>
> +
> +typedef nodemask_t *nodemask_var_t;
> +
> +#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG)
> +
> +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	*(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long));
> +	return *mask != NULL;
> +}
> +
> +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	*(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long));
> +	return *mask != NULL;
> +}
> +
> +static inline void free_nodemask_var(nodemask_var_t mask)
> +{
> +	xfree(mask);
> +}
> +#else
> +typedef nodemask_t nodemask_var_t;
> +
> +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	return 1;
> +}
> +
> +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
> +{
> +	nodes_clear(*mask);
> +	return 1;
> +}
> +
> +static inline void free_nodemask_var(nodemask_var_t mask)
> +{
> +}
> +#endif
> +
>  #if MAX_NUMNODES > 1
>  #define for_each_node_mask(node, mask)			\
>  	for ((node) = first_node(mask);			\

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20  9:18   ` Jan Beulich
@ 2012-12-20  9:55     ` Dario Faggioli
  2012-12-20 14:33     ` George Dunlap
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:55 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel, Matt Wilson, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 786 bytes --]

On Thu, 2012-12-20 at 09:18 +0000, Jan Beulich wrote: 
> > +/*
> > + * nodemask_var_t: struct nodemask for stack usage.
> > + *
> > + * See definition of cpumask_var_t in include/xen//cpumask.h.
> > + */
> > +#if MAX_NUMNODES > 2 * BITS_PER_LONG
> 
> Is that case reasonable to expect?
> 
I really don't think so. Here, I really was just aligning cpumask and
nodemask types. But you're right, thinking more about it, this is
completely unnecessary. I'll kill this hunk.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20  9:18   ` Jan Beulich
  2012-12-20  9:55     ` Dario Faggioli
@ 2012-12-20 14:33     ` George Dunlap
  2012-12-20 14:52       ` Jan Beulich
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 14:33 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson,
	xen-devel@lists.xen.org, Matt Wilson, Daniel De Graaf,
	Juergen Gross

On 20/12/12 09:18, Jan Beulich wrote:
>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>> --- a/xen/include/xen/nodemask.h
>> +++ b/xen/include/xen/nodemask.h
>> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>>   }
>>   #endif
>>   
>> +/*
>> + * nodemask_var_t: struct nodemask for stack usage.
>> + *
>> + * See definition of cpumask_var_t in include/xen//cpumask.h.
>> + */
>> +#if MAX_NUMNODES > 2 * BITS_PER_LONG
> Is that case reasonable to expect?

2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long 
ago that I would have considered 4096 cores a pretty unreasonable 
expectation.  Is there a particular reason you think this is going to be 
more than a few years away, and a particular harm in having the code 
here to begin with?

At very least it should be replaced with something like this:

#if MAX_NUMNODES > 2 * BITS_PER_LONG
# error "MAX_NUMNODES exceeds fixed size nodemask; need to implement 
variable-length nodemasks"
#endif

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20 14:33     ` George Dunlap
@ 2012-12-20 14:52       ` Jan Beulich
  2012-12-20 15:13         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Jan Beulich @ 2012-12-20 14:52 UTC (permalink / raw)
  To: George Dunlap
  Cc: MarcusGranado, Dan Magenheimer, Ian Campbell, AnilMadhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson,
	xen-devel@lists.xen.org, Matt Wilson, Daniel De Graaf,
	Juergen Gross

>>> On 20.12.12 at 15:33, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 20/12/12 09:18, Jan Beulich wrote:
>>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> --- a/xen/include/xen/nodemask.h
>>> +++ b/xen/include/xen/nodemask.h
>>> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
>>>   }
>>>   #endif
>>>   
>>> +/*
>>> + * nodemask_var_t: struct nodemask for stack usage.
>>> + *
>>> + * See definition of cpumask_var_t in include/xen//cpumask.h.
>>> + */
>>> +#if MAX_NUMNODES > 2 * BITS_PER_LONG
>> Is that case reasonable to expect?
> 
> 2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long 
> ago that I would have considered 4096 cores a pretty unreasonable 
> expectation.  Is there a particular reason you think this is going to be 
> more than a few years away, and a particular harm in having the code 
> here to begin with?

I just don't see node counts grow even near as quickly as core/
thread ones.

> At very least it should be replaced with something like this:
> 
> #if MAX_NUMNODES > 2 * BITS_PER_LONG
> # error "MAX_NUMNODES exceeds fixed size nodemask; need to implement 
> variable-length nodemasks"
> #endif

Yes, if there is a limitation in the code, it being violated should be
detected at build time. But I'd suppose on can construct the
statically sized mask definition such that it copes with larger counts
(just at the expense of having larger data objects perhaps
on-stack). Making sure to always pass pointers rather than objects
to functions will already eliminated a good part of the problem.

Jan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
  2012-12-20 14:52       ` Jan Beulich
@ 2012-12-20 15:13         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 15:13 UTC (permalink / raw)
  To: Jan Beulich
  Cc: MarcusGranado, Dan Magenheimer, Ian Campbell, AnilMadhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Matt Wilson, Daniel De Graaf

On Thu, Dec 20, 2012 at 3:52 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> 2 * BITS_PER_LONG is just going to be 128, right?  It wasn't too long
>> ago that I would have considered 4096 cores a pretty unreasonable
>> expectation.  Is there a particular reason you think this is going to be
>> more than a few years away, and a particular harm in having the code
>> here to begin with?
>
> I just don't see node counts grow even near as quickly as core/
> thread ones.
>
Yep, same here, that's why, despite having put that code there, I agree
with Jan in removing it. :-)

That feeling matches with what we've been repeatedly told by hardware
vendors, about node count already saturating at around 8. Beyond that, a
different architectural approach will be needed to tackle scalability issues
(clusters? NoCs?).

Just for the records, Linux still deals with this in the same way it did when
we took these files from there (i.e., without this hunk):
 - cpumasks exists in static and dynamically allocated (cpumask_var_t) foms
 - nodemask are only and always statically allocated

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
  2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
                     ` (3 more replies)
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
                   ` (8 subsequent siblings)
  11 siblings, 4 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

As vcpu-affinity tells where VCPUs must run, node-affinity tells
where they should or, better, prefer. While respecting vcpu-affinity
remains mandatory, node-affinity is not that strict, it only expresses
a preference, although honouring it is almost always true that will
bring significant performances benefit (especially as compared to
not having any affinity at all).

This change modifies the VCPU load balancing algorithm (for the
credit scheduler only), introducing a two steps logic.
During the first step, we use the node-affinity mask. The aim is
giving precedence to the CPUs where it is known to be preferable
for the domain to run. If that fails in finding a valid PCPU, the
node-affinity is just ignored and, in the second step, we fall
back to using cpu-affinity only.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * CPU masks variables moved off from the stack, as requested during
   review. As per the comments in the code, having them in the private
   (per-scheduler instance) struct could have been enough, but it would be
   racy (again, see comments). For that reason, use a global bunch of
   them of (via per_cpu());
 * George suggested a different load balancing logic during v1's review. I
   think he was right and then I changed the old implementation in a way
   that resembles exactly that. I rewrote most of this patch to introduce
   a more sensible and effective noda-affinity handling logic.

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -111,6 +111,33 @@
 
 
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
+
+/*
+ * When building for high number of CPUs, cpumask_var_t
+ * variables on stack are better avoided. However, we need them,
+ * in order to be able to consider both vcpu and node affinity.
+ * We also don't want to xmalloc()/xfree() them, as that would
+ * happen in critical code paths. Therefore, let's (pre)allocate
+ * some scratch space for them.
+ *
+ * Having one mask for each instance of the scheduler seems
+ * enough, and that would suggest putting it wihin `struct
+ * csched_private' below. However, we don't always hold the
+ * private scheduler lock when the mask itself would need to
+ * be used, leaving room for races. For that reason, we define
+ * and use a cpumask_t for each CPU. As preemption is not an
+ * issue here (we're holding the runqueue spin-lock!), that is
+ * both enough and safe.
+ */
+DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
+#define scratch_balance_mask (this_cpu(csched_balance_mask))
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -159,6 +186,9 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /* cpumask translated from the domain's node-affinity.
+     * Basically, the CPUs we prefer to be scheduled on. */
+    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -239,6 +269,42 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define for_each_csched_balance_step(__step) \
+    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
+
+/*
+ * Each csched-balance step has to use its own cpumask. This function
+ * determines which one, given the step, and copies it in mask. Notice
+ * that, in case of node-affinity balancing step, it also filters out from
+ * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
+ * as we do not want to end up running a vcpu where it would like, but
+ * is not allowed to!
+ *
+ * As an optimization, if a domain does not have any node-affinity at all
+ * (namely, its node affinity is automatically computed), not only the
+ * computed mask will reflect its vcpu-affinity, but we also return -1 to
+ * let the caller know that he can skip the step or quit the loop (if he
+ * wants).
+ */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+    {
+        struct domain *d = vc->domain;
+        struct csched_dom *sdom = CSCHED_DOM(d);
+
+        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
+
+        if ( cpumask_full(sdom->node_affinity_cpumask) )
+            return -1;
+    }
+    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
+        cpumask_copy(mask, vc->cpu_affinity);
+
+    return 0;
+}
+
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -266,67 +332,94 @@ static inline void
     struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
     struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
     cpumask_t mask, idle_mask;
-    int idlers_empty;
+    int balance_step, idlers_empty;
 
     ASSERT(cur);
-    cpumask_clear(&mask);
-
     idlers_empty = cpumask_empty(prv->idlers);
 
     /*
-     * If the pcpu is idle, or there are no idlers and the new
-     * vcpu is a higher priority than the old vcpu, run it here.
-     *
-     * If there are idle cpus, first try to find one suitable to run
-     * new, so we can avoid preempting cur.  If we cannot find a
-     * suitable idler on which to run new, run it here, but try to
-     * find a suitable idler on which to run cur instead.
+     * Node and vcpu-affinity balancing loop. To speed things up, in case
+     * no node-affinity at all is present, scratch_balance_mask reflects
+     * the vcpu-affinity, and ret is -1, so that we then can quit the
+     * loop after only one step.
      */
-    if ( cur->pri == CSCHED_PRI_IDLE
-         || (idlers_empty && new->pri > cur->pri) )
+    for_each_csched_balance_step( balance_step )
     {
-        if ( cur->pri != CSCHED_PRI_IDLE )
-            SCHED_STAT_CRANK(tickle_idlers_none);
-        cpumask_set_cpu(cpu, &mask);
-    }
-    else if ( !idlers_empty )
-    {
-        /* Check whether or not there are idlers that can run new */
-        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
+        int ret, new_idlers_empty;
+
+        cpumask_clear(&mask);
 
         /*
-         * If there are no suitable idlers for new, and it's higher
-         * priority than cur, ask the scheduler to migrate cur away.
-         * We have to act like this (instead of just waking some of
-         * the idlers suitable for cur) because cur is running.
+         * If the pcpu is idle, or there are no idlers and the new
+         * vcpu is a higher priority than the old vcpu, run it here.
          *
-         * If there are suitable idlers for new, no matter priorities,
-         * leave cur alone (as it is running and is, likely, cache-hot)
-         * and wake some of them (which is waking up and so is, likely,
-         * cache cold anyway).
+         * If there are idle cpus, first try to find one suitable to run
+         * new, so we can avoid preempting cur.  If we cannot find a
+         * suitable idler on which to run new, run it here, but try to
+         * find a suitable idler on which to run cur instead.
          */
-        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
+        if ( cur->pri == CSCHED_PRI_IDLE
+             || (idlers_empty && new->pri > cur->pri) )
         {
-            SCHED_STAT_CRANK(tickle_idlers_none);
-            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
-            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
-            SCHED_STAT_CRANK(migrate_kicked_away);
-            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+            if ( cur->pri != CSCHED_PRI_IDLE )
+                SCHED_STAT_CRANK(tickle_idlers_none);
             cpumask_set_cpu(cpu, &mask);
         }
-        else if ( !cpumask_empty(&idle_mask) )
+        else if ( !idlers_empty )
         {
-            /* Which of the idlers suitable for new shall we wake up? */
-            SCHED_STAT_CRANK(tickle_idlers_some);
-            if ( opt_tickle_one_idle )
+            /* Are there idlers suitable for new (for this balance step)? */
+            ret = csched_balance_cpumask(new->vcpu, balance_step,
+                                         &scratch_balance_mask);
+            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
+            new_idlers_empty = cpumask_empty(&idle_mask);
+
+            /*
+             * Let's not be too harsh! If there aren't idlers suitable
+             * for new in its node-affinity mask, make sure we check its
+             * vcpu-affinity as well, before tacking final decisions.
+             */
+            if ( new_idlers_empty
+                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
+                continue;
+
+            /*
+             * If there are no suitable idlers for new, and it's higher
+             * priority than cur, ask the scheduler to migrate cur away.
+             * We have to act like this (instead of just waking some of
+             * the idlers suitable for cur) because cur is running.
+             *
+             * If there are suitable idlers for new, no matter priorities,
+             * leave cur alone (as it is running and is, likely, cache-hot)
+             * and wake some of them (which is waking up and so is, likely,
+             * cache cold anyway).
+             */
+            if ( new_idlers_empty && new->pri > cur->pri )
             {
-                this_cpu(last_tickle_cpu) =
-                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
-                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                SCHED_STAT_CRANK(tickle_idlers_none);
+                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+                SCHED_STAT_CRANK(migrate_kicked_away);
+                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+                cpumask_set_cpu(cpu, &mask);
             }
-            else
-                cpumask_or(&mask, &mask, &idle_mask);
+            else if ( !new_idlers_empty )
+            {
+                /* Which of the idlers suitable for new shall we wake up? */
+                SCHED_STAT_CRANK(tickle_idlers_some);
+                if ( opt_tickle_one_idle )
+                {
+                    this_cpu(last_tickle_cpu) =
+                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
+                }
+                else
+                    cpumask_or(&mask, &mask, &idle_mask);
+            }
         }
+
+        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
+        if ( !cpumask_empty(&mask) || ret )
+            break;
     }
 
     if ( !cpumask_empty(&mask) )
@@ -475,15 +568,28 @@ static inline int
 }
 
 static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
      * Don't pick up work that's in the peer's scheduling tail or hot on
-     * peer PCPU. Only pick up work that's allowed to run on our CPU.
+     * peer PCPU. Only pick up work that prefers and/or is allowed to run
+     * on our CPU.
      */
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
-           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+           cpumask_test_cpu(dest_cpu, mask);
+}
+
+static inline int
+__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
+{
+    /*
+     * Consent to migration if cpu is one of the idlers in the VCPU's
+     * affinity mask. In fact, if that is not the case, it just means it
+     * was some other CPU that was tickled and should hence come and pick
+     * VCPU up. Migrating it to cpu would only make things worse.
+     */
+    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
 }
 
 static int
@@ -493,85 +599,98 @@ static int
     cpumask_t idlers;
     cpumask_t *online;
     struct csched_pcpu *spc = NULL;
+    int ret, balance_step;
     int cpu;
 
-    /*
-     * Pick from online CPUs in VCPU's affinity mask, giving a
-     * preference to its current processor if it's in there.
-     */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
-            ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
-    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
+    for_each_csched_balance_step( balance_step )
+    {
+        /* Pick an online CPU from the proper affinity mask */
+        ret = csched_balance_cpumask(vc, balance_step, &cpus);
+        cpumask_and(&cpus, &cpus, online);
 
-    /*
-     * Try to find an idle processor within the above constraints.
-     *
-     * In multi-core and multi-threaded CPUs, not all idle execution
-     * vehicles are equal!
-     *
-     * We give preference to the idle execution vehicle with the most
-     * idling neighbours in its grouping. This distributes work across
-     * distinct cores first and guarantees we don't do something stupid
-     * like run two VCPUs on co-hyperthreads while there are idle cores
-     * or sockets.
-     *
-     * Notice that, when computing the "idleness" of cpu, we may want to
-     * discount vc. That is, iff vc is the currently running and the only
-     * runnable vcpu on cpu, we add cpu to the idlers.
-     */
-    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
-    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
-        cpumask_set_cpu(cpu, &idlers);
-    cpumask_and(&cpus, &cpus, &idlers);
-    cpumask_clear_cpu(cpu, &cpus);
+        /* If present, prefer vc's current processor */
+        cpu = cpumask_test_cpu(vc->processor, &cpus)
+                ? vc->processor
+                : cpumask_cycle(vc->processor, &cpus);
+        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
 
-    while ( !cpumask_empty(&cpus) )
-    {
-        cpumask_t cpu_idlers;
-        cpumask_t nxt_idlers;
-        int nxt, weight_cpu, weight_nxt;
-        int migrate_factor;
+        /*
+         * Try to find an idle processor within the above constraints.
+         *
+         * In multi-core and multi-threaded CPUs, not all idle execution
+         * vehicles are equal!
+         *
+         * We give preference to the idle execution vehicle with the most
+         * idling neighbours in its grouping. This distributes work across
+         * distinct cores first and guarantees we don't do something stupid
+         * like run two VCPUs on co-hyperthreads while there are idle cores
+         * or sockets.
+         *
+         * Notice that, when computing the "idleness" of cpu, we may want to
+         * discount vc. That is, iff vc is the currently running and the only
+         * runnable vcpu on cpu, we add cpu to the idlers.
+         */
+        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
+        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
+            cpumask_set_cpu(cpu, &idlers);
+        cpumask_and(&cpus, &cpus, &idlers);
+        /* If there are idlers and cpu is still not among them, pick one */
+        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
+            cpu = cpumask_cycle(cpu, &cpus);
+        cpumask_clear_cpu(cpu, &cpus);
 
-        nxt = cpumask_cycle(cpu, &cpus);
+        while ( !cpumask_empty(&cpus) )
+        {
+            cpumask_t cpu_idlers;
+            cpumask_t nxt_idlers;
+            int nxt, weight_cpu, weight_nxt;
+            int migrate_factor;
 
-        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
-        {
-            /* We're on the same socket, so check the busy-ness of threads.
-             * Migrate if # of idlers is less at all */
-            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 1;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
-        }
-        else
-        {
-            /* We're on different sockets, so check the busy-ness of cores.
-             * Migrate only if the other core is twice as idle */
-            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
-            migrate_factor = 2;
-            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
-            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            nxt = cpumask_cycle(cpu, &cpus);
+
+            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
+            {
+                /* We're on the same socket, so check the busy-ness of threads.
+                 * Migrate if # of idlers is less at all */
+                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 1;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
+                            nxt));
+            }
+            else
+            {
+                /* We're on different sockets, so check the busy-ness of cores.
+                 * Migrate only if the other core is twice as idle */
+                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
+                migrate_factor = 2;
+                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
+                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
+            }
+
+            weight_cpu = cpumask_weight(&cpu_idlers);
+            weight_nxt = cpumask_weight(&nxt_idlers);
+            /* smt_power_savings: consolidate work rather than spreading it */
+            if ( sched_smt_power_savings ?
+                 weight_cpu > weight_nxt :
+                 weight_cpu * migrate_factor < weight_nxt )
+            {
+                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
+                spc = CSCHED_PCPU(nxt);
+                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
+                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
+            }
+            else
+            {
+                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
+            }
         }
 
-        weight_cpu = cpumask_weight(&cpu_idlers);
-        weight_nxt = cpumask_weight(&nxt_idlers);
-        /* smt_power_savings: consolidate work rather than spreading it */
-        if ( sched_smt_power_savings ?
-             weight_cpu > weight_nxt :
-             weight_cpu * migrate_factor < weight_nxt )
-        {
-            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
-            spc = CSCHED_PCPU(nxt);
-            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
-            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
-        }
-        else
-        {
-            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
-        }
+        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
+        if ( cpumask_test_cpu(cpu, &idlers) || ret )
+            break;
     }
 
     if ( commit && spc )
@@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu
     if ( sdom == NULL )
         return NULL;
 
+    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
+    {
+        xfree(sdom);
+        return NULL;
+    }
+    cpumask_setall(sdom->node_affinity_cpumask);
+
     /* Initialize credit and weight */
     INIT_LIST_HEAD(&sdom->active_vcpu);
     sdom->active_vcpu_count = 0;
@@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler *
 static void
 csched_free_domdata(const struct scheduler *ops, void *data)
 {
+    struct csched_dom *sdom = data;
+
+    free_cpumask_var(sdom->node_affinity_cpumask);
     xfree(data);
 }
 
@@ -1240,9 +1369,10 @@ csched_tick(void *_cpu)
 }
 
 static struct csched_vcpu *
-csched_runq_steal(int peer_cpu, int cpu, int pri)
+csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
 {
     const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
+    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
     const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
     struct csched_vcpu *speer;
     struct list_head *iter;
@@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
             if ( speer->pri <= pri )
                 break;
 
-            /* Is this VCPU is runnable on our PCPU? */
+            /* Is this VCPU runnable on our PCPU? */
             vc = speer->vcpu;
             BUG_ON( is_idle_vcpu(vc) );
 
-            if (__csched_vcpu_is_migrateable(vc, cpu))
+            /*
+             * Retrieve the correct mask for this balance_step or, if we're
+             * dealing with node-affinity and the vcpu has no node affinity
+             * at all, just skip this vcpu. That is needed if we want to
+             * check if we have any node-affine work to steal first (wrt
+             * any vcpu-affine work).
+             */
+            if ( csched_balance_cpumask(vc, balance_step,
+                                        &scratch_balance_mask) )
+                continue;
+
+            if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask)
+                 && __csched_vcpu_should_migrate(cpu, &scratch_balance_mask,
+                                                 prv->idlers) )
             {
                 /* We got a candidate. Grab it! */
                 TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
@@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
     struct csched_vcpu *speer;
     cpumask_t workers;
     cpumask_t *online;
-    int peer_cpu;
+    int peer_cpu, peer_node, bstep;
+    int node = cpu_to_node(cpu);
 
     BUG_ON( cpu != snext->vcpu->processor );
     online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
@@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
         SCHED_STAT_CRANK(load_balance_other);
 
     /*
-     * Peek at non-idling CPUs in the system, starting with our
-     * immediate neighbour.
+     * Let's look around for work to steal, taking both vcpu-affinity
+     * and node-affinity into account. More specifically, we check all
+     * the non-idle CPUs' runq, looking for:
+     *  1. any node-affine work to steal first,
+     *  2. if not finding anything, any vcpu-affine work to steal.
      */
-    cpumask_andnot(&workers, online, prv->idlers);
-    cpumask_clear_cpu(cpu, &workers);
-    peer_cpu = cpu;
+    for_each_csched_balance_step( bstep )
+    {
+        /*
+         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
+         * it is more likely that we find some node-affine work on our same
+         * node, not to mention that migrating vcpus within the same node
+         * could well expected to be cheaper than across-nodes (memory
+         * stays local, there might be some node-wide cache[s], etc.).
+         */
+        peer_node = node;
+        do
+        {
+            /* Find out what the !idle are in this node */
+            cpumask_andnot(&workers, online, prv->idlers);
+            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
+            cpumask_clear_cpu(cpu, &workers);
 
-    while ( !cpumask_empty(&workers) )
-    {
-        peer_cpu = cpumask_cycle(peer_cpu, &workers);
-        cpumask_clear_cpu(peer_cpu, &workers);
+            if ( cpumask_empty(&workers) )
+                goto next_node;
 
-        /*
-         * Get ahold of the scheduler lock for this peer CPU.
-         *
-         * Note: We don't spin on this lock but simply try it. Spinning could
-         * cause a deadlock if the peer CPU is also load balancing and trying
-         * to lock this CPU.
-         */
-        if ( !pcpu_schedule_trylock(peer_cpu) )
-        {
-            SCHED_STAT_CRANK(steal_trylock_failed);
-            continue;
-        }
+            peer_cpu = cpumask_first(&workers);
+            do
+            {
+                /*
+                 * Get ahold of the scheduler lock for this peer CPU.
+                 *
+                 * Note: We don't spin on this lock but simply try it. Spinning
+                 * could cause a deadlock if the peer CPU is also load
+                 * balancing and trying to lock this CPU.
+                 */
+                if ( !pcpu_schedule_trylock(peer_cpu) )
+                {
+                    SCHED_STAT_CRANK(steal_trylock_failed);
+                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
+                    continue;
+                }
 
-        /*
-         * Any work over there to steal?
-         */
-        speer = cpumask_test_cpu(peer_cpu, online) ?
-            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
-        pcpu_schedule_unlock(peer_cpu);
-        if ( speer != NULL )
-        {
-            *stolen = 1;
-            return speer;
-        }
+                /* Any work over there to steal? */
+                speer = cpumask_test_cpu(peer_cpu, online) ?
+                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
+                pcpu_schedule_unlock(peer_cpu);
+
+                /* As soon as one vcpu is found, balancing ends */
+                if ( speer != NULL )
+                {
+                    *stolen = 1;
+                    return speer;
+                }
+
+                peer_cpu = cpumask_cycle(peer_cpu, &workers);
+
+            } while( peer_cpu != cpumask_first(&workers) );
+
+ next_node:
+            peer_node = cycle_node(peer_node, node_online_map);
+        } while( peer_node != node );
     }
 
  out:
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -41,6 +41,8 @@
  * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
  * int first_unset_node(mask)		First node not set in mask, or 
  *					MAX_NUMNODES.
+ * int cycle_node(node, mask)		Next node cycling from 'node', or
+ *					MAX_NUMNODES
  *
  * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
  * NODE_MASK_ALL			Initializer - all bits set
@@ -254,6 +256,16 @@ static inline int __first_unset_node(con
 			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
 }
 
+#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES)
+static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
+{
+    int nxt = __next_node(n, maskp, nbits);
+
+    if (nxt == nbits)
+        nxt = __first_node(maskp, nbits);
+    return nxt;
+}
+
 #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
 
 #if MAX_NUMNODES <= BITS_PER_LONG

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2012-12-20  6:44   ` Juergen Gross
  2012-12-20  8:16     ` Dario Faggioli
  2012-12-20 15:56   ` George Dunlap
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  6:44 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf, Matt Wilson

Am 19.12.2012 20:07, schrieb Dario Faggioli:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict, it only expresses
> a preference, although honouring it is almost always true that will
> bring significant performances benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
> ---
> Changes from v1:
>   * CPU masks variables moved off from the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them of (via per_cpu());

Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
This could be applied to several other instances of cpu masks on the stack,
too.

>   * George suggested a different load balancing logic during v1's review. I
>     think he was right and then I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective noda-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
> +
> +/*
> + * When building for high number of CPUs, cpumask_var_t
> + * variables on stack are better avoided. However, we need them,
> + * in order to be able to consider both vcpu and node affinity.
> + * We also don't want to xmalloc()/xfree() them, as that would
> + * happen in critical code paths. Therefore, let's (pre)allocate
> + * some scratch space for them.
> + *
> + * Having one mask for each instance of the scheduler seems
> + * enough, and that would suggest putting it wihin `struct
> + * csched_private' below. However, we don't always hold the
> + * private scheduler lock when the mask itself would need to
> + * be used, leaving room for races. For that reason, we define
> + * and use a cpumask_t for each CPU. As preemption is not an
> + * issue here (we're holding the runqueue spin-lock!), that is
> + * both enough and safe.
> + */
> +DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
> +#define scratch_balance_mask (this_cpu(csched_balance_mask))
> +
> +/*
>    * Boot parameters
>    */
>   static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
> @@ -159,6 +186,9 @@ struct csched_dom {
>       struct list_head active_vcpu;
>       struct list_head active_sdom_elem;
>       struct domain *dom;
> +    /* cpumask translated from the domain's node-affinity.
> +     * Basically, the CPUs we prefer to be scheduled on. */
> +    cpumask_var_t node_affinity_cpumask;
>       uint16_t active_vcpu_count;
>       uint16_t weight;
>       uint16_t cap;
> @@ -239,6 +269,42 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step)>= 0; (__step)-- )
> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like, but
> + * is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node affinity is automatically computed), not only the
> + * computed mask will reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that he can skip the step or quit the loop (if he
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;
> +    }
> +    else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> +        cpumask_copy(mask, vc->cpu_affinity);
> +
> +    return 0;
> +}
> +
>   static void burn_credits(struct csched_vcpu *svc, s_time_t now)
>   {
>       s_time_t delta;
> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty&&  new->pri>  cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu,&mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask)&&  new->pri>  cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty&&  new->pri>  cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu,&mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +&scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers,&scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before tacking final decisions.
> +             */
> +            if ( new_idlers_empty
> +&&  (balance_step == CSCHED_BALANCE_NODE_AFFINITY&&  !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty&&  new->pri>  cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating,&cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu,&mask);
>               }
> -            else
> -                cpumask_or(&mask,&mask,&idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask);
> +                }
> +                else
> +                    cpumask_or(&mask,&mask,&idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }
>
>       if ( !cpumask_empty(&mask) )
> @@ -475,15 +568,28 @@ static inline int
>   }
>
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running&&
>              !__csched_vcpu_is_cache_hot(vc)&&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers)&&  cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor,&cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor,&cpus);
> -    ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step,&cpus);
> +        cpumask_and(&cpus,&cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu,&idlers);
> -    cpumask_and(&cpus,&cpus,&idlers);
> -    cpumask_clear_cpu(cpu,&cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor,&cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor,&cpus);
> +        ASSERT( !cpumask_empty(&cpus)&&  cpumask_test_cpu(cpu,&cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu&&  IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu,&idlers);
> +        cpumask_and(&cpus,&cpus,&idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus)&&  !cpumask_test_cpu(cpu,&cpus) )
> +            cpu = cpumask_cycle(cpu,&cpus);
> +        cpumask_clear_cpu(cpu,&cpus);
>
> -        nxt = cpumask_cycle(cpu,&cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu,&cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu>  weight_nxt :
> +                 weight_cpu * migrate_factor<  weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> +                cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu>  weight_nxt :
> -             weight_cpu * migrate_factor<  weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers,&cpus,&nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers);
> -            cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus,&cpus,&nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu,&idlers) || ret )
> +            break;
>       }
>
>       if ( commit&&  spc )
> @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu
>       if ( sdom == NULL )
>           return NULL;
>
> +    if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) )
> +    {
> +        xfree(sdom);
> +        return NULL;
> +    }
> +    cpumask_setall(sdom->node_affinity_cpumask);
> +
>       /* Initialize credit and weight */
>       INIT_LIST_HEAD(&sdom->active_vcpu);
>       sdom->active_vcpu_count = 0;
> @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler *
>   static void
>   csched_free_domdata(const struct scheduler *ops, void *data)
>   {
> +    struct csched_dom *sdom = data;
> +
> +    free_cpumask_var(sdom->node_affinity_cpumask);
>       xfree(data);
>   }
>
> @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu)
>   }
>
>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri<= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +&scratch_balance_mask) )
> +                continue;
> +
> +            if ( __csched_vcpu_is_migrateable(vc, cpu,&scratch_balance_mask)
> +&&  __csched_vcpu_should_migrate(cpu,&scratch_balance_mask,
> +                                                 prv->idlers) )
>               {
>                   /* We got a candidate. Grab it! */
>                   TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu,
> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu,&workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers,&workers,&node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu,&workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu,&workers);
> -        cpumask_clear_cpu(peer_cpu,&workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu,&workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu,&workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }
>
>    out:
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -41,6 +41,8 @@
>    * int last_node(mask)			Number highest set bit, or MAX_NUMNODES
>    * int first_unset_node(mask)		First node not set in mask, or
>    *					MAX_NUMNODES.
> + * int cycle_node(node, mask)		Next node cycling from 'node', or
> + *					MAX_NUMNODES
>    *
>    * nodemask_t nodemask_of_node(node)	Return nodemask with bit 'node' set
>    * NODE_MASK_ALL			Initializer - all bits set
> @@ -254,6 +256,16 @@ static inline int __first_unset_node(con
>   			find_first_zero_bit(maskp->bits, MAX_NUMNODES));
>   }
>
> +#define cycle_node(n, src) __cycle_node((n),&(src), MAX_NUMNODES)
> +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits)
> +{
> +    int nxt = __next_node(n, maskp, nbits);
> +
> +    if (nxt == nbits)
> +        nxt = __first_node(maskp, nbits);
> +    return nxt;
> +}
> +
>   #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES)
>
>   #if MAX_NUMNODES<= BITS_PER_LONG
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
>


-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  6:44   ` Juergen Gross
@ 2012-12-20  8:16     ` Dario Faggioli
  2012-12-20  8:25       ` Juergen Gross
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:16 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1893 bytes --]

On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote: 
> Am 19.12.2012 20:07, schrieb Dario Faggioli:
> > [...] 
> >
> > This change modifies the VCPU load balancing algorithm (for the
> > credit scheduler only), introducing a two steps logic.
> > During the first step, we use the node-affinity mask. The aim is
> > giving precedence to the CPUs where it is known to be preferable
> > for the domain to run. If that fails in finding a valid PCPU, the
> > node-affinity is just ignored and, in the second step, we fall
> > back to using cpu-affinity only.
> >
> > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
> > ---
> > Changes from v1:
> >   * CPU masks variables moved off from the stack, as requested during
> >     review. As per the comments in the code, having them in the private
> >     (per-scheduler instance) struct could have been enough, but it would be
> >     racy (again, see comments). For that reason, use a global bunch of
> >     them of (via per_cpu());
> 
> Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
> This could be applied to several other instances of cpu masks on the stack,
> too.
> 
Yes, as I tired to explain, if it's per-cpu it should be fine, since
credit has one runq per each CPU and hence runq lock is enough for
serialization.

BTW, can you be a little bit more specific about where you're suggesting
to put it? I'm sorry but I'm not sure I figured what you mean by "the
scheduler private per-pcpu area"... Do you perhaps mean making it a
member of `struct csched_pcpu' ?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:16     ` Dario Faggioli
@ 2012-12-20  8:25       ` Juergen Gross
  2012-12-20  8:33         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  8:25 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson, xen-devel, Jan Beulich,
	Daniel De Graaf, Matt Wilson

Am 20.12.2012 09:16, schrieb Dario Faggioli:
> On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote:
>> Am 19.12.2012 20:07, schrieb Dario Faggioli:
>>> [...]
>>>
>>> This change modifies the VCPU load balancing algorithm (for the
>>> credit scheduler only), introducing a two steps logic.
>>> During the first step, we use the node-affinity mask. The aim is
>>> giving precedence to the CPUs where it is known to be preferable
>>> for the domain to run. If that fails in finding a valid PCPU, the
>>> node-affinity is just ignored and, in the second step, we fall
>>> back to using cpu-affinity only.
>>>
>>> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>>> ---
>>> Changes from v1:
>>>    * CPU masks variables moved off from the stack, as requested during
>>>      review. As per the comments in the code, having them in the private
>>>      (per-scheduler instance) struct could have been enough, but it would be
>>>      racy (again, see comments). For that reason, use a global bunch of
>>>      them of (via per_cpu());
>>
>> Wouldn't it be better to put the mask in the scheduler private per-pcpu area?
>> This could be applied to several other instances of cpu masks on the stack,
>> too.
>>
> Yes, as I tired to explain, if it's per-cpu it should be fine, since
> credit has one runq per each CPU and hence runq lock is enough for
> serialization.
>
> BTW, can you be a little bit more specific about where you're suggesting
> to put it? I'm sorry but I'm not sure I figured what you mean by "the
> scheduler private per-pcpu area"... Do you perhaps mean making it a
> member of `struct csched_pcpu' ?

Yes, that's what I would suggest.

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:25       ` Juergen Gross
@ 2012-12-20  8:33         ` Dario Faggioli
  2012-12-20  8:39           ` Juergen Gross
  2012-12-20  9:22           ` Jan Beulich
  0 siblings, 2 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:33 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1246 bytes --]

On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote: 
> > BTW, can you be a little bit more specific about where you're suggesting
> > to put it? I'm sorry but I'm not sure I figured what you mean by "the
> > scheduler private per-pcpu area"... Do you perhaps mean making it a
> > member of `struct csched_pcpu' ?
> 
> Yes, that's what I would suggest.
> 
Ok then, functionally, that is going to be exactly the same thing as
where it is right now, i.e., a set of global per_cpu() variables. It is
possible for your solution to bring some cache/locality benefits,
although that will very much depends on single cases, architecture,
workload, etc.

That being said, I'm definitely fine with it and can go for it.

It was Jan that suggested/asked to pull them out of the stack, and I
guess it's George's taste that we value most when hacking sched_*.c, so
let's see if they want to comment on this and then I'll decide. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:33         ` Dario Faggioli
@ 2012-12-20  8:39           ` Juergen Gross
  2012-12-20  8:58             ` Dario Faggioli
  2012-12-20 15:28             ` George Dunlap
  2012-12-20  9:22           ` Jan Beulich
  1 sibling, 2 replies; 57+ messages in thread
From: Juergen Gross @ 2012-12-20  8:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

Am 20.12.2012 09:33, schrieb Dario Faggioli:
> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote:
>>> BTW, can you be a little bit more specific about where you're suggesting
>>> to put it? I'm sorry but I'm not sure I figured what you mean by "the
>>> scheduler private per-pcpu area"... Do you perhaps mean making it a
>>> member of `struct csched_pcpu' ?
>>
>> Yes, that's what I would suggest.
>>
> Ok then, functionally, that is going to be exactly the same thing as
> where it is right now, i.e., a set of global per_cpu() variables. It is
> possible for your solution to bring some cache/locality benefits,
> although that will very much depends on single cases, architecture,
> workload, etc.

The space is only allocated if sched_credit is responsible for the pcpu.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                   Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:39           ` Juergen Gross
@ 2012-12-20  8:58             ` Dario Faggioli
  2012-12-20 15:28             ` George Dunlap
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  8:58 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 785 bytes --]

On Thu, 2012-12-20 at 08:39 +0000, Juergen Gross wrote: 
> > Ok then, functionally, that is going to be exactly the same thing as
> > where it is right now, i.e., a set of global per_cpu() variables. It is
> > possible for your solution to bring some cache/locality benefits,
> > although that will very much depends on single cases, architecture,
> > workload, etc.
> 
> The space is only allocated if sched_credit is responsible for the pcpu.
> 
Right, that's true too.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:39           ` Juergen Gross
  2012-12-20  8:58             ` Dario Faggioli
@ 2012-12-20 15:28             ` George Dunlap
  2012-12-20 16:00               ` Dario Faggioli
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 15:28 UTC (permalink / raw)
  To: Juergen Gross
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 20/12/12 08:39, Juergen Gross wrote:
> Am 20.12.2012 09:33, schrieb Dario Faggioli:
>> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote:
>>>> BTW, can you be a little bit more specific about where you're suggesting
>>>> to put it? I'm sorry but I'm not sure I figured what you mean by "the
>>>> scheduler private per-pcpu area"... Do you perhaps mean making it a
>>>> member of `struct csched_pcpu' ?
>>> Yes, that's what I would suggest.
>>>
>> Ok then, functionally, that is going to be exactly the same thing as
>> where it is right now, i.e., a set of global per_cpu() variables. It is
>> possible for your solution to bring some cache/locality benefits,
>> although that will very much depends on single cases, architecture,
>> workload, etc.
> The space is only allocated if sched_credit is responsible for the pcpu.

That makes sense.
  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 15:28             ` George Dunlap
@ 2012-12-20 16:00               ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 16:00 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On Thu, Dec 20, 2012 at 4:28 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
>> The space is only allocated if sched_credit is responsible for the pcpu.
>
>
> That makes sense.
>  -George
>
Ok, already done here in my queue. When I repost, you'll find it there. :-)

Thanks an Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20  8:33         ` Dario Faggioli
  2012-12-20  8:39           ` Juergen Gross
@ 2012-12-20  9:22           ` Jan Beulich
  1 sibling, 0 replies; 57+ messages in thread
From: Jan Beulich @ 2012-12-20  9:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, AnilMadhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Matt Wilson, Daniel De Graaf

>>> On 20.12.12 at 09:33, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> It was Jan that suggested/asked to pull them out of the stack, and I
> guess it's George's taste that we value most when hacking sched_*.c, so
> let's see if they want to comment on this and then I'll decide. :-)

Yes, I'm with Juergen here.

Jan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
@ 2012-12-20 15:56   ` George Dunlap
  2012-12-20 17:12     ` Dario Faggioli
  2012-12-20 16:48   ` George Dunlap
  2012-12-20 20:21   ` George Dunlap
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 15:56 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> As vcpu-affinity tells where VCPUs must run, node-affinity tells
> where they should or, better, prefer. While respecting vcpu-affinity
> remains mandatory, node-affinity is not that strict, it only expresses
> a preference, although honouring it is almost always true that will
> bring significant performances benefit (especially as compared to
> not having any affinity at all).
>
> This change modifies the VCPU load balancing algorithm (for the
> credit scheduler only), introducing a two steps logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails in finding a valid PCPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

This one has a lot of structural comments; so I'm going to send a couple 
of different mails as I'm going through it, so we can parallize the 
discussion better. :-)

> ---
> Changes from v1:
>   * CPU masks variables moved off from the stack, as requested during
>     review. As per the comments in the code, having them in the private
>     (per-scheduler instance) struct could have been enough, but it would be
>     racy (again, see comments). For that reason, use a global bunch of
>     them of (via per_cpu());
>   * George suggested a different load balancing logic during v1's review. I
>     think he was right and then I changed the old implementation in a way
>     that resembles exactly that. I rewrote most of this patch to introduce
>     a more sensible and effective noda-affinity handling logic.
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -111,6 +111,33 @@
>
>
>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
[snip]
> +#define for_each_csched_balance_step(__step) \
> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )

Why are we starting at the top and going down?  Is there any good reason 
for it?

Every time you do anything unexpected, you add to the cognitive load of 
the person reading your code, leaving less spare processing power or 
memory for other bits of the code, and increasing (sligthly) the chance 
of making a mistake.  The most natural thing would be for someone to 
expect that the steps start at 0 and go up; just reversing it means it's 
that little bit harder to understand.  When you name it "LAST", it's 
even worse, because that would definitely imply that this step is going 
to be executed last.

So why not just have this be as follows?

for(step=0; step<CSCHED_BALANCE_MAX; step++)

> +
> +/*
> + * Each csched-balance step has to use its own cpumask. This function
> + * determines which one, given the step, and copies it in mask. Notice
> + * that, in case of node-affinity balancing step, it also filters out from
> + * the node-affinity mask the cpus that are not part of vc's cpu-affinity,
> + * as we do not want to end up running a vcpu where it would like, but
> + * is not allowed to!
> + *
> + * As an optimization, if a domain does not have any node-affinity at all
> + * (namely, its node affinity is automatically computed), not only the
> + * computed mask will reflect its vcpu-affinity, but we also return -1 to
> + * let the caller know that he can skip the step or quit the loop (if he
> + * wants).
> + */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
> +    {
> +        struct domain *d = vc->domain;
> +        struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
> +
> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +            return -1;

There's no optimization in having this comparison done here.  You're not 
reading something from a local variable that you've just calculated.  
But hiding this comparison inside this function, and disguising it as 
"returns -1", does increase the cognitive load on anybody trying to read 
and understand the code -- particularly, as how the return value is used 
is not really clear.

Also, when you use this value, effectively what you're doing is saying, 
"Actually, we just said we were doing the NODE_BALANCE step, but it 
turns out that the results of NODE_BALANCE and CPU_BALANCE will be the 
same, so we're just going to pretend that we've been doing the 
CPU_BALANCE step instead."  (See for example, "balance_step == 
CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause?  
Because if !ret then we're not actually doing NODE_AFFINITY now, but 
CPU_AFFINITY.)  Another non-negligible chunk of cognitive load for 
someone reading the code to 1) figure out, and 2) keep in mind as she 
tries to analyze it.

I took a look at all the places which use this return value, and it 
seems like the best thing in each case would just be to have the 
*caller*, before getting into the loop, call 
cpumask_full(sdom->node_affinity_cpumask) and just skip the 
CSCHED_NODE_BALANCE step altogether if it's true.  (Example below.)

> @@ -266,67 +332,94 @@ static inline void
>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>       cpumask_t mask, idle_mask;
> -    int idlers_empty;
> +    int balance_step, idlers_empty;
>
>       ASSERT(cur);
> -    cpumask_clear(&mask);
> -
>       idlers_empty = cpumask_empty(prv->idlers);
>
>       /*
> -     * If the pcpu is idle, or there are no idlers and the new
> -     * vcpu is a higher priority than the old vcpu, run it here.
> -     *
> -     * If there are idle cpus, first try to find one suitable to run
> -     * new, so we can avoid preempting cur.  If we cannot find a
> -     * suitable idler on which to run new, run it here, but try to
> -     * find a suitable idler on which to run cur instead.
> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
> +     * no node-affinity at all is present, scratch_balance_mask reflects
> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
> +     * loop after only one step.
>        */
> -    if ( cur->pri == CSCHED_PRI_IDLE
> -         || (idlers_empty && new->pri > cur->pri) )
> +    for_each_csched_balance_step( balance_step )
>       {
> -        if ( cur->pri != CSCHED_PRI_IDLE )
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -        cpumask_set_cpu(cpu, &mask);
> -    }
> -    else if ( !idlers_empty )
> -    {
> -        /* Check whether or not there are idlers that can run new */
> -        cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> +        int ret, new_idlers_empty;
> +
> +        cpumask_clear(&mask);
>
>           /*
> -         * If there are no suitable idlers for new, and it's higher
> -         * priority than cur, ask the scheduler to migrate cur away.
> -         * We have to act like this (instead of just waking some of
> -         * the idlers suitable for cur) because cur is running.
> +         * If the pcpu is idle, or there are no idlers and the new
> +         * vcpu is a higher priority than the old vcpu, run it here.
>            *
> -         * If there are suitable idlers for new, no matter priorities,
> -         * leave cur alone (as it is running and is, likely, cache-hot)
> -         * and wake some of them (which is waking up and so is, likely,
> -         * cache cold anyway).
> +         * If there are idle cpus, first try to find one suitable to run
> +         * new, so we can avoid preempting cur.  If we cannot find a
> +         * suitable idler on which to run new, run it here, but try to
> +         * find a suitable idler on which to run cur instead.
>            */
> -        if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
> +        if ( cur->pri == CSCHED_PRI_IDLE
> +             || (idlers_empty && new->pri > cur->pri) )
>           {
> -            SCHED_STAT_CRANK(tickle_idlers_none);
> -            SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> -            SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> -            SCHED_STAT_CRANK(migrate_kicked_away);
> -            set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +            if ( cur->pri != CSCHED_PRI_IDLE )
> +                SCHED_STAT_CRANK(tickle_idlers_none);
>               cpumask_set_cpu(cpu, &mask);
>           }
> -        else if ( !cpumask_empty(&idle_mask) )
> +        else if ( !idlers_empty )
>           {
> -            /* Which of the idlers suitable for new shall we wake up? */
> -            SCHED_STAT_CRANK(tickle_idlers_some);
> -            if ( opt_tickle_one_idle )
> +            /* Are there idlers suitable for new (for this balance step)? */
> +            ret = csched_balance_cpumask(new->vcpu, balance_step,
> +                                         &scratch_balance_mask);
> +            cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask);
> +            new_idlers_empty = cpumask_empty(&idle_mask);
> +
> +            /*
> +             * Let's not be too harsh! If there aren't idlers suitable
> +             * for new in its node-affinity mask, make sure we check its
> +             * vcpu-affinity as well, before tacking final decisions.
> +             */
> +            if ( new_idlers_empty
> +                 && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) )
> +                continue;
> +
> +            /*
> +             * If there are no suitable idlers for new, and it's higher
> +             * priority than cur, ask the scheduler to migrate cur away.
> +             * We have to act like this (instead of just waking some of
> +             * the idlers suitable for cur) because cur is running.
> +             *
> +             * If there are suitable idlers for new, no matter priorities,
> +             * leave cur alone (as it is running and is, likely, cache-hot)
> +             * and wake some of them (which is waking up and so is, likely,
> +             * cache cold anyway).
> +             */
> +            if ( new_idlers_empty && new->pri > cur->pri )
>               {
> -                this_cpu(last_tickle_cpu) =
> -                    cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> -                cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                SCHED_STAT_CRANK(tickle_idlers_none);
> +                SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> +                SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> +                SCHED_STAT_CRANK(migrate_kicked_away);
> +                set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> +                cpumask_set_cpu(cpu, &mask);
>               }
> -            else
> -                cpumask_or(&mask, &mask, &idle_mask);
> +            else if ( !new_idlers_empty )
> +            {
> +                /* Which of the idlers suitable for new shall we wake up? */
> +                SCHED_STAT_CRANK(tickle_idlers_some);
> +                if ( opt_tickle_one_idle )
> +                {
> +                    this_cpu(last_tickle_cpu) =
> +                        cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> +                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> +                }
> +                else
> +                    cpumask_or(&mask, &mask, &idle_mask);
> +            }
>           }
> +
> +        /* Did we find anyone (or csched_balance_cpumask() says we're done)? */
> +        if ( !cpumask_empty(&mask) || ret )
> +            break;
>       }

The whole logic here is really convoluted and hard to read.  For 
example, if cur->pri==IDLE, then you will always just break of the loop 
after the first iteration.  In that case, why have the if() inside the 
loop to begin with?  And if idlers_empty is true but cur->pri >= 
new->pri, then you'll go through the loop two times, even though both 
times it will come up empty.  And, of course, the whole thing about the 
node affinity mask being checked inside csched_balance_cpumask(), but 
not used until the very end.

A much more straighforward way to arrange it would be:

if(cur->pri=IDLE &c &c)
{
   foo;
}
else if(!idlers_empty)
{
   if(cpumask_full(sdom->node_affinity_cpumask)
     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
   else
     balance_step=CSCHED_BALANCE_NODE_AFFINITY;

   for(; balance_step <= CSCHED_BALANCE_MAX; balance_step++)
  {
  ...
  }
}

That seems a lot clearer to me -- does that make sense?

[To be continued...]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 15:56   ` George Dunlap
@ 2012-12-20 17:12     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 17:12 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On Thu, Dec 20, 2012 at 4:56 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
>> This change modifies the VCPU load balancing algorithm (for the
>> credit scheduler only), introducing a two steps logic.
>> During the first step, we use the node-affinity mask. The aim is
>> giving precedence to the CPUs where it is known to be preferable
>> for the domain to run. If that fails in finding a valid PCPU, the
>> node-affinity is just ignored and, in the second step, we fall
>> back to using cpu-affinity only.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
>
> This one has a lot of structural comments; so I'm going to send a couple of
> different mails as I'm going through it, so we can parallize the discussion
> better. :-)
>
Ok.

>> +#define for_each_csched_balance_step(__step) \
>> +    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
>
>
> Why are we starting at the top and going down?  Is there any good reason for
> it?
>
You're totally right, it looked like this in the very first RFC, when
the whole set of
macros was different. I changed that but never considered this, but I
agree going
up would be more natural.

> So why not just have this be as follows?
>
> for(step=0; step<CSCHED_BALANCE_MAX; step++)
>
Will do.

>> +static int
>> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>> +{
>> +    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
>> +    {
>> +        struct domain *d = vc->domain;
>> +        struct csched_dom *sdom = CSCHED_DOM(d);
>> +
>> +        cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
>> +
>> +        if ( cpumask_full(sdom->node_affinity_cpumask) )
>> +            return -1;
>
>
> There's no optimization in having this comparison done here.  You're not
> reading something from a local variable that you've just calculated.  But
> hiding this comparison inside this function, and disguising it as "returns
> -1", does increase the cognitive load on anybody trying to read and
> understand the code -- particularly, as how the return value is used is not
> really clear.
>
Yes, I again agree this is ugly. The previous version was better, from
the caller
point of view, but had other downsides (IIRC it was mangling with the
loop control
variable directly) so we agreed on this '-1' thing.

> Also, when you use this value, effectively what you're doing is saying,
> "Actually, we just said we were doing the NODE_BALANCE step, but it turns
> out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so
> we're just going to pretend that we've been doing the CPU_BALANCE step
> instead."  (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY
> && !ret" -- why the !ret in this clause?  Because if !ret then we're not
> actually doing NODE_AFFINITY now, but CPU_AFFINITY.)  Another non-negligible
> chunk of cognitive load for someone reading the code to 1) figure out, and
> 2) keep in mind as she tries to analyze it.
>
Totally agreed. :-)

> I took a look at all the places which use this return value, and it seems
> like the best thing in each case would just be to have the *caller*, before
> getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and
> just skip the CSCHED_NODE_BALANCE step altogether if it's true.  (Example
> below.)
>
I will think about it and see if I can find a nice solution for making
that happen
(the point being I'm not sure I like exposing and disseminating that
cpumask_full... thing that much, but I guess I can hide it under some more
macro-ing and stuff).

>> @@ -266,67 +332,94 @@ static inline void
>>       struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu));
>>       struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
>>       cpumask_t mask, idle_mask;
>> -    int idlers_empty;
>> +    int balance_step, idlers_empty;
>>
>>       ASSERT(cur);
>> -    cpumask_clear(&mask);
>> -
>>       idlers_empty = cpumask_empty(prv->idlers);
>>
>>       /*
>> -     * If the pcpu is idle, or there are no idlers and the new
>> -     * vcpu is a higher priority than the old vcpu, run it here.
>> -     *
>> -     * If there are idle cpus, first try to find one suitable to run
>> -     * new, so we can avoid preempting cur.  If we cannot find a
>> -     * suitable idler on which to run new, run it here, but try to
>> -     * find a suitable idler on which to run cur instead.
>> +     * Node and vcpu-affinity balancing loop. To speed things up, in case
>> +     * no node-affinity at all is present, scratch_balance_mask reflects
>> +     * the vcpu-affinity, and ret is -1, so that we then can quit the
>> +     * loop after only one step.
>>        */
>> [snip]
>
> The whole logic here is really convoluted and hard to read.  For example, if
> cur->pri==IDLE, then you will always just break of the loop after the first
> iteration.  In that case, why have the if() inside the loop to begin with?
> And if idlers_empty is true but cur->pri >= new->pri, then you'll go through
> the loop two times, even though both times it will come up empty.  And, of
> course, the whole thing about the node affinity mask being checked inside
> csched_balance_cpumask(), but not used until the very end.
>
I fear it looks convoluted and complex because, well, it _is_ quite complex!
However, I see you're point, and there definitely are chances for it
to, although
being complex, look more complex than it should. :-)

> A much more straighforward way to arrange it would be:
>
> if(cur->pri=IDLE &c &c)
> {
>   foo;
> }
> else if(!idlers_empty)
> {
>   if(cpumask_full(sdom->node_affinity_cpumask)
>     balance_step=CSCHED_BALANCE_CPU_AFFINITY;
>   else
>     balance_step=CSCHED_BALANCE_NODE_AFFINITY;
>
Yes, I think this can be taken out. I'll give some more thinking to this and
see if I can simplify the flow.

> [To be continued...]
>
Thanks for now :-)
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
  2012-12-20  6:44   ` Juergen Gross
  2012-12-20 15:56   ` George Dunlap
@ 2012-12-20 16:48   ` George Dunlap
  2012-12-20 18:18     ` Dario Faggioli
  2012-12-20 20:21   ` George Dunlap
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 16:48 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
>   static inline int
> -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
> +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
>   {
>       /*
>        * Don't pick up work that's in the peer's scheduling tail or hot on
> -     * peer PCPU. Only pick up work that's allowed to run on our CPU.
> +     * peer PCPU. Only pick up work that prefers and/or is allowed to run
> +     * on our CPU.
>        */
>       return !vc->is_running &&
>              !__csched_vcpu_is_cache_hot(vc) &&
> -           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
> +           cpumask_test_cpu(dest_cpu, mask);
> +}
> +
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }

I don't get what this function is for.  The only time you call it is in 
csched_runq_steal(), immediately after calling 
__csched_vcpu_is_migrateable().  But is_migrateable() has already 
checked cpu_mask_test_cpu(cpu, mask).  So why do we need to check it again?

We could just replace this with cpumask_test_cpu(cpu, prv->idlers).  But 
that clause is going to be either true or false for every single 
iteration of all the loops, including the loops in 
csched_load_balance().  Wouldn't it make more sense to check it once in 
csched_load_balance(), rather than doing all those nested loops?

And in any case, looking at the caller of csched_load_balance(), it 
explicitly says to steal work if the next thing on the runqueue of cpu 
has a priority of TS_OVER.  That was chosen for a reason -- if you want 
to change that, you should change it there at the top (and make a 
justification for doing so), not deeply nested in a function like this.

Or am I completely missing something?

>   static struct csched_vcpu *
> -csched_runq_steal(int peer_cpu, int cpu, int pri)
> +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>   {
>       const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu);
> +    struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu));
>       const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu);
>       struct csched_vcpu *speer;
>       struct list_head *iter;
> @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu,
>               if ( speer->pri <= pri )
>                   break;
>
> -            /* Is this VCPU is runnable on our PCPU? */
> +            /* Is this VCPU runnable on our PCPU? */
>               vc = speer->vcpu;
>               BUG_ON( is_idle_vcpu(vc) );
>
> -            if (__csched_vcpu_is_migrateable(vc, cpu))
> +            /*
> +             * Retrieve the correct mask for this balance_step or, if we're
> +             * dealing with node-affinity and the vcpu has no node affinity
> +             * at all, just skip this vcpu. That is needed if we want to
> +             * check if we have any node-affine work to steal first (wrt
> +             * any vcpu-affine work).
> +             */
> +            if ( csched_balance_cpumask(vc, balance_step,
> +                                        &scratch_balance_mask) )
> +                continue;

Again, I think for clarity the best thing to do here is:

if ( balance_step == NODE

      && cpumask_full(speer->sdom->node_affinity_cpumask) )

     continue;

csched_balance_cpumask();

/* Etc. */




> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat
>       struct csched_vcpu *speer;
>       cpumask_t workers;
>       cpumask_t *online;
> -    int peer_cpu;
> +    int peer_cpu, peer_node, bstep;
> +    int node = cpu_to_node(cpu);
>
>       BUG_ON( cpu != snext->vcpu->processor );
>       online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu));
> @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat
>           SCHED_STAT_CRANK(load_balance_other);
>
>       /*
> -     * Peek at non-idling CPUs in the system, starting with our
> -     * immediate neighbour.
> +     * Let's look around for work to steal, taking both vcpu-affinity
> +     * and node-affinity into account. More specifically, we check all
> +     * the non-idle CPUs' runq, looking for:
> +     *  1. any node-affine work to steal first,
> +     *  2. if not finding anything, any vcpu-affine work to steal.
>        */
> -    cpumask_andnot(&workers, online, prv->idlers);
> -    cpumask_clear_cpu(cpu, &workers);
> -    peer_cpu = cpu;
> +    for_each_csched_balance_step( bstep )
> +    {
> +        /*
> +         * We peek at the non-idling CPUs in a node-wise fashion. In fact,
> +         * it is more likely that we find some node-affine work on our same
> +         * node, not to mention that migrating vcpus within the same node
> +         * could well expected to be cheaper than across-nodes (memory
> +         * stays local, there might be some node-wide cache[s], etc.).
> +         */
> +        peer_node = node;
> +        do
> +        {
> +            /* Find out what the !idle are in this node */
> +            cpumask_andnot(&workers, online, prv->idlers);
> +            cpumask_and(&workers, &workers, &node_to_cpumask(peer_node));
> +            cpumask_clear_cpu(cpu, &workers);
>
> -    while ( !cpumask_empty(&workers) )
> -    {
> -        peer_cpu = cpumask_cycle(peer_cpu, &workers);
> -        cpumask_clear_cpu(peer_cpu, &workers);
> +            if ( cpumask_empty(&workers) )
> +                goto next_node;
>
> -        /*
> -         * Get ahold of the scheduler lock for this peer CPU.
> -         *
> -         * Note: We don't spin on this lock but simply try it. Spinning could
> -         * cause a deadlock if the peer CPU is also load balancing and trying
> -         * to lock this CPU.
> -         */
> -        if ( !pcpu_schedule_trylock(peer_cpu) )
> -        {
> -            SCHED_STAT_CRANK(steal_trylock_failed);
> -            continue;
> -        }
> +            peer_cpu = cpumask_first(&workers);
> +            do
> +            {
> +                /*
> +                 * Get ahold of the scheduler lock for this peer CPU.
> +                 *
> +                 * Note: We don't spin on this lock but simply try it. Spinning
> +                 * could cause a deadlock if the peer CPU is also load
> +                 * balancing and trying to lock this CPU.
> +                 */
> +                if ( !pcpu_schedule_trylock(peer_cpu) )
> +                {
> +                    SCHED_STAT_CRANK(steal_trylock_failed);
> +                    peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +                    continue;
> +                }
>
> -        /*
> -         * Any work over there to steal?
> -         */
> -        speer = cpumask_test_cpu(peer_cpu, online) ?
> -            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
> -        pcpu_schedule_unlock(peer_cpu);
> -        if ( speer != NULL )
> -        {
> -            *stolen = 1;
> -            return speer;
> -        }
> +                /* Any work over there to steal? */
> +                speer = cpumask_test_cpu(peer_cpu, online) ?
> +                    csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL;
> +                pcpu_schedule_unlock(peer_cpu);
> +
> +                /* As soon as one vcpu is found, balancing ends */
> +                if ( speer != NULL )
> +                {
> +                    *stolen = 1;
> +                    return speer;
> +                }
> +
> +                peer_cpu = cpumask_cycle(peer_cpu, &workers);
> +
> +            } while( peer_cpu != cpumask_first(&workers) );
> +
> + next_node:
> +            peer_node = cycle_node(peer_node, node_online_map);
> +        } while( peer_node != node );
>       }

These changes all look right.  But then, I'm a bit tired, so I'll give 
it another once-over tomorrow. :-)

[To be continued]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 16:48   ` George Dunlap
@ 2012-12-20 18:18     ` Dario Faggioli
  2012-12-21 14:29       ` George Dunlap
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20 18:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap
<george.dunlap@eu.citrix.com> wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
>> +static inline int
>> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
>> +{
>> +    /*
>> +     * Consent to migration if cpu is one of the idlers in the VCPU's
>> +     * affinity mask. In fact, if that is not the case, it just means it
>> +     * was some other CPU that was tickled and should hence come and pick
>> +     * VCPU up. Migrating it to cpu would only make things worse.
>> +     */
>> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>>   }
>
> And in any case, looking at the caller of csched_load_balance(), it
> explicitly says to steal work if the next thing on the runqueue of cpu has a
> priority of TS_OVER.  That was chosen for a reason -- if you want to change
> that, you should change it there at the top (and make a justification for
> doing so), not deeply nested in a function like this.
>
> Or am I completely missing something?
>
No, you're right. Trying to solve a nasty issue I was seeing, I overlooked I was
changing the underlying logic until that point... Thanks!

What I want to avoid is the following: a vcpu wakes-up on the busy pcpu Y. As
a consequence, the idle pcpu X is tickled. Then, for any unrelated reason, pcpu
Z reschedules and, as it would go idle too, it looks around for any
vcpu to steal,
finds one in Y's runqueue and grabs it. Afterward, when X gets the IPI and
schedules, it just does not find anyone to run and goes back idling.

Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while
it has a full
vcpu-affinity, i.e., can run everywhere). In this case, a vcpu that
could have run on
a pcpu in its node-affinity, executes outside from it. That happens because,
the NODE_BALANCE_STEP in csched_load_balance(), when called by Z, won't
find anything suitable to steal (provided there actually isn't any
vcpu waiting in
any runqueue with node-affinity with Z), while the CPU_BALANCE_STEP will
find our vcpu. :-(

So, what I wanted is something that could tell me whether the pcpu which is
stealing work is the one that has actually been tickled to do so. I
was then using
the pcpu idleness as a (cheap and easy to check) indication of that,
but I now see
this is having side effects I in the first place did not want to cause.

Sorry for that, I probably spent so much time buried, as you where
saying, in the
various nested loops and calls, that I lost the context a little bit! :-P

Ok, I think the problem I was describing is real, and I've seen it happening and
causing performances degradation. However, as I think a good solution
is going to
be more complex than I thought, I'd better repost without this
function and deal with
it in a future separate patch (after having figured out the best way
of doing so). Is
that fine with you?

> These changes all look right.
>
At least. :-)

> But then, I'm a bit tired, so I'll give it
> another once-over tomorrow. :-)
>
I can imagine, looking forward to your next comments.

Thanks a lot and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
---------------------------------------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 18:18     ` Dario Faggioli
@ 2012-12-21 14:29       ` George Dunlap
  2012-12-21 16:07         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 14:29 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 20/12/12 18:18, Dario Faggioli wrote:
> On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap
> <george.dunlap@eu.citrix.com> wrote:
>> And in any case, looking at the caller of csched_load_balance(), it
>> explicitly says to steal work if the next thing on the runqueue of cpu has a
>> priority of TS_OVER.  That was chosen for a reason -- if you want to change
>> that, you should change it there at the top (and make a justification for
>> doing so), not deeply nested in a function like this.
>>
>> Or am I completely missing something?
>>
> No, you're right. Trying to solve a nasty issue I was seeing, I overlooked I was
> changing the underlying logic until that point... Thanks!
>
> What I want to avoid is the following: a vcpu wakes-up on the busy pcpu Y. As
> a consequence, the idle pcpu X is tickled. Then, for any unrelated reason, pcpu
> Z reschedules and, as it would go idle too, it looks around for any
> vcpu to steal,
> finds one in Y's runqueue and grabs it. Afterward, when X gets the IPI and
> schedules, it just does not find anyone to run and goes back idling.
>
> Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while
> it has a full
> vcpu-affinity, i.e., can run everywhere). In this case, a vcpu that
> could have run on
> a pcpu in its node-affinity, executes outside from it. That happens because,
> the NODE_BALANCE_STEP in csched_load_balance(), when called by Z, won't
> find anything suitable to steal (provided there actually isn't any
> vcpu waiting in
> any runqueue with node-affinity with Z), while the CPU_BALANCE_STEP will
> find our vcpu. :-(
>
> So, what I wanted is something that could tell me whether the pcpu which is
> stealing work is the one that has actually been tickled to do so. I
> was then using
> the pcpu idleness as a (cheap and easy to check) indication of that,
> but I now see
> this is having side effects I in the first place did not want to cause.
>
> Sorry for that, I probably spent so much time buried, as you where
> saying, in the
> various nested loops and calls, that I lost the context a little bit! :-P

OK, that makes sense -- I figured it was something like that.  Don't 
feel too bad about missing that connection -- we're all fairly blind to 
our own code, and I only caught it because I was trying to figure out 
what was going on.  That's why we do patch review. :-)

Honestly, the whole "steal work" idea seemed a bit backwards to begin 
with, but now that we're not just dealing with "possible" and "not 
possible", but with "better" and "worse", the work-stealing method of 
load balancing sort of falls down.

It does make sense to do the load-balancing work on idle cpus rather 
than already-busy cpus; but I wonder if what should happen instead is 
that before idling, a pcpu chooses a "busy" pcpu and does a global load 
balancing for it -- i.e., pcpu 1 will look at pcpu 5's runqueue, and 
consider moving away the vcpus on the runqueue not just to itself but to 
any available cpu.

That way, in your example, Z might wake up, look at X's runqueue, and 
say, "This would probably run well on Y -- I'll migrate it there."

But that's kind of a half-baked idea at this point.

> Ok, I think the problem I was describing is real, and I've seen it happening and
> causing performances degradation. However, as I think a good solution
> is going to
> be more complex than I thought, I'd better repost without this
> function and deal with
> it in a future separate patch (after having figured out the best way
> of doing so). Is
> that fine with you?

Yes, that's fine.  Thanks, Dario.

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21 14:29       ` George Dunlap
@ 2012-12-21 16:07         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:07 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 2179 bytes --]

On Fri, 2012-12-21 at 14:29 +0000, George Dunlap wrote: 
> > Sorry for that, I probably spent so much time buried, as you where
> > saying, in the
> > various nested loops and calls, that I lost the context a little bit! :-P
> 
> OK, that makes sense -- I figured it was something like that.  Don't 
> feel too bad about missing that connection -- we're all fairly blind to 
> our own code, and I only caught it because I was trying to figure out 
> what was going on.
>
Yeah, thanks, and no, I won't let this let me down to much, Even if this
was quite a big one. After all, that's what we have patch review for!

> That's why we do patch review. :-)
> 
Hehe, I see we agree. :-)

> Honestly, the whole "steal work" idea seemed a bit backwards to begin 
> with, but now that we're not just dealing with "possible" and "not 
> possible", but with "better" and "worse", the work-stealing method of 
> load balancing sort of falls down.
>
> [snip]
> 
> But that's kind of a half-baked idea at this point.
> 
Yes, this whole stealing work machinery may be rethought a bit. However,
let's get something sane in re NUMA load balancing ASAP, as we planned,
and then we'll see whether/how to rework it with both simplicity and
effectiveness in mind.

> > Ok, I think the problem I was describing is real, and I've seen it happening and
> > causing performances degradation. However, as I think a good solution
> > is going to
> > be more complex than I thought, I'd better repost without this
> > function and deal with
> > it in a future separate patch (after having figured out the best way
> > of doing so). Is
> > that fine with you?
> 
> Yes, that's fine.
>
Ok, I'll sort out all your comments and try to post v3 in early January,
so that you'll find it in your inbox as soon as you'll be back from
vacations! :-)

>   Thanks, Dario.
> 
Thanks to you,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
                     ` (2 preceding siblings ...)
  2012-12-20 16:48   ` George Dunlap
@ 2012-12-20 20:21   ` George Dunlap
  2012-12-21  0:18     ` Dario Faggioli
  3 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-20 20:21 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> +static inline int
> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers)
> +{
> +    /*
> +     * Consent to migration if cpu is one of the idlers in the VCPU's
> +     * affinity mask. In fact, if that is not the case, it just means it
> +     * was some other CPU that was tickled and should hence come and pick
> +     * VCPU up. Migrating it to cpu would only make things worse.
> +     */
> +    return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask);
>   }
>
>   static int
> @@ -493,85 +599,98 @@ static int
>       cpumask_t idlers;
>       cpumask_t *online;
>       struct csched_pcpu *spc = NULL;
> +    int ret, balance_step;
>       int cpu;
>
> -    /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> -     */
>       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> -            ? vc->processor
> -            : cpumask_cycle(vc->processor, &cpus);
> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> +    for_each_csched_balance_step( balance_step )
> +    {
> +        /* Pick an online CPU from the proper affinity mask */
> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> +        cpumask_and(&cpus, &cpus, online);
>
> -    /*
> -     * Try to find an idle processor within the above constraints.
> -     *
> -     * In multi-core and multi-threaded CPUs, not all idle execution
> -     * vehicles are equal!
> -     *
> -     * We give preference to the idle execution vehicle with the most
> -     * idling neighbours in its grouping. This distributes work across
> -     * distinct cores first and guarantees we don't do something stupid
> -     * like run two VCPUs on co-hyperthreads while there are idle cores
> -     * or sockets.
> -     *
> -     * Notice that, when computing the "idleness" of cpu, we may want to
> -     * discount vc. That is, iff vc is the currently running and the only
> -     * runnable vcpu on cpu, we add cpu to the idlers.
> -     */
> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> -        cpumask_set_cpu(cpu, &idlers);
> -    cpumask_and(&cpus, &cpus, &idlers);
> -    cpumask_clear_cpu(cpu, &cpus);
> +        /* If present, prefer vc's current processor */
> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> +                ? vc->processor
> +                : cpumask_cycle(vc->processor, &cpus);
> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> -    while ( !cpumask_empty(&cpus) )
> -    {
> -        cpumask_t cpu_idlers;
> -        cpumask_t nxt_idlers;
> -        int nxt, weight_cpu, weight_nxt;
> -        int migrate_factor;
> +        /*
> +         * Try to find an idle processor within the above constraints.
> +         *
> +         * In multi-core and multi-threaded CPUs, not all idle execution
> +         * vehicles are equal!
> +         *
> +         * We give preference to the idle execution vehicle with the most
> +         * idling neighbours in its grouping. This distributes work across
> +         * distinct cores first and guarantees we don't do something stupid
> +         * like run two VCPUs on co-hyperthreads while there are idle cores
> +         * or sockets.
> +         *
> +         * Notice that, when computing the "idleness" of cpu, we may want to
> +         * discount vc. That is, iff vc is the currently running and the only
> +         * runnable vcpu on cpu, we add cpu to the idlers.
> +         */
> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> +            cpumask_set_cpu(cpu, &idlers);
> +        cpumask_and(&cpus, &cpus, &idlers);
> +        /* If there are idlers and cpu is still not among them, pick one */
> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> +            cpu = cpumask_cycle(cpu, &cpus);

This seems to be an addition to the algorithm -- particularly hidden in 
this kind of "indent a big section that's almost exactly the same", I 
think this at least needs to be called out in the changelog message, 
perhaps put in a separate patch.

Can you comment on to why you think it's necessary?  Was there a 
particular problem you were seeing?

> +        cpumask_clear_cpu(cpu, &cpus);
>
> -        nxt = cpumask_cycle(cpu, &cpus);
> +        while ( !cpumask_empty(&cpus) )
> +        {
> +            cpumask_t cpu_idlers;
> +            cpumask_t nxt_idlers;
> +            int nxt, weight_cpu, weight_nxt;
> +            int migrate_factor;
>
> -        if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> -        {
> -            /* We're on the same socket, so check the busy-ness of threads.
> -             * Migrate if # of idlers is less at all */
> -            ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 1;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt));
> -        }
> -        else
> -        {
> -            /* We're on different sockets, so check the busy-ness of cores.
> -             * Migrate only if the other core is twice as idle */
> -            ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> -            migrate_factor = 2;
> -            cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> -            cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            nxt = cpumask_cycle(cpu, &cpus);
> +
> +            if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) )
> +            {
> +                /* We're on the same socket, so check the busy-ness of threads.
> +                 * Migrate if # of idlers is less at all */
> +                ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 1;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask,
> +                            nxt));
> +            }
> +            else
> +            {
> +                /* We're on different sockets, so check the busy-ness of cores.
> +                 * Migrate only if the other core is twice as idle */
> +                ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) );
> +                migrate_factor = 2;
> +                cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu));
> +                cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt));
> +            }
> +
> +            weight_cpu = cpumask_weight(&cpu_idlers);
> +            weight_nxt = cpumask_weight(&nxt_idlers);
> +            /* smt_power_savings: consolidate work rather than spreading it */
> +            if ( sched_smt_power_savings ?
> +                 weight_cpu > weight_nxt :
> +                 weight_cpu * migrate_factor < weight_nxt )
> +            {
> +                cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> +                spc = CSCHED_PCPU(nxt);
> +                cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> +                cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> +            }
> +            else
> +            {
> +                cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> +            }
>           }
>
> -        weight_cpu = cpumask_weight(&cpu_idlers);
> -        weight_nxt = cpumask_weight(&nxt_idlers);
> -        /* smt_power_savings: consolidate work rather than spreading it */
> -        if ( sched_smt_power_savings ?
> -             weight_cpu > weight_nxt :
> -             weight_cpu * migrate_factor < weight_nxt )
> -        {
> -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> -            spc = CSCHED_PCPU(nxt);
> -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> -        }
> -        else
> -        {
> -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> -        }
> +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> +            break;

Right -- OK, I think everything looks good here, except the "return -1 
from csched_balance_cpumask" thing.  I think it would be better if we 
explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
the NODE step if that's the case.

Also -- and sorry to have to ask this kind of thing, but after sorting 
through the placement algorithm my head hurts -- under what 
circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
point?  It seems like the only possibility would be if:
( (vc->processor was not in the original &cpus [1])
   || !IS_RUNQ_IDLE(vc->processor) )
&& (there are no idlers in the original &cpus)

Which I suppose probably matches the time when we want to move on from 
looking at NODE affinity and look for CPU affinity.

[1] This could happen either if the vcpu/node affinity has changed, or 
if we're currently running outside our node affinity and we're doing the 
NODE step.

OK -- I think I've convinced myself that this is OK as well (apart from 
the hidden check).  I'll come back to look at your response to the load 
balancing thing tomorrow.

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-20 20:21   ` George Dunlap
@ 2012-12-21  0:18     ` Dario Faggioli
  2012-12-21 14:56       ` George Dunlap
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21  0:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 9109 bytes --]

On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote: 
> > -    /*
> > -     * Pick from online CPUs in VCPU's affinity mask, giving a
> > -     * preference to its current processor if it's in there.
> > -     */
> >       online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> > -    cpumask_and(&cpus, online, vc->cpu_affinity);
> > -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> > -            ? vc->processor
> > -            : cpumask_cycle(vc->processor, &cpus);
> > -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > +    for_each_csched_balance_step( balance_step )
> > +    {
> > +        /* Pick an online CPU from the proper affinity mask */
> > +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
> > +        cpumask_and(&cpus, &cpus, online);
> >
> > -    /*
> > -     * Try to find an idle processor within the above constraints.
> > -     * 
> > -     * In multi-core and multi-threaded CPUs, not all idle execution
> > -     * vehicles are equal!
> > -     * 
> > -     * We give preference to the idle execution vehicle with the most
> > -     * idling neighbours in its grouping. This distributes work across
> > -     * distinct cores first and guarantees we don't do something stupid
> > -     * like run two VCPUs on co-hyperthreads while there are idle cores
> > -     * or sockets.
> > -     * 
> > -     * Notice that, when computing the "idleness" of cpu, we may want to
> > -     * discount vc. That is, iff vc is the currently running and the only
> > -     * runnable vcpu on cpu, we add cpu to the idlers.
> > -     */ 
> > -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > -        cpumask_set_cpu(cpu, &idlers);
> > -    cpumask_and(&cpus, &cpus, &idlers);
> > -    cpumask_clear_cpu(cpu, &cpus);
> > +        /* If present, prefer vc's current processor */
> > +        cpu = cpumask_test_cpu(vc->processor, &cpus)
> > +                ? vc->processor
> > +                : cpumask_cycle(vc->processor, &cpus); 
> > +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
> > 
> > -    while ( !cpumask_empty(&cpus) )
> > -    {
> > -        cpumask_t cpu_idlers;
> > -        cpumask_t nxt_idlers;
> > -        int nxt, weight_cpu, weight_nxt;
> > -        int migrate_factor;
> > +        /*
> > +         * Try to find an idle processor within the above constraints.
> > +         *
> > +         * In multi-core and multi-threaded CPUs, not all idle execution
> > +         * vehicles are equal!
> > +         *
> > +         * We give preference to the idle execution vehicle with the most
> > +         * idling neighbours in its grouping. This distributes work across
> > +         * distinct cores first and guarantees we don't do something stupid
> > +         * like run two VCPUs on co-hyperthreads while there are idle cores
> > +         * or sockets.
> > +         *
> > +         * Notice that, when computing the "idleness" of cpu, we may want to
> > +         * discount vc. That is, iff vc is the currently running and the only
> > +         * runnable vcpu on cpu, we add cpu to the idlers.
> > +         */
> > +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
> > +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
> > +            cpumask_set_cpu(cpu, &idlers);
> > +        cpumask_and(&cpus, &cpus, &idlers);
> > +        /* If there are idlers and cpu is still not among them, pick one */
> > +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
> > +            cpu = cpumask_cycle(cpu, &cpus);
> 
> This seems to be an addition to the algorithm -- particularly hidden in 
> this kind of "indent a big section that's almost exactly the same", I 
> think this at least needs to be called out in the changelog message, 
> perhaps put in a separate patch.
> 
You're right, it is an addition, although a minor enough one (at least
from the amount of code point of view, the effect of not having it was
pretty bad! :-P) that I thought it can "hide" here. :-)

But I guess I can put it in a separate patch.

> Can you comment on to why you think it's necessary?  Was there a 
> particular problem you were seeing?
> 
Yep. Suppose vc is for some reason running on a pcpu which is outside
its node-affinity, but that now some pcpus within vc's node-affinity
have become idle. What we would like is vc start running there as soon
as possible, so we expect this call to _csched_pick_cpu() to determine
that.

What happens is we do not use vc->processor (as it is outside of vc's
node-affinity) and 'cpu' gets set to cpumask_cycle(vc->processor,
&cpus), where 'cpu' is the cpumask_and(&cpus, balance_mask, online).
This means 'cpu'. Let's also suppose that 'cpu' now points to a busy
thread but with an idle sibling, and that there aren't any other idle
pcpus (either core or threads). Now, the algorithm evaluates the
idleness of 'cpu', and compares it with the idleness of all the other
pcpus, and it won't find anything better that 'cpu' itself, as all the
other pcpus except its sibling thread are busy, while its sibling thread
has the very same idleness it has (2 threads, 1 idle 1 busy).

The neat effect is vc being moved to 'cpu', which is busy, while it
could have been moved to 'cpu''s sibling thread, which is indeed idle.

The if() I added fixes this by making sure that the reference cpu is an
idle one (if that is possible).

I hope I've explained it correctly, and sorry if it is a little bit
tricky, especially to explain like this (although, believe me, it was
tricky to hunt it out too! :-P). I've seen that happening and I'm almost
sure I kept a trace somewhere, so let me know if you want to see the
"smoking gun". :-)

> > -        weight_cpu = cpumask_weight(&cpu_idlers);
> > -        weight_nxt = cpumask_weight(&nxt_idlers);
> > -        /* smt_power_savings: consolidate work rather than spreading it */
> > -        if ( sched_smt_power_savings ?
> > -             weight_cpu > weight_nxt :
> > -             weight_cpu * migrate_factor < weight_nxt )
> > -        {
> > -            cpumask_and(&nxt_idlers, &cpus, &nxt_idlers);
> > -            spc = CSCHED_PCPU(nxt);
> > -            cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers);
> > -            cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu));
> > -        }
> > -        else
> > -        {
> > -            cpumask_andnot(&cpus, &cpus, &nxt_idlers);
> > -        }
> > +        /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */
> > +        if ( cpumask_test_cpu(cpu, &idlers) || ret )
> > +            break;
> 
> Right -- OK, I think everything looks good here, except the "return -1 
> from csched_balance_cpumask" thing.  I think it would be better if we 
> explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped 
> the NODE step if that's the case.
> 
Yep. Will do this, or something along this line, all over the place.
Thanks.

> Also -- and sorry to have to ask this kind of thing, but after sorting 
> through the placement algorithm my head hurts -- under what 
> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this 
> point?  It seems like the only possibility would be if:
> ( (vc->processor was not in the original &cpus [1])
>    || !IS_RUNQ_IDLE(vc->processor) )
> && (there are no idlers in the original &cpus)
> 
> Which I suppose probably matches the time when we want to move on from 
> looking at NODE affinity and look for CPU affinity.
> 
> [1] This could happen either if the vcpu/node affinity has changed, or 
> if we're currently running outside our node affinity and we're doing the 
> NODE step.
> 
> OK -- I think I've convinced myself that this is OK as well (apart from 
> the hidden check).  I'll come back to look at your response to the load 
> balancing thing tomorrow.
> 
Mmm... Sorry, not sure I follow, does this means that you figured out
and understood why I need that 'if(){break;}' ? Sounds like so, but I
can't be sure (my head hurts a bit too, after having written that
thing! :-D).

If no, consider that, for example, it can be false if all the pcpus in
the mask for this step are busy, and if this step is the node-affinity
step, I do _not_ want to exit the balancing loop, and check the other
pcpus in the vcpu-affinity. OTOH, if I don't put a break somewhere, even
if an idle pcpu is found during the node-affinity balancing step, it
will just go on and check all the others pcpus in the vcpu-affinity,
which would kill having tried to do the balancing here. Makes sense?

Thanks again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21  0:18     ` Dario Faggioli
@ 2012-12-21 14:56       ` George Dunlap
  2012-12-21 16:13         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 14:56 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 21/12/12 00:18, Dario Faggioli wrote:
> On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote:
>>> -    /*
>>> -     * Pick from online CPUs in VCPU's affinity mask, giving a
>>> -     * preference to its current processor if it's in there.
>>> -     */
>>>        online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>>> -    cpumask_and(&cpus, online, vc->cpu_affinity);
>>> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> -            ? vc->processor
>>> -            : cpumask_cycle(vc->processor, &cpus);
>>> -    ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>> +    for_each_csched_balance_step( balance_step )
>>> +    {
>>> +        /* Pick an online CPU from the proper affinity mask */
>>> +        ret = csched_balance_cpumask(vc, balance_step, &cpus);
>>> +        cpumask_and(&cpus, &cpus, online);
>>>
>>> -    /*
>>> -     * Try to find an idle processor within the above constraints.
>>> -     *
>>> -     * In multi-core and multi-threaded CPUs, not all idle execution
>>> -     * vehicles are equal!
>>> -     *
>>> -     * We give preference to the idle execution vehicle with the most
>>> -     * idling neighbours in its grouping. This distributes work across
>>> -     * distinct cores first and guarantees we don't do something stupid
>>> -     * like run two VCPUs on co-hyperthreads while there are idle cores
>>> -     * or sockets.
>>> -     *
>>> -     * Notice that, when computing the "idleness" of cpu, we may want to
>>> -     * discount vc. That is, iff vc is the currently running and the only
>>> -     * runnable vcpu on cpu, we add cpu to the idlers.
>>> -     */
>>> -    cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> -    if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> -        cpumask_set_cpu(cpu, &idlers);
>>> -    cpumask_and(&cpus, &cpus, &idlers);
>>> -    cpumask_clear_cpu(cpu, &cpus);
>>> +        /* If present, prefer vc's current processor */
>>> +        cpu = cpumask_test_cpu(vc->processor, &cpus)
>>> +                ? vc->processor
>>> +                : cpumask_cycle(vc->processor, &cpus);
>>> +        ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>>>
>>> -    while ( !cpumask_empty(&cpus) )
>>> -    {
>>> -        cpumask_t cpu_idlers;
>>> -        cpumask_t nxt_idlers;
>>> -        int nxt, weight_cpu, weight_nxt;
>>> -        int migrate_factor;
>>> +        /*
>>> +         * Try to find an idle processor within the above constraints.
>>> +         *
>>> +         * In multi-core and multi-threaded CPUs, not all idle execution
>>> +         * vehicles are equal!
>>> +         *
>>> +         * We give preference to the idle execution vehicle with the most
>>> +         * idling neighbours in its grouping. This distributes work across
>>> +         * distinct cores first and guarantees we don't do something stupid
>>> +         * like run two VCPUs on co-hyperthreads while there are idle cores
>>> +         * or sockets.
>>> +         *
>>> +         * Notice that, when computing the "idleness" of cpu, we may want to
>>> +         * discount vc. That is, iff vc is the currently running and the only
>>> +         * runnable vcpu on cpu, we add cpu to the idlers.
>>> +         */
>>> +        cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers);
>>> +        if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) )
>>> +            cpumask_set_cpu(cpu, &idlers);
>>> +        cpumask_and(&cpus, &cpus, &idlers);
>>> +        /* If there are idlers and cpu is still not among them, pick one */
>>> +        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
>>> +            cpu = cpumask_cycle(cpu, &cpus);
>> This seems to be an addition to the algorithm -- particularly hidden in
>> this kind of "indent a big section that's almost exactly the same", I
>> think this at least needs to be called out in the changelog message,
>> perhaps put in a separate patch.
>>
> You're right, it is an addition, although a minor enough one (at least
> from the amount of code point of view, the effect of not having it was
> pretty bad! :-P) that I thought it can "hide" here. :-)
>
> But I guess I can put it in a separate patch.
>
>> Can you comment on to why you think it's necessary?  Was there a
>> particular problem you were seeing?
>>
> Yep. Suppose vc is for some reason running on a pcpu which is outside
> its node-affinity, but that now some pcpus within vc's node-affinity
> have become idle. What we would like is vc start running there as soon
> as possible, so we expect this call to _csched_pick_cpu() to determine
> that.
>
> What happens is we do not use vc->processor (as it is outside of vc's
> node-affinity) and 'cpu' gets set to cpumask_cycle(vc->processor,
> &cpus), where 'cpu' is the cpumask_and(&cpus, balance_mask, online).
> This means 'cpu'. Let's also suppose that 'cpu' now points to a busy
> thread but with an idle sibling, and that there aren't any other idle
> pcpus (either core or threads). Now, the algorithm evaluates the
> idleness of 'cpu', and compares it with the idleness of all the other
> pcpus, and it won't find anything better that 'cpu' itself, as all the
> other pcpus except its sibling thread are busy, while its sibling thread
> has the very same idleness it has (2 threads, 1 idle 1 busy).
>
> The neat effect is vc being moved to 'cpu', which is busy, while it
> could have been moved to 'cpu''s sibling thread, which is indeed idle.
>
> The if() I added fixes this by making sure that the reference cpu is an
> idle one (if that is possible).
>
> I hope I've explained it correctly, and sorry if it is a little bit
> tricky, especially to explain like this (although, believe me, it was
> tricky to hunt it out too! :-P). I've seen that happening and I'm almost
> sure I kept a trace somewhere, so let me know if you want to see the
> "smoking gun". :-)

No, the change looks quite plausible.  I guess it's not obvious that the 
balancing code will never migrate from one thread to another thread.  
(That whole algorithm could do with some commenting -- I may submit a 
patch once this series is in.)

I'm really glad you've had the opportunity to take a close look at these 
kinds of things.
>> Also -- and sorry to have to ask this kind of thing, but after sorting
>> through the placement algorithm my head hurts -- under what
>> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this
>> point?  It seems like the only possibility would be if:
>> ( (vc->processor was not in the original &cpus [1])
>>     || !IS_RUNQ_IDLE(vc->processor) )
>> && (there are no idlers in the original &cpus)
>>
>> Which I suppose probably matches the time when we want to move on from
>> looking at NODE affinity and look for CPU affinity.
>>
>> [1] This could happen either if the vcpu/node affinity has changed, or
>> if we're currently running outside our node affinity and we're doing the
>> NODE step.
>>
>> OK -- I think I've convinced myself that this is OK as well (apart from
>> the hidden check).  I'll come back to look at your response to the load
>> balancing thing tomorrow.
>>
> Mmm... Sorry, not sure I follow, does this means that you figured out
> and understood why I need that 'if(){break;}' ? Sounds like so, but I
> can't be sure (my head hurts a bit too, after having written that
> thing! :-D).

Well I always understood why we needed the break -- the purpose is to 
avoid the second run through when it's not necessary.  What I was doing, 
in a sort of "thinking out loud" fashion, seeing under what conditions 
that break might actually happen.  Like the analysis with 
vcpu_should_migrate(), it might have turned out to be redundant, or to 
have missed some cases.  But I think it doesn't, so it's fine. :-)

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
  2012-12-21 14:56       ` George Dunlap
@ 2012-12-21 16:13         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:13 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1621 bytes --]

On Fri, 2012-12-21 at 14:56 +0000, George Dunlap wrote: 
> > I hope I've explained it correctly, and sorry if it is a little bit
> > tricky, especially to explain like this (although, believe me, it was
> > tricky to hunt it out too! :-P). I've seen that happening and I'm almost
> > sure I kept a trace somewhere, so let me know if you want to see the
> > "smoking gun". :-)
> 
> No, the change looks quite plausible.  I guess it's not obvious that the 
> balancing code will never migrate from one thread to another thread.  
>
It was far from obvious to figure out this was happening, yes. :-)

> (That whole algorithm could do with some commenting -- I may submit a 
> patch once this series is in.)
> 
Nice.

> I'm really glad you've had the opportunity to take a close look at these 
> kinds of things.
>
Yeah, well, I'm happy to, scheduling never stops entertaining me, even
(or especially) when it requires my brain-cells to workout so hard! :-D

> What I was doing, 
> in a sort of "thinking out loud" fashion, seeing under what conditions 
> that break might actually happen.  Like the analysis with 
> vcpu_should_migrate(), it might have turned out to be redundant, or to 
> have missed some cases.
>
Yep, I agree, it's another aspects of the patch-review model which is
really helpful.

Thanks,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (2 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:17   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Make it possible to pass the node-affinity of a domain to the hypervisor
from the upper layers, instead of always being computed automatically.

Note that this also required generalizing the Flask hooks for setting
and getting the affinity, so that they now deal with both vcpu and
node affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * added the missing dummy hook for nodeaffinity;
 * let the permission renaming affect flask policies too.

diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors
--- a/tools/flask/policy/policy/flask/access_vectors
+++ b/tools/flask/policy/policy/flask/access_vectors
@@ -47,8 +47,8 @@ class domain
     transition
     max_vcpus
     destroy
-    setvcpuaffinity
-	getvcpuaffinity
+    setaffinity
+	getaffinity
 	scheduler
 	getdomaininfo
 	getvcpuinfo
diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
--- a/tools/flask/policy/policy/mls
+++ b/tools/flask/policy/policy/mls
@@ -70,11 +70,11 @@ mlsconstrain domain transition
 	(( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
 
 # all the domain "read" ops
-mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
+mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
 	((l1 dom l2) or (t1 == mls_priv));
 
 # all the domain "write" ops
-mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
+mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
 	((l1 eq l2) or (t1 == mls_priv));
 
 # This is incomplete - similar constraints must be written for all classes
diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
--- a/tools/flask/policy/policy/modules/xen/xen.if
+++ b/tools/flask/policy/policy/modules/xen/xen.if
@@ -55,9 +55,9 @@ define(`create_domain_build_label', `
 # manage_domain(priv, target)
 #   Allow managing a running domain
 define(`manage_domain', `
-	allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
+	allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
 			getaddrsize pause unpause trigger shutdown destroy
-			setvcpuaffinity setdomainmaxmem };
+			setaffinity setdomainmaxmem };
 ')
 
 # migrate_domain_out(priv, target)
diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
 
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->auto_node_affinity = 1;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -362,11 +363,26 @@ void domain_update_node_affinity(struct 
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
-    for_each_online_node ( node )
-        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-            node_set(node, nodemask);
+    if ( d->auto_node_affinity )
+    {
+        /* Node-affinity is automaically computed from all vcpu-affinities */
+        for_each_online_node ( node )
+            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_set(node, nodemask);
 
-    d->node_affinity = nodemask;
+        d->node_affinity = nodemask;
+    }
+    else
+    {
+        /* Node-affinity is provided by someone else, just filter out cpus
+         * that are either offline or not in the affinity of any vcpus. */
+        for_each_node_mask ( node, d->node_affinity )
+            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_clear(node, d->node_affinity);
+    }
+
+    sched_set_node_affinity(d, &d->node_affinity);
+
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -374,6 +390,36 @@ void domain_update_node_affinity(struct 
 }
 
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    /* Being affine with no nodes is just wrong */
+    if ( nodes_empty(*affinity) )
+        return -EINVAL;
+
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming explicitly affine to all nodes is not particularly
+     * useful. Let's take it as the `reset node affinity` command.
+     */
+    if ( nodes_full(*affinity) )
+    {
+        d->auto_node_affinity = 1;
+        goto out;
+    }
+
+    d->auto_node_affinity = 0;
+    d->node_affinity = *affinity;
+
+out:
+    spin_unlock(&d->node_affinity_lock);
+
+    domain_update_node_affinity(d);
+
+    return 0;
+}
+
+
 struct domain *get_domain_by_id(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
     }
     break;
 
+    case XEN_DOMCTL_setnodeaffinity:
+    case XEN_DOMCTL_getnodeaffinity:
+    {
+        domid_t dom = op->domain;
+        struct domain *d = rcu_lock_domain_by_id(dom);
+
+        ret = -ESRCH;
+        if ( d == NULL )
+            break;
+
+        ret = xsm_nodeaffinity(op->cmd, d);
+        if ( ret )
+            goto nodeaffinity_out;
+
+        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
+        {
+            nodemask_t new_affinity;
+
+            ret = xenctl_bitmap_to_nodemask(&new_affinity,
+                                            &op->u.nodeaffinity.nodemap);
+            if ( !ret )
+                ret = domain_set_node_affinity(d, &new_affinity);
+        }
+        else
+        {
+            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
+                                            &d->node_affinity);
+        }
+
+    nodeaffinity_out:
+        rcu_unlock_domain(d);
+    }
+    break;
+
     case XEN_DOMCTL_setvcpuaffinity:
     case XEN_DOMCTL_getvcpuaffinity:
     {
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -217,6 +217,14 @@ static void cpuset_print(char *set, int 
     *set++ = '\0';
 }
 
+static void nodeset_print(char *set, int size, const nodemask_t *mask)
+{
+    *set++ = '[';
+    set += nodelist_scnprintf(set, size-2, mask);
+    *set++ = ']';
+    *set++ = '\0';
+}
+
 static void periodic_timer_print(char *str, int size, uint64_t period)
 {
     if ( period == 0 )
@@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
 
         dump_pageframe_info(d);
                
+        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
+        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
+
         printk("VCPU information and callbacks for domain %u:\n",
                d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -269,6 +269,33 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+/*
+ * Translates node-affinity mask into a cpumask, so that we can use it during
+ * actual scheduling. That of course will contain all the cpus from all the
+ * set nodes in the original node-affinity mask.
+ *
+ * Note that any serialization needed to access mask safely is complete
+ * responsibility of the caller of this function/hook.
+ */
+static void csched_set_node_affinity(
+    const struct scheduler *ops,
+    struct domain *d,
+    nodemask_t *mask)
+{
+    struct csched_dom *sdom;
+    int node;
+
+    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
+    if ( unlikely(is_idle_domain(d)) )
+        return;
+
+    sdom = CSCHED_DOM(d);
+    cpumask_clear(sdom->node_affinity_cpumask);
+    for_each_node_mask( node, *mask )
+        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
+                   &node_to_cpumask(node));
+}
+
 #define for_each_csched_balance_step(__step) \
     for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
 
@@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu
 
         cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
 
-        if ( cpumask_full(sdom->node_affinity_cpumask) )
+        if ( cpumask_full(sdom->node_affinity_cpumask) ||
+             d->auto_node_affinity == 1 )
             return -1;
     }
     else /* step == CSCHED_BALANCE_CPU_AFFINITY */
@@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def 
     .adjust         = csched_dom_cntl,
     .adjust_global  = csched_sys_cntl,
 
+    .set_node_affinity  = csched_set_node_affinity,
+
     .pick_cpu       = csched_cpu_pick,
     .do_schedule    = csched_schedule,
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c
     return ret;
 }
 
+void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
+{
+    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
+}
+
 int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 {
     cpumask_t online_affinity;
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
 
 
+/* Get/set the NUMA node(s) with which the guest has affinity with. */
+/* XEN_DOMCTL_setnodeaffinity */
+/* XEN_DOMCTL_getnodeaffinity */
+struct xen_domctl_nodeaffinity {
+    struct xenctl_bitmap nodemap;/* IN */
+};
+typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t);
+
+
 /* Get/set which physical cpus a vcpu can execute on. */
 /* XEN_DOMCTL_setvcpuaffinity */
 /* XEN_DOMCTL_getvcpuaffinity */
@@ -907,6 +917,8 @@ struct xen_domctl {
 #define XEN_DOMCTL_audit_p2m                     65
 #define XEN_DOMCTL_set_virq_handler              66
 #define XEN_DOMCTL_set_broken_page_p2m           67
+#define XEN_DOMCTL_setnodeaffinity               68
+#define XEN_DOMCTL_getnodeaffinity               69
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -920,6 +932,7 @@ struct xen_domctl {
         struct xen_domctl_getpageframeinfo  getpageframeinfo;
         struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
         struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
+        struct xen_domctl_nodeaffinity      nodeaffinity;
         struct xen_domctl_vcpuaffinity      vcpuaffinity;
         struct xen_domctl_shadow_op         shadow_op;
         struct xen_domctl_max_mem           max_mem;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -8,8 +8,9 @@
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these nodemasks are based.
  *
- * For details of nodemask_scnprintf() and nodemask_parse(),
- * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodemask_scnprintf(), nodelist_scnpintf() and
+ * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
+ * in lib/bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -50,6 +51,7 @@
  * unsigned long *nodes_addr(mask)	Array of unsigned long's in mask
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
@@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co
 
 #define nodes_addr(src) ((src).bits)
 
+#define nodelist_scnprintf(buf, len, src) \
+			__nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
 #if 0
 #define nodemask_scnprintf(buf, len, src) \
 			__nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -184,6 +184,8 @@ struct scheduler {
                                     struct xen_domctl_scheduler_op *);
     int          (*adjust_global)  (const struct scheduler *,
                                     struct xen_sysctl_scheduler_op *);
+    void         (*set_node_affinity) (const struct scheduler *,
+                                       struct domain *, nodemask_t *);
     void         (*dump_settings)  (const struct scheduler *);
     void         (*dump_cpu_state) (const struct scheduler *, int);
 
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -359,8 +359,12 @@ struct domain
     /* Various mem_events */
     struct mem_event_per_domain *mem_event;
 
-    /* Currently computed from union of all vcpu cpu-affinity masks. */
+    /*
+     * Can be specified by the user. If that is not the case, it is
+     * computed from the union of all the vcpu cpu-affinity masks.
+     */
     nodemask_t node_affinity;
+    int auto_node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 };
@@ -429,6 +433,7 @@ static inline void get_knownalive_domain
     ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
 void domain_update_node_affinity(struct domain *d);
 
 struct domain *domain_create(
@@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain 
 int sched_move_domain(struct domain *d, struct cpupool *c);
 long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
 long sched_adjust_global(struct xen_sysctl_scheduler_op *);
+void sched_set_node_affinity(struct domain *, nodemask_t *);
 int  sched_id(void);
 void sched_tick_suspend(void);
 void sched_tick_resume(void);
diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
--- a/xen/include/xsm/xsm.h
+++ b/xen/include/xsm/xsm.h
@@ -56,6 +56,7 @@ struct xsm_operations {
     int (*domain_create) (struct domain *d, u32 ssidref);
     int (*max_vcpus) (struct domain *d);
     int (*destroydomain) (struct domain *d);
+    int (*nodeaffinity) (int cmd, struct domain *d);
     int (*vcpuaffinity) (int cmd, struct domain *d);
     int (*scheduler) (struct domain *d);
     int (*getdomaininfo) (struct domain *d);
@@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str
     return xsm_call(destroydomain(d));
 }
 
+static inline int xsm_nodeaffinity (int cmd, struct domain *d)
+{
+    return xsm_call(nodeaffinity(cmd, d));
+}
+
 static inline int xsm_vcpuaffinity (int cmd, struct domain *d)
 {
     return xsm_call(vcpuaffinity(cmd, d));
diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
--- a/xen/xsm/dummy.c
+++ b/xen/xsm/dummy.c
@@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d
     return 0;
 }
 
+static int dummy_nodeaffinity (int cmd, struct domain *d)
+{
+    return 0;
+}
+
 static int dummy_vcpuaffinity (int cmd, struct domain *d)
 {
     return 0;
@@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation
     set_to_dummy_if_null(ops, domain_create);
     set_to_dummy_if_null(ops, max_vcpus);
     set_to_dummy_if_null(ops, destroydomain);
+    set_to_dummy_if_null(ops, nodeaffinity);
     set_to_dummy_if_null(ops, vcpuaffinity);
     set_to_dummy_if_null(ops, scheduler);
     set_to_dummy_if_null(ops, getdomaininfo);
diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
--- a/xen/xsm/flask/hooks.c
+++ b/xen/xsm/flask/hooks.c
@@ -521,17 +521,19 @@ static int flask_destroydomain(struct do
                            DOMAIN__DESTROY);
 }
 
-static int flask_vcpuaffinity(int cmd, struct domain *d)
+static int flask_affinity(int cmd, struct domain *d)
 {
     u32 perm;
 
     switch ( cmd )
     {
     case XEN_DOMCTL_setvcpuaffinity:
-        perm = DOMAIN__SETVCPUAFFINITY;
+    case XEN_DOMCTL_setnodeaffinity:
+        perm = DOMAIN__SETAFFINITY;
         break;
     case XEN_DOMCTL_getvcpuaffinity:
-        perm = DOMAIN__GETVCPUAFFINITY;
+    case XEN_DOMCTL_getnodeaffinity:
+        perm = DOMAIN__GETAFFINITY;
         break;
     default:
         return -EPERM;
@@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops =
     .domain_create = flask_domain_create,
     .max_vcpus = flask_max_vcpus,
     .destroydomain = flask_destroydomain,
-    .vcpuaffinity = flask_vcpuaffinity,
+    .nodeaffinity = flask_affinity,
+    .vcpuaffinity = flask_affinity,
     .scheduler = flask_scheduler,
     .getdomaininfo = flask_getdomaininfo,
     .getvcpucontext = flask_getvcpucontext,
diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h
--- a/xen/xsm/flask/include/av_perm_to_string.h
+++ b/xen/xsm/flask/include/av_perm_to_string.h
@@ -37,8 +37,8 @@
    S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition")
    S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus")
    S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy")
-   S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity")
-   S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity")
+   S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity")
+   S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity")
    S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler")
    S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo")
    S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo")
diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h
--- a/xen/xsm/flask/include/av_permissions.h
+++ b/xen/xsm/flask/include/av_permissions.h
@@ -38,8 +38,8 @@
 #define DOMAIN__TRANSITION                        0x00000020UL
 #define DOMAIN__MAX_VCPUS                         0x00000040UL
 #define DOMAIN__DESTROY                           0x00000080UL
-#define DOMAIN__SETVCPUAFFINITY                   0x00000100UL
-#define DOMAIN__GETVCPUAFFINITY                   0x00000200UL
+#define DOMAIN__SETAFFINITY                       0x00000100UL
+#define DOMAIN__GETAFFINITY                       0x00000200UL
 #define DOMAIN__SCHEDULER                         0x00000400UL
 #define DOMAIN__GETDOMAININFO                     0x00000800UL
 #define DOMAIN__GETVCPUINFO                       0x00001000UL

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2012-12-21 15:17   ` George Dunlap
  2012-12-21 16:17     ` Dario Faggioli
  2013-01-03 16:05     ` Daniel De Graaf
  0 siblings, 2 replies; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:17 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Make it possible to pass the node-affinity of a domain to the hypervisor
> from the upper layers, instead of always being computed automatically.
>
> Note that this also required generalizing the Flask hooks for setting
> and getting the affinity, so that they now deal with both vcpu and
> node affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

I can't comment on the XSM stuff -- is any part of the "getvcpuaffinity" 
stuff for XSM a public interface that needs to be backwards-compatible?  
I.e., is s/vcpu//; OK from an interface point of view?

WRT everything else:
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
> Changes from v1:
>   * added the missing dummy hook for nodeaffinity;
>   * let the permission renaming affect flask policies too.
>
> diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors
> --- a/tools/flask/policy/policy/flask/access_vectors
> +++ b/tools/flask/policy/policy/flask/access_vectors
> @@ -47,8 +47,8 @@ class domain
>       transition
>       max_vcpus
>       destroy
> -    setvcpuaffinity
> -       getvcpuaffinity
> +    setaffinity
> +       getaffinity
>          scheduler
>          getdomaininfo
>          getvcpuinfo
> diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls
> --- a/tools/flask/policy/policy/mls
> +++ b/tools/flask/policy/policy/mls
> @@ -70,11 +70,11 @@ mlsconstrain domain transition
>          (( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv)));
>
>   # all the domain "read" ops
> -mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
> +mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext }
>          ((l1 dom l2) or (t1 == mls_priv));
>
>   # all the domain "write" ops
> -mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
> +mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext }
>          ((l1 eq l2) or (t1 == mls_priv));
>
>   # This is incomplete - similar constraints must be written for all classes
> diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if
> --- a/tools/flask/policy/policy/modules/xen/xen.if
> +++ b/tools/flask/policy/policy/modules/xen/xen.if
> @@ -55,9 +55,9 @@ define(`create_domain_build_label', `
>   # manage_domain(priv, target)
>   #   Allow managing a running domain
>   define(`manage_domain', `
> -       allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity
> +       allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity
>                          getaddrsize pause unpause trigger shutdown destroy
> -                       setvcpuaffinity setdomainmaxmem };
> +                       setaffinity setdomainmaxmem };
>   ')
>
>   # migrate_domain_out(priv, target)
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -222,6 +222,7 @@ struct domain *domain_create(
>
>       spin_lock_init(&d->node_affinity_lock);
>       d->node_affinity = NODE_MASK_ALL;
> +    d->auto_node_affinity = 1;
>
>       spin_lock_init(&d->shutdown_lock);
>       d->shutdown_code = -1;
> @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct
>           cpumask_or(cpumask, cpumask, online_affinity);
>       }
>
> -    for_each_online_node ( node )
> -        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -            node_set(node, nodemask);
> +    if ( d->auto_node_affinity )
> +    {
> +        /* Node-affinity is automaically computed from all vcpu-affinities */
> +        for_each_online_node ( node )
> +            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_set(node, nodemask);
>
> -    d->node_affinity = nodemask;
> +        d->node_affinity = nodemask;
> +    }
> +    else
> +    {
> +        /* Node-affinity is provided by someone else, just filter out cpus
> +         * that are either offline or not in the affinity of any vcpus. */
> +        for_each_node_mask ( node, d->node_affinity )
> +            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> +                node_clear(node, d->node_affinity);
> +    }
> +
> +    sched_set_node_affinity(d, &d->node_affinity);
> +
>       spin_unlock(&d->node_affinity_lock);
>
>       free_cpumask_var(online_affinity);
> @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct
>   }
>
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
> +{
> +    /* Being affine with no nodes is just wrong */
> +    if ( nodes_empty(*affinity) )
> +        return -EINVAL;
> +
> +    spin_lock(&d->node_affinity_lock);
> +
> +    /*
> +     * Being/becoming explicitly affine to all nodes is not particularly
> +     * useful. Let's take it as the `reset node affinity` command.
> +     */
> +    if ( nodes_full(*affinity) )
> +    {
> +        d->auto_node_affinity = 1;
> +        goto out;
> +    }
> +
> +    d->auto_node_affinity = 0;
> +    d->node_affinity = *affinity;
> +
> +out:
> +    spin_unlock(&d->node_affinity_lock);
> +
> +    domain_update_node_affinity(d);
> +
> +    return 0;
> +}
> +
> +
>   struct domain *get_domain_by_id(domid_t dom)
>   {
>       struct domain *d;
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe
>       }
>       break;
>
> +    case XEN_DOMCTL_setnodeaffinity:
> +    case XEN_DOMCTL_getnodeaffinity:
> +    {
> +        domid_t dom = op->domain;
> +        struct domain *d = rcu_lock_domain_by_id(dom);
> +
> +        ret = -ESRCH;
> +        if ( d == NULL )
> +            break;
> +
> +        ret = xsm_nodeaffinity(op->cmd, d);
> +        if ( ret )
> +            goto nodeaffinity_out;
> +
> +        if ( op->cmd == XEN_DOMCTL_setnodeaffinity )
> +        {
> +            nodemask_t new_affinity;
> +
> +            ret = xenctl_bitmap_to_nodemask(&new_affinity,
> +                                            &op->u.nodeaffinity.nodemap);
> +            if ( !ret )
> +                ret = domain_set_node_affinity(d, &new_affinity);
> +        }
> +        else
> +        {
> +            ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap,
> +                                            &d->node_affinity);
> +        }
> +
> +    nodeaffinity_out:
> +        rcu_unlock_domain(d);
> +    }
> +    break;
> +
>       case XEN_DOMCTL_setvcpuaffinity:
>       case XEN_DOMCTL_getvcpuaffinity:
>       {
> diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
> --- a/xen/common/keyhandler.c
> +++ b/xen/common/keyhandler.c
> @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int
>       *set++ = '\0';
>   }
>
> +static void nodeset_print(char *set, int size, const nodemask_t *mask)
> +{
> +    *set++ = '[';
> +    set += nodelist_scnprintf(set, size-2, mask);
> +    *set++ = ']';
> +    *set++ = '\0';
> +}
> +
>   static void periodic_timer_print(char *str, int size, uint64_t period)
>   {
>       if ( period == 0 )
> @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
>
>           dump_pageframe_info(d);
>
> +        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
> +        printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr);
> +
>           printk("VCPU information and callbacks for domain %u:\n",
>                  d->domain_id);
>           for_each_vcpu ( d, v )
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -269,6 +269,33 @@ static inline void
>       list_del_init(&svc->runq_elem);
>   }
>
> +/*
> + * Translates node-affinity mask into a cpumask, so that we can use it during
> + * actual scheduling. That of course will contain all the cpus from all the
> + * set nodes in the original node-affinity mask.
> + *
> + * Note that any serialization needed to access mask safely is complete
> + * responsibility of the caller of this function/hook.
> + */
> +static void csched_set_node_affinity(
> +    const struct scheduler *ops,
> +    struct domain *d,
> +    nodemask_t *mask)
> +{
> +    struct csched_dom *sdom;
> +    int node;
> +
> +    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> +    if ( unlikely(is_idle_domain(d)) )
> +        return;
> +
> +    sdom = CSCHED_DOM(d);
> +    cpumask_clear(sdom->node_affinity_cpumask);
> +    for_each_node_mask( node, *mask )
> +        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> +                   &node_to_cpumask(node));
> +}
> +
>   #define for_each_csched_balance_step(__step) \
>       for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
>
> @@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu
>
>           cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity);
>
> -        if ( cpumask_full(sdom->node_affinity_cpumask) )
> +        if ( cpumask_full(sdom->node_affinity_cpumask) ||
> +             d->auto_node_affinity == 1 )
>               return -1;
>       }
>       else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> @@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def
>       .adjust         = csched_dom_cntl,
>       .adjust_global  = csched_sys_cntl,
>
> +    .set_node_affinity  = csched_set_node_affinity,
> +
>       .pick_cpu       = csched_cpu_pick,
>       .do_schedule    = csched_schedule,
>
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c
>       return ret;
>   }
>
> +void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
> +{
> +    SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
> +}
> +
>   int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
>   {
>       cpumask_t online_affinity;
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
>
>
> +/* Get/set the NUMA node(s) with which the guest has affinity with. */
> +/* XEN_DOMCTL_setnodeaffinity */
> +/* XEN_DOMCTL_getnodeaffinity */
> +struct xen_domctl_nodeaffinity {
> +    struct xenctl_bitmap nodemap;/* IN */
> +};
> +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t);
> +
> +
>   /* Get/set which physical cpus a vcpu can execute on. */
>   /* XEN_DOMCTL_setvcpuaffinity */
>   /* XEN_DOMCTL_getvcpuaffinity */
> @@ -907,6 +917,8 @@ struct xen_domctl {
>   #define XEN_DOMCTL_audit_p2m                     65
>   #define XEN_DOMCTL_set_virq_handler              66
>   #define XEN_DOMCTL_set_broken_page_p2m           67
> +#define XEN_DOMCTL_setnodeaffinity               68
> +#define XEN_DOMCTL_getnodeaffinity               69
>   #define XEN_DOMCTL_gdbsx_guestmemio            1000
>   #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>   #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -920,6 +932,7 @@ struct xen_domctl {
>           struct xen_domctl_getpageframeinfo  getpageframeinfo;
>           struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
>           struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
> +        struct xen_domctl_nodeaffinity      nodeaffinity;
>           struct xen_domctl_vcpuaffinity      vcpuaffinity;
>           struct xen_domctl_shadow_op         shadow_op;
>           struct xen_domctl_max_mem           max_mem;
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -8,8 +8,9 @@
>    * See detailed comments in the file linux/bitmap.h describing the
>    * data type on which these nodemasks are based.
>    *
> - * For details of nodemask_scnprintf() and nodemask_parse(),
> - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
> + * For details of nodemask_scnprintf(), nodelist_scnpintf() and
> + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
> + * in lib/bitmap.c.
>    *
>    * The available nodemask operations are:
>    *
> @@ -50,6 +51,7 @@
>    * unsigned long *nodes_addr(mask)     Array of unsigned long's in mask
>    *
>    * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
> + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing
>    * int nodemask_parse(ubuf, ulen, mask)        Parse ascii string as nodemask
>    *
>    * for_each_node_mask(node, mask)      for-loop node over mask
> @@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co
>
>   #define nodes_addr(src) ((src).bits)
>
> +#define nodelist_scnprintf(buf, len, src) \
> +                       __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
> +static inline int __nodelist_scnprintf(char *buf, int len,
> +                                       const nodemask_t *srcp, int nbits)
> +{
> +       return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
> +}
> +
>   #if 0
>   #define nodemask_scnprintf(buf, len, src) \
>                          __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
> diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
> --- a/xen/include/xen/sched-if.h
> +++ b/xen/include/xen/sched-if.h
> @@ -184,6 +184,8 @@ struct scheduler {
>                                       struct xen_domctl_scheduler_op *);
>       int          (*adjust_global)  (const struct scheduler *,
>                                       struct xen_sysctl_scheduler_op *);
> +    void         (*set_node_affinity) (const struct scheduler *,
> +                                       struct domain *, nodemask_t *);
>       void         (*dump_settings)  (const struct scheduler *);
>       void         (*dump_cpu_state) (const struct scheduler *, int);
>
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -359,8 +359,12 @@ struct domain
>       /* Various mem_events */
>       struct mem_event_per_domain *mem_event;
>
> -    /* Currently computed from union of all vcpu cpu-affinity masks. */
> +    /*
> +     * Can be specified by the user. If that is not the case, it is
> +     * computed from the union of all the vcpu cpu-affinity masks.
> +     */
>       nodemask_t node_affinity;
> +    int auto_node_affinity;
>       unsigned int last_alloc_node;
>       spinlock_t node_affinity_lock;
>   };
> @@ -429,6 +433,7 @@ static inline void get_knownalive_domain
>       ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
>   }
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
>   void domain_update_node_affinity(struct domain *d);
>
>   struct domain *domain_create(
> @@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain
>   int sched_move_domain(struct domain *d, struct cpupool *c);
>   long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *);
>   long sched_adjust_global(struct xen_sysctl_scheduler_op *);
> +void sched_set_node_affinity(struct domain *, nodemask_t *);
>   int  sched_id(void);
>   void sched_tick_suspend(void);
>   void sched_tick_resume(void);
> diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h
> --- a/xen/include/xsm/xsm.h
> +++ b/xen/include/xsm/xsm.h
> @@ -56,6 +56,7 @@ struct xsm_operations {
>       int (*domain_create) (struct domain *d, u32 ssidref);
>       int (*max_vcpus) (struct domain *d);
>       int (*destroydomain) (struct domain *d);
> +    int (*nodeaffinity) (int cmd, struct domain *d);
>       int (*vcpuaffinity) (int cmd, struct domain *d);
>       int (*scheduler) (struct domain *d);
>       int (*getdomaininfo) (struct domain *d);
> @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str
>       return xsm_call(destroydomain(d));
>   }
>
> +static inline int xsm_nodeaffinity (int cmd, struct domain *d)
> +{
> +    return xsm_call(nodeaffinity(cmd, d));
> +}
> +
>   static inline int xsm_vcpuaffinity (int cmd, struct domain *d)
>   {
>       return xsm_call(vcpuaffinity(cmd, d));
> diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c
> --- a/xen/xsm/dummy.c
> +++ b/xen/xsm/dummy.c
> @@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d
>       return 0;
>   }
>
> +static int dummy_nodeaffinity (int cmd, struct domain *d)
> +{
> +    return 0;
> +}
> +
>   static int dummy_vcpuaffinity (int cmd, struct domain *d)
>   {
>       return 0;
> @@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation
>       set_to_dummy_if_null(ops, domain_create);
>       set_to_dummy_if_null(ops, max_vcpus);
>       set_to_dummy_if_null(ops, destroydomain);
> +    set_to_dummy_if_null(ops, nodeaffinity);
>       set_to_dummy_if_null(ops, vcpuaffinity);
>       set_to_dummy_if_null(ops, scheduler);
>       set_to_dummy_if_null(ops, getdomaininfo);
> diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c
> --- a/xen/xsm/flask/hooks.c
> +++ b/xen/xsm/flask/hooks.c
> @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do
>                              DOMAIN__DESTROY);
>   }
>
> -static int flask_vcpuaffinity(int cmd, struct domain *d)
> +static int flask_affinity(int cmd, struct domain *d)
>   {
>       u32 perm;
>
>       switch ( cmd )
>       {
>       case XEN_DOMCTL_setvcpuaffinity:
> -        perm = DOMAIN__SETVCPUAFFINITY;
> +    case XEN_DOMCTL_setnodeaffinity:
> +        perm = DOMAIN__SETAFFINITY;
>           break;
>       case XEN_DOMCTL_getvcpuaffinity:
> -        perm = DOMAIN__GETVCPUAFFINITY;
> +    case XEN_DOMCTL_getnodeaffinity:
> +        perm = DOMAIN__GETAFFINITY;
>           break;
>       default:
>           return -EPERM;
> @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops =
>       .domain_create = flask_domain_create,
>       .max_vcpus = flask_max_vcpus,
>       .destroydomain = flask_destroydomain,
> -    .vcpuaffinity = flask_vcpuaffinity,
> +    .nodeaffinity = flask_affinity,
> +    .vcpuaffinity = flask_affinity,
>       .scheduler = flask_scheduler,
>       .getdomaininfo = flask_getdomaininfo,
>       .getvcpucontext = flask_getvcpucontext,
> diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h
> --- a/xen/xsm/flask/include/av_perm_to_string.h
> +++ b/xen/xsm/flask/include/av_perm_to_string.h
> @@ -37,8 +37,8 @@
>      S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition")
>      S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus")
>      S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy")
> -   S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity")
> -   S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity")
> +   S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity")
> +   S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity")
>      S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler")
>      S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo")
>      S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo")
> diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h
> --- a/xen/xsm/flask/include/av_permissions.h
> +++ b/xen/xsm/flask/include/av_permissions.h
> @@ -38,8 +38,8 @@
>   #define DOMAIN__TRANSITION                        0x00000020UL
>   #define DOMAIN__MAX_VCPUS                         0x00000040UL
>   #define DOMAIN__DESTROY                           0x00000080UL
> -#define DOMAIN__SETVCPUAFFINITY                   0x00000100UL
> -#define DOMAIN__GETVCPUAFFINITY                   0x00000200UL
> +#define DOMAIN__SETAFFINITY                       0x00000100UL
> +#define DOMAIN__GETAFFINITY                       0x00000200UL
>   #define DOMAIN__SCHEDULER                         0x00000400UL
>   #define DOMAIN__GETDOMAININFO                     0x00000800UL
>   #define DOMAIN__GETVCPUINFO                       0x00001000UL

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-21 15:17   ` George Dunlap
@ 2012-12-21 16:17     ` Dario Faggioli
  2013-01-03 16:05     ` Daniel De Graaf
  1 sibling, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:17 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1591 bytes --]

On Fri, 2012-12-21 at 15:17 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > Make it possible to pass the node-affinity of a domain to the hypervisor
> > from the upper layers, instead of always being computed automatically.
> >
> > Note that this also required generalizing the Flask hooks for setting
> > and getting the affinity, so that they now deal with both vcpu and
> > node affinity.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> I can't comment on the XSM stuff 
>
Right, it's the part I'm most week on too... Daniel had a go with this
parch during v1's review, and he raised a couple of points that I think
I addressed, let's see if he thinks there's anything else.

> -- is any part of the "getvcpuaffinity" 
> stuff for XSM a public interface that needs to be backwards-compatible?  
> I.e., is s/vcpu//; OK from an interface point of view?
> 
Mmm... Good point, I haven't thought about that. This was here in v1 and
Daniel explicitly said he was fine with the renaming instead of adding a
new hook, but mostly for the "semantic" point of view, not sure whether
bkwrd compatibility is an issue... Daniel, what do you think?

> WRT everything else:
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
Ok. Thanks,
Dario


<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
  2012-12-21 15:17   ` George Dunlap
  2012-12-21 16:17     ` Dario Faggioli
@ 2013-01-03 16:05     ` Daniel De Graaf
  1 sibling, 0 replies; 57+ messages in thread
From: Daniel De Graaf @ 2013-01-03 16:05 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Dario Faggioli, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Matt Wilson, Juergen Gross

On 12/21/2012 10:17 AM, George Dunlap wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
>> Make it possible to pass the node-affinity of a domain to the hypervisor
>> from the upper layers, instead of always being computed automatically.
>>
>> Note that this also required generalizing the Flask hooks for setting
>> and getting the affinity, so that they now deal with both vcpu and
>> node affinity.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> I can't comment on the XSM stuff -- is any part of the "getvcpuaffinity" stuff for XSM a public interface that needs to be backwards-compatible?  I.e., is s/vcpu//; OK from an interface point of view?
> 
> WRT everything else:
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

It is an interface used only by the XSM policy itself, which is already
going to have non-backwards-compatible changes in 4.3 due to IS_PRIV
reworking and adding new hooks.  The FLASK policy in Xen has not supported
loading policies that do not exactly match the hypervisor's access vectors
because the hypervisor policy is still maintained in the same source code
tree as the hypervisor, so I would consider this similar to the compatibility
between libxc/libxl and the hypervisor rather than trying for the same level
of compatibility that Linux provides for SELinux policies.

A quick grep of xen-unstable finds one instance of getvcpuaffinity in xen.te
that needs to be changed to getaffinity; with that:
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

>> ---
>> Changes from v1:
>>   * added the missing dummy hook for nodeaffinity;
>>   * let the permission renaming affect flask policies too.
>>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (3 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:19   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By providing the proper get/set interface and wiring them
to the new domctl-s from the previous commit.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
 }
 
 
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    memcpy(local, nodemap, nodesize);
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domid,
+                               xc_nodemap_t nodemap)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
+    domctl.domain = (domid_t)domid;
+
+    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
+    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    memcpy(nodemap, local, nodesize);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
 		uint32_t id,
 		uint32_t timeout);
 
+/**
+ * This function explicitly sets the host NUMA nodes the domain will
+ * have affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to set the affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_setaffinity(xc_interface *xch,
+                               uint32_t domind,
+                               xc_nodemap_t nodemap);
+
+/**
+ * This function retrieves the host NUMA nodes the domain has
+ * affinity with.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id one wants to get the node affinity of.
+ * @parm nodemap the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_getaffinity(xc_interface *xch,
+                               uint32_t domind,
+                               xc_nodemap_t nodemap);
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
@ 2012-12-21 15:19   ` George Dunlap
  2012-12-21 16:27     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:19 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> By providing the proper get/set interface and wiring them
> to the new domctl-s from the previous commit.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

I haven't done a detailed review, but everything looks OK:

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
>   }
>   
>   
> +int xc_domain_node_setaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    memcpy(local, nodemap, nodesize);
> +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
> +int xc_domain_node_getaffinity(xc_interface *xch,
> +                               uint32_t domid,
> +                               xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    memcpy(nodemap, local, nodesize);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
>   		uint32_t id,
>   		uint32_t timeout);
>   
> +/**
> + * This function explicitly sets the host NUMA nodes the domain will
> + * have affinity with.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one wants to set the affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_setaffinity(xc_interface *xch,
> +                               uint32_t domind,
> +                               xc_nodemap_t nodemap);
> +
> +/**
> + * This function retrieves the host NUMA nodes the domain has
> + * affinity with.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one wants to get the node affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_getaffinity(xc_interface *xch,
> +                               uint32_t domind,
> +                               xc_nodemap_t nodemap);
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
  2012-12-21 15:19   ` George Dunlap
@ 2012-12-21 16:27     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:27 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 5417 bytes --]

On Fri, 2012-12-21 at 15:19 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > By providing the proper get/set interface and wiring them
> > to the new domctl-s from the previous commit.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> 
> I haven't done a detailed review, but everything looks OK:
> 
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
Ok. Also, let me use this e-mail to thank you properly for the thorough,
useful and especially quick review! All you said was really useful, I'll
do my best do address the points you raised in a proper and equally
quick manner.

See you in 2013 (I'm going on vacation today) with v3!  :-D

Thanks again and Regards,
Dario

> >
> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > --- a/tools/libxc/xc_domain.c
> > +++ b/tools/libxc/xc_domain.c
> > @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch
> >   }
> >   
> >   
> > +int xc_domain_node_setaffinity(xc_interface *xch,
> > +                               uint32_t domid,
> > +                               xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if ( local == NULL )
> > +    {
> > +        PERROR("Could not allocate memory for setnodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_setnodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +
> > +    memcpy(local, nodemap, nodesize);
> > +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> > +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> > +
> > +    ret = do_domctl(xch, &domctl);
> > +
> > +    xc_hypercall_buffer_free(xch, local);
> > +
> > + out:
> > +    return ret;
> > +}
> > +
> > +int xc_domain_node_getaffinity(xc_interface *xch,
> > +                               uint32_t domid,
> > +                               xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if ( local == NULL )
> > +    {
> > +        PERROR("Could not allocate memory for getnodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_getnodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +
> > +    set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local);
> > +    domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8;
> > +
> > +    ret = do_domctl(xch, &domctl);
> > +
> > +    memcpy(nodemap, local, nodesize);
> > +
> > +    xc_hypercall_buffer_free(xch, local);
> > +
> > + out:
> > +    return ret;
> > +}
> > +
> >   int xc_vcpu_setaffinity(xc_interface *xch,
> >                           uint32_t domid,
> >                           int vcpu,
> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > --- a/tools/libxc/xenctrl.h
> > +++ b/tools/libxc/xenctrl.h
> > @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch,
> >   		uint32_t id,
> >   		uint32_t timeout);
> >   
> > +/**
> > + * This function explicitly sets the host NUMA nodes the domain will
> > + * have affinity with.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one wants to set the affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
> > + */
> > +int xc_domain_node_setaffinity(xc_interface *xch,
> > +                               uint32_t domind,
> > +                               xc_nodemap_t nodemap);
> > +
> > +/**
> > + * This function retrieves the host NUMA nodes the domain has
> > + * affinity with.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one wants to get the node affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
> > + */
> > +int xc_domain_node_getaffinity(xc_interface *xch,
> > +                               uint32_t domind,
> > +                               xc_nodemap_t nodemap);
> > +
> >   int xc_vcpu_setaffinity(xc_interface *xch,
> >                           uint32_t domid,
> >                           int vcpu,
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (4 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 15:30   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

By introducing a nodemap in libxl_domain_build_info and
providing the get/set methods to deal with it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap)
+{
+    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
                            libxl_bitmap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
                                unsigned int max_vcpus, libxl_bitmap *cpumap);
+int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
+int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
+                                  libxl_bitmap *nodemap);
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
 
     libxl_defbool_setdefault(&b_info->numa_placement, true);
 
+    if (!b_info->nodemap.size) {
+        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
+            return ERROR_FAIL;
+        libxl_bitmap_set_any(&b_info->nodemap);
+    }
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
         if (rc)
             return rc;
     }
+    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
 
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
     ("cpumap",          libxl_bitmap),
+    ("nodemap",         libxl_bitmap),
     ("numa_placement",  libxl_defbool),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
@ 2012-12-21 15:30   ` George Dunlap
  2012-12-21 16:18     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 15:30 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> By introducing a nodemap in libxl_domain_build_info and
> providing the get/set methods to deal with it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

I think you'll probably need to add a line like the following:

#define LIBXL_HAVE_NODEAFFINITY 1

So that people wanting to build against different versions of the 
library can behave appropriately.  But IanC or IanJ would be the final 
word on that, I think.

  -George

>
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
>       return rc;
>   }
>   
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap)
> +{
> +    if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap)
>   {
>       GC_INIT(ctx);
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
>                              libxl_bitmap *cpumap);
>   int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
>                                  unsigned int max_vcpus, libxl_bitmap *cpumap);
> +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
> +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid,
> +                                  libxl_bitmap *nodemap);
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap);
>   
>   libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault(
>   
>       libxl_defbool_setdefault(&b_info->numa_placement, true);
>   
> +    if (!b_info->nodemap.size) {
> +        if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0))
> +            return ERROR_FAIL;
> +        libxl_bitmap_set_any(&b_info->nodemap);
> +    }
> +
>       if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>           b_info->max_memkb = 32 * 1024;
>       if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>           if (rc)
>               return rc;
>       }
> +    libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
>       libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap);
>   
>       xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain
>       ("max_vcpus",       integer),
>       ("avail_vcpus",     libxl_bitmap),
>       ("cpumap",          libxl_bitmap),
> +    ("nodemap",         libxl_bitmap),
>       ("numa_placement",  libxl_defbool),
>       ("tsc_mode",        libxl_tsc_mode),
>       ("max_memkb",       MemKB),

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 15:30   ` George Dunlap
@ 2012-12-21 16:18     ` Dario Faggioli
  2012-12-21 17:02       ` Ian Jackson
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:18 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 1077 bytes --]

On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > By introducing a nodemap in libxl_domain_build_info and
> > providing the get/set methods to deal with it.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> 
> I think you'll probably need to add a line like the following:
> 
> #define LIBXL_HAVE_NODEAFFINITY 1
> 
> So that people wanting to build against different versions of the 
> library can behave appropriately.  
>
I see what you mean.

> But IanC or IanJ would be the final 
> word on that, I think.
> 
Well, let's see if they want to comment. In any case, it shouldn't be a
big deal to add this. Thanks for pointing out.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 16:18     ` Dario Faggioli
@ 2012-12-21 17:02       ` Ian Jackson
  2012-12-21 17:09         ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: Ian Jackson @ 2012-12-21 17:02 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

Dario Faggioli writes ("Re: [Xen-devel] [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity"):
> On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> > I think you'll probably need to add a line like the following:
> > 
> > #define LIBXL_HAVE_NODEAFFINITY 1
> > 
> > So that people wanting to build against different versions of the 
> > library can behave appropriately.  
> >
> I see what you mean.

I think this is a good idea.

I'm afraid I won't be able to do proper justice to your series until
the new year.

Ian.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
  2012-12-21 17:02       ` Ian Jackson
@ 2012-12-21 17:09         ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 17:09 UTC (permalink / raw)
  To: Ian Jackson
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 981 bytes --]

On Fri, 2012-12-21 at 17:02 +0000, Ian Jackson wrote: 
> > On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote: 
> > > I think you'll probably need to add a line like the following:
> > > 
> > > #define LIBXL_HAVE_NODEAFFINITY 1
> > > 
> > > So that people wanting to build against different versions of the 
> > > library can behave appropriately.  
> > >
> > I see what you mean.
> 
> I think this is a good idea.
> 
Ok then.

> I'm afraid I won't be able to do proper justice to your series until
> the new year.
> 
Yep, I now you're away, and that's fine, my bad I couldn't post it
earlier. :-)

I'll be happy to see your comments when you're back.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (5 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-20  8:41   ` Ian Campbell
  2012-12-21 16:00   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
                   ` (4 subsequent siblings)
  11 siblings, 2 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

For choosing the best NUMA placement candidate, we need to figure out
how many VCPUs are runnable on each of them. That requires going through
all the VCPUs of all the domains and check their affinities.

With this change, instead of doing the above for each candidate, we
do it once for all, populating an array while counting. This way, when
we later are evaluating candidates, all we need is summing up the right
elements of the array itself.

This reduces the complexity of the overall algorithm, as it moves a
potentially expensive operation (for_each_vcpu_of_each_domain {})
outside from the core placement loop, so that it is performed only
once instead of (potentially) tens or hundreds of times.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -165,15 +165,27 @@ static uint32_t nodemap_to_free_memkb(li
     return free_memkb;
 }
 
-/* Retrieve the number of vcpus able to run on the cpus of the nodes
- * that are part of the nodemap. */
-static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo,
+/* Retrieve the number of vcpus able to run on the nodes in nodemap */
+static int nodemap_to_nr_vcpus(libxl__gc *gc, int vcpus_on_node[],
                                const libxl_bitmap *nodemap)
 {
+    int i, nr_vcpus = 0;
+
+    libxl_for_each_set_bit(i, *nodemap)
+        nr_vcpus += vcpus_on_node[i];
+
+    return nr_vcpus;
+}
+
+/* Number of vcpus able to run on the cpus of the various nodes
+ * (reported by filling the array vcpus_on_node[]). */
+static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo,
+                             const libxl_bitmap *suitable_cpumap,
+                             int vcpus_on_node[])
+{
     libxl_dominfo *dinfo = NULL;
     libxl_bitmap vcpu_nodemap;
     int nr_doms, nr_cpus;
-    int nr_vcpus = 0;
     int i, j, k;
 
     dinfo = libxl_list_domain(CTX, &nr_doms);
@@ -193,19 +205,17 @@ static int nodemap_to_nr_vcpus(libxl__gc
         if (vinfo == NULL)
             continue;
 
-        /* For each vcpu of each domain ... */
         for (j = 0; j < nr_dom_vcpus; j++) {
+            /* For each vcpu of each domain, increment the elements of
+             * the array corresponding to the nodes where the vcpu runs */
+            libxl_bitmap_set_none(&vcpu_nodemap);
+            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
+                int node = tinfo[k].node;
 
-            /* Build up a map telling on which nodes the vcpu is runnable on */
-            libxl_bitmap_set_none(&vcpu_nodemap);
-            libxl_for_each_set_bit(k, vinfo[j].cpumap)
-                libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
-
-            /* And check if that map has any intersection with our nodemap */
-            libxl_for_each_set_bit(k, vcpu_nodemap) {
-                if (libxl_bitmap_test(nodemap, k)) {
-                    nr_vcpus++;
-                    break;
+                if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    !libxl_bitmap_test(&vcpu_nodemap, node)) {
+                    libxl_bitmap_set(&vcpu_nodemap, node);
+                    vcpus_on_node[node]++;
                 }
             }
         }
@@ -215,7 +225,7 @@ static int nodemap_to_nr_vcpus(libxl__gc
 
     libxl_bitmap_dispose(&vcpu_nodemap);
     libxl_dominfo_list_free(dinfo, nr_doms);
-    return nr_vcpus;
+    return 0;
 }
 
 /*
@@ -270,7 +280,7 @@ int libxl__get_numa_candidate(libxl__gc 
     libxl_numainfo *ninfo = NULL;
     int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0;
     libxl_bitmap suitable_nodemap, nodemap;
-    int rc = 0;
+    int *vcpus_on_node, rc = 0;
 
     libxl_bitmap_init(&nodemap);
     libxl_bitmap_init(&suitable_nodemap);
@@ -281,6 +291,8 @@ int libxl__get_numa_candidate(libxl__gc 
     if (ninfo == NULL)
         return ERROR_FAIL;
 
+    GCNEW_ARRAY(vcpus_on_node, nr_nodes);
+
     /*
      * The good thing about this solution is that it is based on heuristics
      * (implemented in numa_cmpf() ), but we at least can evaluate it on
@@ -330,6 +342,19 @@ int libxl__get_numa_candidate(libxl__gc 
         goto out;
 
     /*
+     * Later on, we will try to figure out how many vcpus are runnable on
+     * each candidate (as a part of choosing the best one of them). That
+     * requires going through all the vcpus of all the domains and check
+     * their affinities. So, instead of doing that for each candidate,
+     * let's count here the number of vcpus runnable on each node, so that
+     * all we have to do later is summing up the right elements of the
+     * vcpus_on_node array.
+     */
+    rc = nr_vcpus_on_nodes(gc, tinfo, suitable_cpumap, vcpus_on_node);
+    if (rc)
+        goto out;
+
+    /*
      * If the minimum number of NUMA nodes is not explicitly specified
      * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes
      * from where to start generating candidates, if possible (or just start
@@ -414,7 +439,8 @@ int libxl__get_numa_candidate(libxl__gc 
              * current best one (if any).
              */
             libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap);
-            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap);
+            new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, vcpus_on_node,
+                                                    &nodemap);
             new_cndt.free_memkb = nodes_free_memkb;
             new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap);
             new_cndt.nr_cpus = nodes_cpus;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2012-12-20  8:41   ` Ian Campbell
  2012-12-20  9:24     ` Dario Faggioli
  2012-12-21 16:00   ` George Dunlap
  1 sibling, 1 reply; 57+ messages in thread
From: Ian Campbell @ 2012-12-20  8:41 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf

On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> This reduces the complexity of the overall algorithm, as it moves a 

What was/is the complexity before/after this change? ISTR it was O(n^2)
or something along those lines before.

Ian.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-20  8:41   ` Ian Campbell
@ 2012-12-20  9:24     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-20  9:24 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 1358 bytes --]

On Thu, 2012-12-20 at 08:41 +0000, Ian Campbell wrote: 
> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > This reduces the complexity of the overall algorithm, as it moves a 
> 
> What was/is the complexity before/after this change? ISTR it was O(n^2)
> or something along those lines before.
> 
Yes and no. Let me try to explain. Counting the number of vCPUs that can
run on (a set) of node(s) was and remains O(n_domains*n_domain_vcpus),
so, yes, sort of quadratic.

The upper bound of the number of candidates evaluated by the placement
algorithm is exponential in the number of NUMA nodes: O(2^n_nodes).

Before this change, we counted the number of vCPUs runnable on each
candidate during each step, so the overall complexity was:

O(2^n_nodes) * O(n_domains*n_domain_vcpus)

In this change I count the number of vCPUs runnable on each candidate
only once, and that happens outside the candidate generation loop, so
the overall complexity is:

O(n_domains*n_domains_vcpus) + O(2^n_nodes) = O(2^n_nodes)

Did I answer your question?

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
  2012-12-20  8:41   ` Ian Campbell
@ 2012-12-21 16:00   ` George Dunlap
  2012-12-21 16:23     ` Dario Faggioli
  1 sibling, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:00 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> For choosing the best NUMA placement candidate, we need to figure out
> how many VCPUs are runnable on each of them. That requires going through
> all the VCPUs of all the domains and check their affinities.
>
> With this change, instead of doing the above for each candidate, we
> do it once for all, populating an array while counting. This way, when
> we later are evaluating candidates, all we need is summing up the right
> elements of the array itself.
>
> This reduces the complexity of the overall algorithm, as it moves a
> potentially expensive operation (for_each_vcpu_of_each_domain {})
> outside from the core placement loop, so that it is performed only
> once instead of (potentially) tens or hundreds of times.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

You know this code best. :-)  I've looked it over and just have one 
minor suggestion:

>           for (j = 0; j < nr_dom_vcpus; j++) {
> +            /* For each vcpu of each domain, increment the elements of
> +             * the array corresponding to the nodes where the vcpu runs */
> +            libxl_bitmap_set_none(&vcpu_nodemap);
> +            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> +                int node = tinfo[k].node;

I think I might rename "vcpu_nodemap" to something that suggests better 
how it fits with the algorithm -- for instance, "counted_nodemap" or 
"nodes_counted" -- something to suggest that this is how we avoid 
counting the same vcpu on the same node multiple times.

  -George

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
  2012-12-21 16:00   ` George Dunlap
@ 2012-12-21 16:23     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:23 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 2333 bytes --]

On Fri, 2012-12-21 at 16:00 +0000, George Dunlap wrote: 
> On 19/12/12 19:07, Dario Faggioli wrote:
> > For choosing the best NUMA placement candidate, we need to figure out
> > how many VCPUs are runnable on each of them. That requires going through
> > all the VCPUs of all the domains and check their affinities.
> >
> > With this change, instead of doing the above for each candidate, we
> > do it once for all, populating an array while counting. This way, when
> > we later are evaluating candidates, all we need is summing up the right
> > elements of the array itself.
> >
> > This reduces the complexity of the overall algorithm, as it moves a
> > potentially expensive operation (for_each_vcpu_of_each_domain {})
> > outside from the core placement loop, so that it is performed only
> > once instead of (potentially) tens or hundreds of times.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> You know this code best. :-)  I've looked it over and just have one 
> minor suggestion:
> 
Well, I certainly spent quite a bit of time on it, and it still is in
need of some more, but again, this change only speed things up at no
"functional cost", so (despite this not being a critical path) I really
think it is something we want.

BTW, thanks for taking a look.

> >           for (j = 0; j < nr_dom_vcpus; j++) {
> > +            /* For each vcpu of each domain, increment the elements of
> > +             * the array corresponding to the nodes where the vcpu runs */
> > +            libxl_bitmap_set_none(&vcpu_nodemap);
> > +            libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> > +                int node = tinfo[k].node;
> 
> I think I might rename "vcpu_nodemap" to something that suggests better 
> how it fits with the algorithm -- for instance, "counted_nodemap" or 
> "nodes_counted" -- something to suggest that this is how we avoid 
> counting the same vcpu on the same node multiple times.
> 
Good point, I'll go for something like that.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (6 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 16:22   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Which basically means the following two things:
 1) during domain creation, it is the node-affinity of
    the domain --rather than the vcpu-affinities of its
    VCPUs-- that is affected by automatic placement;
 2) during automatic placement, when counting how many
    VCPUs are already "bound" to a placement candidate
    (as part of the process of choosing the best
    candidate), both vcpu-affinity and node-affinity
    are considered.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
 {
     int found;
     libxl__numa_candidate candidate;
-    libxl_bitmap candidate_nodemap;
+    libxl_bitmap cpupool_nodemap;
     libxl_cpupoolinfo cpupool_info;
     int i, cpupool, rc = 0;
     uint32_t memkb;
 
     libxl__numa_candidate_init(&candidate);
-    libxl_bitmap_init(&candidate_nodemap);
+    libxl_bitmap_init(&cpupool_nodemap);
 
     /*
      * Extract the cpumap from the cpupool the domain belong to. In fact,
@@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
     rc = libxl_domain_need_memory(CTX, info, &memkb);
     if (rc)
         goto out;
-    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
+    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
         rc = ERROR_FAIL;
         goto out;
     }
@@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
     if (found == 0)
         goto out;
 
-    /* Map the candidate's node map to the domain's info->cpumap */
-    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
-    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
+    /* Map the candidate's node map to the domain's info->nodemap */
+    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
+
+    /* Avoid trying to set the affinity to nodes that might be in the
+     * candidate's nodemap but out of our cpupool. */
+    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
+                                 &cpupool_nodemap);
     if (rc)
         goto out;
 
-    /* Avoid trying to set the affinity to cpus that might be in the
-     * nodemap but not in our cpupool. */
-    libxl_for_each_set_bit(i, info->cpumap) {
-        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
-            libxl_bitmap_reset(&info->cpumap, i);
+    libxl_for_each_set_bit(i, info->nodemap) {
+        if (!libxl_bitmap_test(&cpupool_nodemap, i))
+            libxl_bitmap_reset(&info->nodemap, i);
     }
 
     LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
@@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
 
  out:
     libxl__numa_candidate_dispose(&candidate);
-    libxl_bitmap_dispose(&candidate_nodemap);
+    libxl_bitmap_dispose(&cpupool_nodemap);
     libxl_cpupoolinfo_dispose(&cpupool_info);
     return rc;
 }
@@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
     /*
      * Check if the domain has any CPU affinity. If not, try to build
      * up one. In case numa_place_domain() find at least a suitable
-     * candidate, it will affect info->cpumap accordingly; if it
+     * candidate, it will affect info->nodemap accordingly; if it
      * does not, it just leaves it as it is. This means (unless
      * some weird error manifests) the subsequent call to
-     * libxl_set_vcpuaffinity_all() will do the actual placement,
+     * libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
      */
     if (libxl_defbool_val(info->numa_placement)) {
diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
--- a/tools/libxl/libxl_numa.c
+++ b/tools/libxl/libxl_numa.c
@@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
                              int vcpus_on_node[])
 {
     libxl_dominfo *dinfo = NULL;
-    libxl_bitmap vcpu_nodemap;
+    libxl_bitmap dom_nodemap, vcpu_nodemap;
     int nr_doms, nr_cpus;
     int i, j, k;
 
@@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc *
         return ERROR_FAIL;
     }
 
+    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        libxl_dominfo_list_free(dinfo, nr_doms);
+        return ERROR_FAIL;
+    }
+
     for (i = 0; i < nr_doms; i++) {
         libxl_vcpuinfo *vinfo;
         int nr_dom_vcpus;
@@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc *
         if (vinfo == NULL)
             continue;
 
+        /* Retrieve the domain's node-affinity map */
+        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
+
         for (j = 0; j < nr_dom_vcpus; j++) {
-            /* For each vcpu of each domain, increment the elements of
-             * the array corresponding to the nodes where the vcpu runs */
+            /*
+             * For each vcpu of each domain, it must have both vcpu-affinity
+             * and node-affinity to (a pcpu belonging to) a certain node to
+             * cause an increment in the corresponding element of the array.
+             */
             libxl_bitmap_set_none(&vcpu_nodemap);
             libxl_for_each_set_bit(k, vinfo[j].cpumap) {
                 int node = tinfo[k].node;
 
                 if (libxl_bitmap_test(suitable_cpumap, k) &&
+                    libxl_bitmap_test(&dom_nodemap, node) &&
                     !libxl_bitmap_test(&vcpu_nodemap, node)) {
                     libxl_bitmap_set(&vcpu_nodemap, node);
                     vcpus_on_node[node]++;
@@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
         libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
     }
 
+    libxl_bitmap_dispose(&dom_nodemap);
     libxl_bitmap_dispose(&vcpu_nodemap);
     libxl_dominfo_list_free(dinfo, nr_doms);
     return 0;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
@ 2012-12-21 16:22   ` George Dunlap
  0 siblings, 0 replies; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:22 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Which basically means the following two things:
>   1) during domain creation, it is the node-affinity of
>      the domain --rather than the vcpu-affinities of its
>      VCPUs-- that is affected by automatic placement;
>   2) during automatic placement, when counting how many
>      VCPUs are already "bound" to a placement candidate
>      (as part of the process of choosing the best
>      candidate), both vcpu-affinity and node-affinity
>      are considered.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>

Re-confirming Ack.
  -George

>
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
>   {
>       int found;
>       libxl__numa_candidate candidate;
> -    libxl_bitmap candidate_nodemap;
> +    libxl_bitmap cpupool_nodemap;
>       libxl_cpupoolinfo cpupool_info;
>       int i, cpupool, rc = 0;
>       uint32_t memkb;
>   
>       libxl__numa_candidate_init(&candidate);
> -    libxl_bitmap_init(&candidate_nodemap);
> +    libxl_bitmap_init(&cpupool_nodemap);
>   
>       /*
>        * Extract the cpumap from the cpupool the domain belong to. In fact,
> @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
>       rc = libxl_domain_need_memory(CTX, info, &memkb);
>       if (rc)
>           goto out;
> -    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
> +    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
>           rc = ERROR_FAIL;
>           goto out;
>       }
> @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
>       if (found == 0)
>           goto out;
>   
> -    /* Map the candidate's node map to the domain's info->cpumap */
> -    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
> -    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> +    /* Map the candidate's node map to the domain's info->nodemap */
> +    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
> +
> +    /* Avoid trying to set the affinity to nodes that might be in the
> +     * candidate's nodemap but out of our cpupool. */
> +    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
> +                                 &cpupool_nodemap);
>       if (rc)
>           goto out;
>   
> -    /* Avoid trying to set the affinity to cpus that might be in the
> -     * nodemap but not in our cpupool. */
> -    libxl_for_each_set_bit(i, info->cpumap) {
> -        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
> -            libxl_bitmap_reset(&info->cpumap, i);
> +    libxl_for_each_set_bit(i, info->nodemap) {
> +        if (!libxl_bitmap_test(&cpupool_nodemap, i))
> +            libxl_bitmap_reset(&info->nodemap, i);
>       }
>   
>       LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
>   
>    out:
>       libxl__numa_candidate_dispose(&candidate);
> -    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_bitmap_dispose(&cpupool_nodemap);
>       libxl_cpupoolinfo_dispose(&cpupool_info);
>       return rc;
>   }
> @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
>       /*
>        * Check if the domain has any CPU affinity. If not, try to build
>        * up one. In case numa_place_domain() find at least a suitable
> -     * candidate, it will affect info->cpumap accordingly; if it
> +     * candidate, it will affect info->nodemap accordingly; if it
>        * does not, it just leaves it as it is. This means (unless
>        * some weird error manifests) the subsequent call to
> -     * libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * libxl_domain_set_nodeaffinity() will do the actual placement,
>        * whatever that turns out to be.
>        */
>       if (libxl_defbool_val(info->numa_placement)) {
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> --- a/tools/libxl/libxl_numa.c
> +++ b/tools/libxl/libxl_numa.c
> @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
>                                int vcpus_on_node[])
>   {
>       libxl_dominfo *dinfo = NULL;
> -    libxl_bitmap vcpu_nodemap;
> +    libxl_bitmap dom_nodemap, vcpu_nodemap;
>       int nr_doms, nr_cpus;
>       int i, j, k;
>   
> @@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           return ERROR_FAIL;
>       }
>   
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
> +        libxl_bitmap_dispose(&vcpu_nodemap);
> +        libxl_dominfo_list_free(dinfo, nr_doms);
> +        return ERROR_FAIL;
> +    }
> +
>       for (i = 0; i < nr_doms; i++) {
>           libxl_vcpuinfo *vinfo;
>           int nr_dom_vcpus;
> @@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           if (vinfo == NULL)
>               continue;
>   
> +        /* Retrieve the domain's node-affinity map */
> +        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
> +
>           for (j = 0; j < nr_dom_vcpus; j++) {
> -            /* For each vcpu of each domain, increment the elements of
> -             * the array corresponding to the nodes where the vcpu runs */
> +            /*
> +             * For each vcpu of each domain, it must have both vcpu-affinity
> +             * and node-affinity to (a pcpu belonging to) a certain node to
> +             * cause an increment in the corresponding element of the array.
> +             */
>               libxl_bitmap_set_none(&vcpu_nodemap);
>               libxl_for_each_set_bit(k, vinfo[j].cpumap) {
>                   int node = tinfo[k].node;
>   
>                   if (libxl_bitmap_test(suitable_cpumap, k) &&
> +                    libxl_bitmap_test(&dom_nodemap, node) &&
>                       !libxl_bitmap_test(&vcpu_nodemap, node)) {
>                       libxl_bitmap_set(&vcpu_nodemap, node);
>                       vcpus_on_node[node]++;
> @@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc *
>           libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
>       }
>   
> +    libxl_bitmap_dispose(&dom_nodemap);
>       libxl_bitmap_dispose(&vcpu_nodemap);
>       libxl_dominfo_list_free(dinfo, nr_doms);
>       return 0;

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (7 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-21 16:34   ` George Dunlap
  2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

Node-affinity is now something that is under (some) control of the
user, so show it upon request as part of the output of `xl list'
by the `-n' option.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
---
Changes from v1:
 * print_{cpu,node}map() functions added instead of 'state variable'-izing
   print_bitmap().

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -2961,14 +2961,95 @@ out:
     }
 }
 
-static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
+/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
+static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
+{
+    int i;
+    uint8_t pmap = 0, bitmask = 0;
+    int firstset = 0, state = 0;
+
+    for (i = 0; i < maplen; i++) {
+        if (i % 8 == 0) {
+            pmap = *map++;
+            bitmask = 1;
+        } else bitmask <<= 1;
+
+        switch (state) {
+        case 0:
+        case 2:
+            if ((pmap & bitmask) != 0) {
+                firstset = i;
+                state++;
+            }
+            continue;
+        case 1:
+        case 3:
+            if ((pmap & bitmask) == 0) {
+                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+                if (i - 1 > firstset)
+                    fprintf(stream, "-%d", i - 1);
+                state = 2;
+            }
+            continue;
+        }
+    }
+    switch (state) {
+        case 0:
+            fprintf(stream, "none");
+            break;
+        case 2:
+            break;
+        case 1:
+            if (firstset == 0)
+                return 1;
+        case 3:
+            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
+            if (i - 1 > firstset)
+                fprintf(stream, "-%d", i - 1);
+            break;
+    }
+
+    return 0;
+}
+
+static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any cpu");
+}
+
+static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
+{
+    if (print_bitmap(map, maplen, stream))
+        fprintf(stream, "any node");
+}
+
+static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain)
 {
     int i;
     static const char shutdown_reason_letters[]= "-rscw";
+    libxl_bitmap nodemap;
+    libxl_physinfo physinfo;
+
+    libxl_bitmap_init(&nodemap);
+    libxl_physinfo_init(&physinfo);
 
     printf("Name                                        ID   Mem VCPUs\tState\tTime(s)");
     if (verbose) printf("   UUID                            Reason-Code\tSecurity Label");
     if (context && !verbose) printf("   Security Label");
+    if (numa) {
+        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) {
+            fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n");
+            exit(1);
+        }
+        if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+            fprintf(stderr, "libxl_physinfo failed.\n");
+            libxl_bitmap_dispose(&nodemap);
+            exit(1);
+        }
+
+        printf(" NODE Affinity");
+    }
     printf("\n");
     for (i = 0; i < nb_domain; i++) {
         char *domname;
@@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in
             rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf,
                                             &size);
             if (rc < 0)
-                printf("  -");
+                printf("                -");
             else {
-                printf("  %s", buf);
+                printf(" %16s", buf);
                 free(buf);
             }
         }
+        if (numa) {
+            libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
+
+            putchar(' ');
+            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
+        }
         putchar('\n');
     }
+
+    libxl_bitmap_dispose(&nodemap);
+    libxl_physinfo_dispose(&physinfo);
 }
 
 static void list_vm(void)
@@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv)
     int opt, verbose = 0;
     int context = 0;
     int details = 0;
+    int numa = 0;
     int option_index = 0;
     static struct option long_options[] = {
         {"long", 0, 0, 'l'},
         {"help", 0, 0, 'h'},
         {"verbose", 0, 0, 'v'},
         {"context", 0, 0, 'Z'},
+        {"numa", 0, 0, 'n'},
         {0, 0, 0, 0}
     };
 
@@ -3904,7 +3996,7 @@ int main_list(int argc, char **argv)
     int nb_domain, rc;
 
     while (1) {
-        opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index);
+        opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index);
         if (opt == -1)
             break;
 
@@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv)
         case 'Z':
             context = 1;
             break;
+        case 'n':
+            numa = 1;
+            break;
         default:
             fprintf(stderr, "option `%c' not supported.\n", optopt);
             break;
@@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv)
     if (details)
         list_domains_details(info, nb_domain);
     else
-        list_domains(verbose, context, info, nb_domain);
+        list_domains(verbose, context, numa, info, nb_domain);
 
     if (info_free)
         libxl_dominfo_list_free(info, nb_domain);
@@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a
     return 0;
 }
 
-static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
-{
-    int i;
-    uint8_t pmap = 0, bitmask = 0;
-    int firstset = 0, state = 0;
-
-    for (i = 0; i < maplen; i++) {
-        if (i % 8 == 0) {
-            pmap = *map++;
-            bitmask = 1;
-        } else bitmask <<= 1;
-
-        switch (state) {
-        case 0:
-        case 2:
-            if ((pmap & bitmask) != 0) {
-                firstset = i;
-                state++;
-            }
-            continue;
-        case 1:
-        case 3:
-            if ((pmap & bitmask) == 0) {
-                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-                if (i - 1 > firstset)
-                    fprintf(stream, "-%d", i - 1);
-                state = 2;
-            }
-            continue;
-        }
-    }
-    switch (state) {
-        case 0:
-            fprintf(stream, "none");
-            break;
-        case 2:
-            break;
-        case 1:
-            if (firstset == 0) {
-                fprintf(stream, "any cpu");
-                break;
-            }
-        case 3:
-            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
-            if (i - 1 > firstset)
-                fprintf(stream, "-%d", i - 1);
-            break;
-    }
-}
-
 static void print_vcpuinfo(uint32_t tdomid,
                            const libxl_vcpuinfo *vcpuinfo,
                            uint32_t nr_cpus)
@@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom
     /*      TIM */
     printf("%9.1f  ", ((float)vcpuinfo->vcpu_time / 1e9));
     /* CPU AFFINITY */
-    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
+    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
     printf("\n");
 }
 
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = {
       "[options] [Domain]\n",
       "-l, --long              Output all VM details\n"
       "-v, --verbose           Prints out UUIDs and security context\n"
-      "-Z, --context           Prints out security context"
+      "-Z, --context           Prints out security context\n"
+      "-n, --numa              Prints out NUMA node affinity"
     },
     { "destroy",
       &main_destroy, 0, 1,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2012-12-21 16:34   ` George Dunlap
  2012-12-21 16:54     ` Dario Faggioli
  0 siblings, 1 reply; 57+ messages in thread
From: George Dunlap @ 2012-12-21 16:34 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson

On 19/12/12 19:07, Dario Faggioli wrote:
> Node-affinity is now something that is under (some) control of the
> user, so show it upon request as part of the output of `xl list'
> by the `-n' option.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
> ---
> Changes from v1:
>   * print_{cpu,node}map() functions added instead of 'state variable'-izing
>     print_bitmap().
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -2961,14 +2961,95 @@ out:
>       }
>   }
>   
> -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
> +/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
> +static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    int i;
> +    uint8_t pmap = 0, bitmask = 0;
> +    int firstset = 0, state = 0;
> +
> +    for (i = 0; i < maplen; i++) {
> +        if (i % 8 == 0) {
> +            pmap = *map++;
> +            bitmask = 1;
> +        } else bitmask <<= 1;
> +
> +        switch (state) {
> +        case 0:
> +        case 2:
> +            if ((pmap & bitmask) != 0) {
> +                firstset = i;
> +                state++;
> +            }
> +            continue;
> +        case 1:
> +        case 3:
> +            if ((pmap & bitmask) == 0) {
> +                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> +                if (i - 1 > firstset)
> +                    fprintf(stream, "-%d", i - 1);
> +                state = 2;
> +            }
> +            continue;
> +        }
> +    }
> +    switch (state) {
> +        case 0:
> +            fprintf(stream, "none");
> +            break;
> +        case 2:
> +            break;
> +        case 1:
> +            if (firstset == 0)
> +                return 1;
> +        case 3:
> +            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> +            if (i - 1 > firstset)
> +                fprintf(stream, "-%d", i - 1);
> +            break;
> +    }
> +
> +    return 0;
> +}

Just checking -- is the print_bitmap() thing pure code motion? If so, 
would you mind saying that explicitly in the commit message, just to 
save people time when reading this patch?

Other than that, looks OK to me -- I haven't done a detailed review of 
the output layout however.

  -George

> +
> +static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    if (print_bitmap(map, maplen, stream))
> +        fprintf(stream, "any cpu");
> +}
> +
> +static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
> +{
> +    if (print_bitmap(map, maplen, stream))
> +        fprintf(stream, "any node");
> +}
> +
> +static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain)
>   {
>       int i;
>       static const char shutdown_reason_letters[]= "-rscw";
> +    libxl_bitmap nodemap;
> +    libxl_physinfo physinfo;
> +
> +    libxl_bitmap_init(&nodemap);
> +    libxl_physinfo_init(&physinfo);
>   
>       printf("Name                                        ID   Mem VCPUs\tState\tTime(s)");
>       if (verbose) printf("   UUID                            Reason-Code\tSecurity Label");
>       if (context && !verbose) printf("   Security Label");
> +    if (numa) {
> +        if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) {
> +            fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n");
> +            exit(1);
> +        }
> +        if (libxl_get_physinfo(ctx, &physinfo) != 0) {
> +            fprintf(stderr, "libxl_physinfo failed.\n");
> +            libxl_bitmap_dispose(&nodemap);
> +            exit(1);
> +        }
> +
> +        printf(" NODE Affinity");
> +    }
>       printf("\n");
>       for (i = 0; i < nb_domain; i++) {
>           char *domname;
> @@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in
>               rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf,
>                                               &size);
>               if (rc < 0)
> -                printf("  -");
> +                printf("                -");
>               else {
> -                printf("  %s", buf);
> +                printf(" %16s", buf);
>                   free(buf);
>               }
>           }
> +        if (numa) {
> +            libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
> +
> +            putchar(' ');
> +            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
> +        }
>           putchar('\n');
>       }
> +
> +    libxl_bitmap_dispose(&nodemap);
> +    libxl_physinfo_dispose(&physinfo);
>   }
>   
>   static void list_vm(void)
> @@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv)
>       int opt, verbose = 0;
>       int context = 0;
>       int details = 0;
> +    int numa = 0;
>       int option_index = 0;
>       static struct option long_options[] = {
>           {"long", 0, 0, 'l'},
>           {"help", 0, 0, 'h'},
>           {"verbose", 0, 0, 'v'},
>           {"context", 0, 0, 'Z'},
> +        {"numa", 0, 0, 'n'},
>           {0, 0, 0, 0}
>       };
>   
> @@ -3904,7 +3996,7 @@ int main_list(int argc, char **argv)
>       int nb_domain, rc;
>   
>       while (1) {
> -        opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index);
> +        opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index);
>           if (opt == -1)
>               break;
>   
> @@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv)
>           case 'Z':
>               context = 1;
>               break;
> +        case 'n':
> +            numa = 1;
> +            break;
>           default:
>               fprintf(stderr, "option `%c' not supported.\n", optopt);
>               break;
> @@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv)
>       if (details)
>           list_domains_details(info, nb_domain);
>       else
> -        list_domains(verbose, context, info, nb_domain);
> +        list_domains(verbose, context, numa, info, nb_domain);
>   
>       if (info_free)
>           libxl_dominfo_list_free(info, nb_domain);
> @@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a
>       return 0;
>   }
>   
> -static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
> -{
> -    int i;
> -    uint8_t pmap = 0, bitmask = 0;
> -    int firstset = 0, state = 0;
> -
> -    for (i = 0; i < maplen; i++) {
> -        if (i % 8 == 0) {
> -            pmap = *map++;
> -            bitmask = 1;
> -        } else bitmask <<= 1;
> -
> -        switch (state) {
> -        case 0:
> -        case 2:
> -            if ((pmap & bitmask) != 0) {
> -                firstset = i;
> -                state++;
> -            }
> -            continue;
> -        case 1:
> -        case 3:
> -            if ((pmap & bitmask) == 0) {
> -                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> -                if (i - 1 > firstset)
> -                    fprintf(stream, "-%d", i - 1);
> -                state = 2;
> -            }
> -            continue;
> -        }
> -    }
> -    switch (state) {
> -        case 0:
> -            fprintf(stream, "none");
> -            break;
> -        case 2:
> -            break;
> -        case 1:
> -            if (firstset == 0) {
> -                fprintf(stream, "any cpu");
> -                break;
> -            }
> -        case 3:
> -            fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> -            if (i - 1 > firstset)
> -                fprintf(stream, "-%d", i - 1);
> -            break;
> -    }
> -}
> -
>   static void print_vcpuinfo(uint32_t tdomid,
>                              const libxl_vcpuinfo *vcpuinfo,
>                              uint32_t nr_cpus)
> @@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom
>       /*      TIM */
>       printf("%9.1f  ", ((float)vcpuinfo->vcpu_time / 1e9));
>       /* CPU AFFINITY */
> -    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
> +    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
>       printf("\n");
>   }
>   
> diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
> --- a/tools/libxl/xl_cmdtable.c
> +++ b/tools/libxl/xl_cmdtable.c
> @@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = {
>         "[options] [Domain]\n",
>         "-l, --long              Output all VM details\n"
>         "-v, --verbose           Prints out UUIDs and security context\n"
> -      "-Z, --context           Prints out security context"
> +      "-Z, --context           Prints out security context\n"
> +      "-n, --numa              Prints out NUMA node affinity"
>       },
>       { "destroy",
>         &main_destroy, 0, 1,

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
  2012-12-21 16:34   ` George Dunlap
@ 2012-12-21 16:54     ` Dario Faggioli
  0 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-21 16:54 UTC (permalink / raw)
  To: George Dunlap
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf,
	Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 3373 bytes --]

On Fri, 2012-12-21 at 16:34 +0000, George Dunlap wrote: 
> Just checking -- is the print_bitmap() thing pure code motion? If so, 
> would you mind saying that explicitly in the commit message, just to 
> save people time when reading this patch?
> 
It's _mostly_ code motion, but I had to hack it a tiny little bit to
make it possible to use the same function for printing cpu and node
bitmaps (basically, in case all bits are sets, the function was printing
"any cpu" itself, which wasn't fitting well with node maps).

But yeah, I should have mentioned what I just put here in the changelog.
Will do in v3.

> Other than that, looks OK to me -- I haven't done a detailed review of 
> the output layout however.
> 
That's the most tricky part of this patch! :-)

For your and others convenience, here's how it looks like on my testbox
(sorry for the extra long lines, but that's not my fault!):

root@Zhaman:~# xl list
Name                                        ID   Mem VCPUs	State	Time(s)
Domain-0                                     0   375    16     r-----     225.1
vm1                                          1   960     2     -b----      34.0
root@Zhaman:~# xl list -Z
Name                                        ID   Mem VCPUs	State	Time(s)   Security Label
Domain-0                                     0   375    16     r-----     225.4                -
vm1                                          1   960     2     -b----      34.0                -
root@Zhaman:~# xl list -v
Name                                        ID   Mem VCPUs	State	Time(s)   UUID                            Reason-Code	Security Label
Domain-0                                     0   375    16     r-----     226.6 00000000-0000-0000-0000-000000000000        -                -
vm1                                          1   960     2     -b----      34.2 e36429cc-d2a2-4da7-b21d-b053f725e7a7        -                -

root@Zhaman:~# xl list -n
Name                                        ID   Mem VCPUs	State	Time(s) NODE Affinity
Domain-0                                     0   375    16     r-----     226.8 any node
vm1                                          1   960     2     -b----      34.2 0
root@Zhaman:~# xl list -nZ
Name                                        ID   Mem VCPUs	State	Time(s)   Security Label NODE Affinity
Domain-0                                     0   375    16     r-----     226.9                - any node
vm1                                          1   960     2     -b----      34.2                - 0
root@Zhaman:~# xl list -nv
Name                                        ID   Mem VCPUs	State	Time(s)   UUID                            Reason-Code	Security Label NODE Affinity
Domain-0                                     0   375    16     r-----     227.0 00000000-0000-0000-0000-000000000000        -                - any node
vm1                                          1   960     2     -b----      34.2 e36429cc-d2a2-4da7-b21d-b053f725e7a7        -                - 0

Reasonable enough?

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (8 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
@ 2012-12-19 19:07 ` Dario Faggioli
  2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
  2013-01-11 12:19 ` Ian Campbell
  11 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 19:07 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson

To include the new concept of NUMA aware scheduling and its
impact.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -14,22 +14,67 @@ the memory directly attached to the set 
 
 The Xen hypervisor deals with NUMA machines by assigning to each domain
 a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated.
+get their memory allocated. Also, even if the node affinity of a domain
+is allowed to change on-line, it is very important to "place" the domain
+correctly when it is fist created, as the most of its memory is allocated
+at that time and can not (for now) be moved easily.
 
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
 performance degradation is likely to be noticeable.
 
-## Guest Placement in xl ##
+For more information, have a look at the [Xen NUMA Introduction][numa_intro]
+page on the Wiki.
+
+### Placing via pinning and cpupools ###
+
+The simplest way of placing a domain on a NUMA node is statically pinning
+the domain's vCPUs to the pCPUs of the node. This goes under the name of
+CPU affinity and can be set through the "cpus=" option in the config file
+(more about this below). Another option is to pool together the pCPUs
+spanning the node and put the domain in such a cpupool with the "pool="
+config option (as documented in our [Wiki][cpupools_howto]).
+
+In both the above cases, the domain will not be able to execute outside
+the specified set of pCPUs for any reasons, even if all those pCPUs are
+busy doing something else while there are others, idle, pCPUs.
+
+So, when doing this, local memory accesses are 100% guaranteed, but that
+may come at he cost of some load imbalances.
+
+### NUMA aware scheduling ###
+
+If the credit scheduler is in use, the concept of node affinity defined
+above does not only apply to memory. In fact, starting from Xen 4.3, the
+scheduler always tries to run the domain's vCPUs on one of the nodes in
+its node affinity. Only if that turns out to be impossible, it will just
+pick any free pCPU.
+
+This is, therefore, something more flexible than CPU affinity, as a domain
+can still run everywhere, it just prefers some nodes rather than others.
+Locality of access is less guaranteed than in the pinning case, but that
+comes along with better chances to exploit all the host resources (e.g.,
+the pCPUs).
+
+In fact, if all the pCPUs in a domain's node affinity are busy, it is
+possible for the domain to run outside of there, but it is very likely that
+slower execution (due to remote memory accesses) is still better than no
+execution at all, as it would happen with pinning. For this reason, NUMA
+aware scheduling has the potential of bringing substantial performances
+benefits, although this will depend on the workload.
+
+## Guest placement in xl ##
 
 If using xl for creating and managing guests, it is very easy to ask for
 both manual or automatic placement of them across the host's NUMA nodes.
 
-Note that xm/xend does the very same thing, the only differences residing
-in the details of the heuristics adopted for the placement (see below).
+Note that xm/xend does a very similar thing, the only differences being
+the details of the heuristics adopted for automatic placement (see below),
+and the lack of support (in both xm/xend and the Xen versions where that\
+was the default toolstack) for NUMA aware scheduling.
 
-### Manual Guest Placement with xl ###
+### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
@@ -41,14 +86,19 @@ This is very simple and effective, but r
 administrator to explicitly specify affinities for each and every domain,
 or Xen won't be able to guarantee the locality for their memory accesses.
 
-It is also possible to deal with NUMA by partitioning the system using
-cpupools. Again, this could be "The Right Answer" for many needs and
-occasions, but has to be carefully considered and setup by hand.
+Notice that this also pins the domain's vCPUs to the specified set of
+pCPUs, so it not only sets the domain's node affinity (its memory will
+come from the nodes to which the pCPUs belong), but at the same time
+forces the vCPUs of the domain to be scheduled on those same pCPUs.
 
-### Automatic Guest Placement with xl ###
+### Placing the guest automatically ###
 
 If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
+If it finds one (some), the domain's node affinity get set to there,
+and both memory allocations and NUMA aware scheduling (for the credit
+scheduler and starting from Xen 4.3) will comply with it.
+
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
 the various VMs with different memory sizes are the items to be packed,
@@ -81,7 +131,7 @@ largest amounts of free memory helps kee
 small, and maximizes the probability of being able to put more domains
 there.
 
-## Guest Placement within libxl ##
+## Guest placement in libxl ##
 
 xl achieves automatic NUMA placement because that is what libxl does
 by default. No API is provided (yet) for modifying the behaviour of
@@ -93,15 +143,34 @@ any placement from happening:
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
 Also, if `numa_placement` is set to `true`, the domain must not
-have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have any CPU affinity (i.e., `domain_build_info->cpumap` must
 have all its bits set, as it is by default), or domain creation
 will fail returning `ERROR_INVAL`.
 
+Starting from Xen 4.3, in case automatic placement happens (and is
+successful), it will affect the domain's node affinity and _not_ its
+CPU affinity. Namely, the domain's vCPUs will not be pinned to any
+pCPU on the host, but the memory from the domain will come from the
+selected node(s) and the NUMA aware scheduling (if the credit scheduler
+is in use) will try to keep the domain there as much as possible.
+
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.3 ##
+
+As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
+bit different for earlier version of Xen. If no "cpus=" option is specified
+and Xen 4.2 is in use, the automatic placement algorithm still runs, but
+the results is used to _pin_ the vCPUs of the domain to the output node(s).
+This is consistent with what was happening with xm/xend, which were also
+affecting the domain's CPU affinity.
+
+On a version of Xen earlier than 4.2, there is not automatic placement at
+all in xl or libxl, and hence no node or CPU affinity being affected.
+
 ## Limitations ##
 
 Analyzing various possible placement solutions is what makes the
@@ -109,3 +178,6 @@ algorithm flexible and quite effective. 
 it won't scale well to systems with arbitrary number of nodes.
 For this reason, automatic placement is disabled (with a warning)
 if it is requested on a host with more than 16 NUMA nodes.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (9 preceding siblings ...)
  2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
@ 2012-12-19 23:16 ` Dario Faggioli
  2013-01-11 12:19 ` Ian Campbell
  11 siblings, 0 replies; 57+ messages in thread
From: Dario Faggioli @ 2012-12-19 23:16 UTC (permalink / raw)
  To: xen-devel
  Cc: Marcus Granado, Dan Magenheimer, Ian Campbell, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	Jan Beulich, Daniel De Graaf, Matt Wilson


[-- Attachment #1.1: Type: text/plain, Size: 2019 bytes --]

On Wed, 2012-12-19 at 20:07 +0100, Dario Faggioli wrote: 
> Which, reasoning in terms of %-performances increase/decrease, means NUMA aware
> scheduling does as follows, as compared to no affinity at all and to pinning:
> 
>      ----------------------------------
>      | SpecJBB2005 (throughput)       |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      |    2 |   +14.36%   |   -0.36%  |
>      |    6 |   +14.72%   |   -0.26%  |
>      |   10 |   +11.88%   |   -2.44%  |
>      ----------------------------------
>      | Sysbench memory (throughput)   |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      |    2 |   +14.12%   |   +0.09%  |
>      |    6 |   +11.12%   |   +2.14%  |
>      |   10 |   +11.81%   |   +5.06%  |
>      ----------------------------------
>      | LMBench proc (latency)         |
>      ----------------------------------
>      | #VMs | No affinity |  Pinning  |
>      ----------------------------------
>      |    2 |   +10.02%   |   +1.07%  |
>      |    6 |    +3.45%   |   +1.02%  |
>      |   10 |    +2.94%   |   +4.53%  |
>      ----------------------------------
> 
Just to be sure, as I may have not picked up the perfect wording, in the
table above a +xx.yy% means NUMA aware scheduling (i.e., with this patch
series fully applied) performs xx.yy% _better_ than either 'No affinity'
or 'Pinning'. Conversely, a -zz.ww% means it performs zz.ww% worse.

Sorry but the different combination and the presence of both throughput
values (which are better if high) and latency values (which are better
if low) made things a little bit tricky to present effectively. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
                   ` (10 preceding siblings ...)
  2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
@ 2013-01-11 12:19 ` Ian Campbell
  2013-01-11 13:57   ` Dario Faggioli
  11 siblings, 1 reply; 57+ messages in thread
From: Ian Campbell @ 2013-01-11 12:19 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf

On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
>  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>  * [ 2/10] xen, libxc: introduce node maps and masks

I was just about to apply these two when I realised they had a
hypervisor component too. Needs an Ack from Jan and/or Keir then, I
think.

>    [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
>    [ 4/10] xen: allow for explicitly specifying node-affinity

These are for Keir and or Jan I think.

I wasn't sure if it was ok to apply the rest without the above. But
given that I'm now not immediately doing 1+2 either that's somewhat
moot.

>  * [ 5/10] libxc: allow for explicitly specifying node-affinity
>  * [ 6/10] libxl: allow for explicitly specifying node-affinity
>    [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
>  * [ 8/10] libxl: automatic placement deals with node-affinity
>  * [ 9/10] xl: add node-affinity to the output of `xl list`
>    [10/10] docs: rearrange and update NUMA placement documentation
> 
> Thanks and Regards,
> Dario
> 
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2013-01-11 12:19 ` Ian Campbell
@ 2013-01-11 13:57   ` Dario Faggioli
  2013-01-11 14:09     ` Ian Campbell
  0 siblings, 1 reply; 57+ messages in thread
From: Dario Faggioli @ 2013-01-11 13:57 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf


[-- Attachment #1.1: Type: text/plain, Size: 1399 bytes --]

On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> >  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> >  * [ 2/10] xen, libxc: introduce node maps and masks
> 
> I was just about to apply these two when I realised they had a
> hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> think.
> 
Mmm... But there have been comments, on at least 2, and I need to
address them in v3 (which I'll repost shortly... Sorry but I've been
sidetracked a bit).

> >    [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
> >    [ 4/10] xen: allow for explicitly specifying node-affinity
> 
> These are for Keir and or Jan I think.
> 
> I wasn't sure if it was ok to apply the rest without the above. 
>
Well, as above, 3 needs reworking to let me cope with Juergen, Jan's and
George's comments.

So, if it's fine for you, I'd say to wait for a v3 before applying
anything from this series.

Sorry if I did not make the fact that there _will_ be a v3 not clear
enough. :-P

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 00 of 10 v2] NUMA aware credit scheduling
  2013-01-11 13:57   ` Dario Faggioli
@ 2013-01-11 14:09     ` Ian Campbell
  0 siblings, 0 replies; 57+ messages in thread
From: Ian Campbell @ 2013-01-11 14:09 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Marcus Granado, Dan Magenheimer, Matt Wilson, Anil Madhavapeddy,
	George Dunlap, Andrew Cooper, Juergen Gross, Ian Jackson,
	xen-devel@lists.xen.org, Jan Beulich, Daniel De Graaf

On Fri, 2013-01-11 at 13:57 +0000, Dario Faggioli wrote:
> On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> > On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > >  * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> > >  * [ 2/10] xen, libxc: introduce node maps and masks
> > 
> > I was just about to apply these two when I realised they had a
> > hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> > think.
> > 
> Mmm... But there have been comments, on at least 2, and I need to
> address them in v3 (which I'll repost shortly... Sorry but I've been
> sidetracked a bit).

Ah, I was just going by the two acks. OK I'll wait

Ian.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2013-01-11 14:09 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-12-19 19:07 [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2012-12-19 19:07 ` [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap Dario Faggioli
2012-12-20  9:17   ` Jan Beulich
2012-12-20  9:35     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks Dario Faggioli
2012-12-20  9:18   ` Jan Beulich
2012-12-20  9:55     ` Dario Faggioli
2012-12-20 14:33     ` George Dunlap
2012-12-20 14:52       ` Jan Beulich
2012-12-20 15:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity Dario Faggioli
2012-12-20  6:44   ` Juergen Gross
2012-12-20  8:16     ` Dario Faggioli
2012-12-20  8:25       ` Juergen Gross
2012-12-20  8:33         ` Dario Faggioli
2012-12-20  8:39           ` Juergen Gross
2012-12-20  8:58             ` Dario Faggioli
2012-12-20 15:28             ` George Dunlap
2012-12-20 16:00               ` Dario Faggioli
2012-12-20  9:22           ` Jan Beulich
2012-12-20 15:56   ` George Dunlap
2012-12-20 17:12     ` Dario Faggioli
2012-12-20 16:48   ` George Dunlap
2012-12-20 18:18     ` Dario Faggioli
2012-12-21 14:29       ` George Dunlap
2012-12-21 16:07         ` Dario Faggioli
2012-12-20 20:21   ` George Dunlap
2012-12-21  0:18     ` Dario Faggioli
2012-12-21 14:56       ` George Dunlap
2012-12-21 16:13         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity Dario Faggioli
2012-12-21 15:17   ` George Dunlap
2012-12-21 16:17     ` Dario Faggioli
2013-01-03 16:05     ` Daniel De Graaf
2012-12-19 19:07 ` [PATCH 05 of 10 v2] libxc: " Dario Faggioli
2012-12-21 15:19   ` George Dunlap
2012-12-21 16:27     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 06 of 10 v2] libxl: " Dario Faggioli
2012-12-21 15:30   ` George Dunlap
2012-12-21 16:18     ` Dario Faggioli
2012-12-21 17:02       ` Ian Jackson
2012-12-21 17:09         ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate Dario Faggioli
2012-12-20  8:41   ` Ian Campbell
2012-12-20  9:24     ` Dario Faggioli
2012-12-21 16:00   ` George Dunlap
2012-12-21 16:23     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity Dario Faggioli
2012-12-21 16:22   ` George Dunlap
2012-12-19 19:07 ` [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list` Dario Faggioli
2012-12-21 16:34   ` George Dunlap
2012-12-21 16:54     ` Dario Faggioli
2012-12-19 19:07 ` [PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation Dario Faggioli
2012-12-19 23:16 ` [PATCH 00 of 10 v2] NUMA aware credit scheduling Dario Faggioli
2013-01-11 12:19 ` Ian Campbell
2013-01-11 13:57   ` Dario Faggioli
2013-01-11 14:09     ` Ian Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).