[PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments
@ 2015-02-26 13:44 Jan Beulich
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
                   ` (5 more replies)
  0 siblings, 6 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:44 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Andrew Cooper, Dario Faggioli, Ian Jackson,
	Tim Deegan, Ian Campbell

1: x86: allow specifying the NUMA nodes Dom0 should run on
2: allow domain heap allocations to specify more than one NUMA node
3: x86: widen NUMA nodes to be allocated from
4: VT-d: widen NUMA nodes to be allocated from
5: AMD IOMMU: widen NUMA nodes to be allocated from

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
To apply cleanly his depends on "x86/Dom0: account for shadow/HAP allocation"
(http://lists.xenproject.org/archives/html/xen-devel/2015-02/msg03111.html).

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
@ 2015-02-26 13:52 ` Jan Beulich
  2015-02-26 17:14   ` Dario Faggioli
                     ` (2 more replies)
  2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:52 UTC (permalink / raw)
  To: xen-devel
  Cc: Keir Fraser, Andrew Cooper, Dario Faggioli, Ian Jackson,
	Tim Deegan, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 9126 bytes --]

... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
"dom0_max_vcpus" ones.

Note that this gives meaning to MEMF_exact_node specified alone (i.e.
implicitly combined with NUMA_NO_NODE): In such a case any node inside
the domain's node mask is acceptable, but no other node. This changed
behavior is (implicitly) being exposed through the memop hypercalls.

Note further that this change doesn't take care of moving the initrd
image into memory matching Dom0's affinity when the initrd doesn't get
copied (because of being part of the initial mapping) anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm uncertain whether range restricting the PXMs permitted for Dom0 is
the right approach (matching what other NUMA code did until recently),
or whether we would instead want to simply limit the number of PXMs we
can handler there (i.e. using a static array instead of a static
bitmap).

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -540,6 +540,15 @@ any dom0 autoballooning feature present 
 _xl.conf(5)_ man page or [Xen Best
 Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
 
+### dom0\_nodes
+
+> `= <integer>[,...]`
+
+Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
+and memory assigned to Dom0 will be adjusted to match the node
+restrictions set up here. Note that the values to be specified here are
+ACPI PXM ones, not Xen internal node numbers.
+
 ### dom0\_shadow
 > `= <boolean>`
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -100,11 +100,58 @@ static void __init parse_dom0_max_vcpus(
 }
 custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
 
+#define MAX_DOM0_PXM 255
+static __initdata DECLARE_BITMAP(dom0_pxms, MAX_DOM0_PXM + 1);
+
+static void __init parse_dom0_nodes(const char *s)
+{
+    do {
+        unsigned int pxm = simple_strtoul(s, &s, 0);
+
+        if ( pxm <= MAX_DOM0_PXM )
+            __set_bit(pxm, dom0_pxms);
+    } while ( *s++ == ',' );
+}
+custom_param("dom0_nodes", parse_dom0_nodes);
+
+static cpumask_t __initdata dom0_cpus;
+
+static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
+                                      unsigned int cpu)
+{
+    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
+
+    if ( v )
+    {
+        if ( !d->is_pinned )
+            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
+        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
+    }
+
+    return v;
+}
+
+static nodemask_t __initdata dom0_nodes;
+
 unsigned int __init dom0_max_vcpus(void)
 {
-    unsigned max_vcpus;
+    unsigned int pxm, max_vcpus;
+    nodeid_t node;
+
+    for ( pxm = 0; pxm <= MAX_DOM0_PXM; ++pxm )
+        if ( test_bit(pxm, dom0_pxms) &&
+             (node = pxm_to_node(pxm)) != NUMA_NO_NODE )
+            node_set(node, dom0_nodes);
+    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
+    if ( nodes_empty(dom0_nodes) )
+        dom0_nodes = node_online_map;
+    for_each_node_mask ( node, dom0_nodes )
+        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
+    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
+    if ( cpumask_empty(&dom0_cpus) )
+        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
 
-    max_vcpus = num_cpupool_cpus(cpupool0);
+    max_vcpus = cpumask_weight(&dom0_cpus);
     if ( opt_dom0_max_vcpus_min > max_vcpus )
         max_vcpus = opt_dom0_max_vcpus_min;
     if ( opt_dom0_max_vcpus_max < max_vcpus )
@@ -119,12 +166,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
 {
     unsigned int max_vcpus = dom0_max_vcpus();
 
+    dom0->node_affinity = dom0_nodes;
+    dom0->auto_node_affinity = !!bitmap_empty(dom0_pxms, MAX_DOM0_PXM + 1);
+
     dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
     if ( !dom0->vcpu )
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return alloc_vcpu(dom0, 0, 0);
+    return setup_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -156,7 +206,7 @@ static struct page_info * __init alloc_c
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma;
+    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -190,7 +240,7 @@ static struct page_info * __init alloc_c
 
         if ( d->tot_pages + (1 << order) > d->max_pages )
             continue;
-        pg2 = alloc_domheap_pages(d, order, 0);
+        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
         if ( pg2 > page )
         {
             free_domheap_pages(page, free_order);
@@ -217,10 +267,14 @@ static unsigned long __init dom0_paging_
 static unsigned long __init compute_dom0_nr_pages(
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
-    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
-    unsigned long nr_pages, min_pages, max_pages;
+    nodeid_t node;
+    unsigned long avail = 0, nr_pages, min_pages, max_pages;
     bool_t need_paging;
 
+    for_each_node_mask ( node, dom0_nodes )
+        avail += avail_domheap_pages_region(node, 0, 0) +
+                 initial_images_nrpages(node);
+
     /* Reserve memory for further dom0 vcpu-struct allocations... */
     avail -= (d->max_vcpus - 1UL)
              << get_order_from_bytes(sizeof(struct vcpu));
@@ -1230,11 +1284,11 @@ int __init construct_dom0(
 
     printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
 
-    cpu = cpumask_first(cpupool0->cpu_valid);
+    cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
-        (void)alloc_vcpu(d, i, cpu);
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_vcpu(d, i, cpu);
     }
 
     /*
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
-unsigned long __init initial_images_nrpages(void)
+unsigned long __init initial_images_nrpages(nodeid_t node)
 {
+    unsigned long node_start = node_start_pfn(node);
+    unsigned long node_end = node_end_pfn(node);
     unsigned long nr;
     unsigned int i;
 
     for ( nr = i = 0; i < nr_initial_images; ++i )
-        nr += PFN_UP(initial_images[i].mod_end);
+    {
+        unsigned long start = initial_images[i].mod_start;
+        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
+
+        if ( end > node_start && node_end > start )
+            nr += min(node_end, end) - max(node_start, start);
+    }
 
     return nr;
 }
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
     struct domain *d)
 {
     unsigned int i, j, zone = 0, nodemask_retry = 0;
-    nodeid_t first_node, node = MEMF_get_node(memflags);
+    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
@@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
 
     if ( node == NUMA_NO_NODE )
     {
-        memflags &= ~MEMF_exact_node;
         if ( d != NULL )
         {
             node = next_node(d->last_alloc_node, nodemask);
@@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
                     goto found;
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
-        if ( memflags & MEMF_exact_node )
+        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
             goto not_found;
 
         /* Pick next node. */
@@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
         if ( node == first_node )
         {
             /* When we have tried all in nodemask, we fall back to others. */
-            if ( nodemask_retry++ )
+            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
                 goto not_found;
             nodes_andnot(nodemask, node_online_map, nodemask);
             first_node = node = first_node(nodemask);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -2,6 +2,7 @@
 #define __X86_SETUP_H_
 
 #include <xen/multiboot.h>
+#include <asm/numa.h>
 
 extern unsigned long xenheap_initial_phys_start;
 
@@ -32,7 +33,7 @@ int construct_dom0(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline);
 
-unsigned long initial_images_nrpages(void);
+unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 
 unsigned int dom0_max_vcpus(void);



[-- Attachment #2: x86-Dom0-nodes.patch --]
[-- Type: text/plain, Size: 9181 bytes --]

x86: allow specifying the NUMA nodes Dom0 should run on

... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
"dom0_max_vcpus" ones.

Note that this gives meaning to MEMF_exact_node specified alone (i.e.
implicitly combined with NUMA_NO_NODE): In such a case any node inside
the domain's node mask is acceptable, but no other node. This changed
behavior is (implicitly) being exposed through the memop hypercalls.

Note further that this change doesn't take care of moving the initrd
image into memory matching Dom0's affinity when the initrd doesn't get
copied (because of being part of the initial mapping) anyway.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I'm uncertain whether range restricting the PXMs permitted for Dom0 is
the right approach (matching what other NUMA code did until recently),
or whether we would instead want to simply limit the number of PXMs we
can handler there (i.e. using a static array instead of a static
bitmap).

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -540,6 +540,15 @@ any dom0 autoballooning feature present 
 _xl.conf(5)_ man page or [Xen Best
 Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
 
+### dom0\_nodes
+
+> `= <integer>[,...]`
+
+Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
+and memory assigned to Dom0 will be adjusted to match the node
+restrictions set up here. Note that the values to be specified here are
+ACPI PXM ones, not Xen internal node numbers.
+
 ### dom0\_shadow
 > `= <boolean>`
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -100,11 +100,58 @@ static void __init parse_dom0_max_vcpus(
 }
 custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
 
+#define MAX_DOM0_PXM 255
+static __initdata DECLARE_BITMAP(dom0_pxms, MAX_DOM0_PXM + 1);
+
+static void __init parse_dom0_nodes(const char *s)
+{
+    do {
+        unsigned int pxm = simple_strtoul(s, &s, 0);
+
+        if ( pxm <= MAX_DOM0_PXM )
+            __set_bit(pxm, dom0_pxms);
+    } while ( *s++ == ',' );
+}
+custom_param("dom0_nodes", parse_dom0_nodes);
+
+static cpumask_t __initdata dom0_cpus;
+
+static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
+                                      unsigned int cpu)
+{
+    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
+
+    if ( v )
+    {
+        if ( !d->is_pinned )
+            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
+        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
+    }
+
+    return v;
+}
+
+static nodemask_t __initdata dom0_nodes;
+
 unsigned int __init dom0_max_vcpus(void)
 {
-    unsigned max_vcpus;
+    unsigned int pxm, max_vcpus;
+    nodeid_t node;
+
+    for ( pxm = 0; pxm <= MAX_DOM0_PXM; ++pxm )
+        if ( test_bit(pxm, dom0_pxms) &&
+             (node = pxm_to_node(pxm)) != NUMA_NO_NODE )
+            node_set(node, dom0_nodes);
+    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
+    if ( nodes_empty(dom0_nodes) )
+        dom0_nodes = node_online_map;
+    for_each_node_mask ( node, dom0_nodes )
+        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
+    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
+    if ( cpumask_empty(&dom0_cpus) )
+        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
 
-    max_vcpus = num_cpupool_cpus(cpupool0);
+    max_vcpus = cpumask_weight(&dom0_cpus);
     if ( opt_dom0_max_vcpus_min > max_vcpus )
         max_vcpus = opt_dom0_max_vcpus_min;
     if ( opt_dom0_max_vcpus_max < max_vcpus )
@@ -119,12 +166,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
 {
     unsigned int max_vcpus = dom0_max_vcpus();
 
+    dom0->node_affinity = dom0_nodes;
+    dom0->auto_node_affinity = !!bitmap_empty(dom0_pxms, MAX_DOM0_PXM + 1);
+
     dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
     if ( !dom0->vcpu )
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return alloc_vcpu(dom0, 0, 0);
+    return setup_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -156,7 +206,7 @@ static struct page_info * __init alloc_c
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma;
+    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -190,7 +240,7 @@ static struct page_info * __init alloc_c
 
         if ( d->tot_pages + (1 << order) > d->max_pages )
             continue;
-        pg2 = alloc_domheap_pages(d, order, 0);
+        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
         if ( pg2 > page )
         {
             free_domheap_pages(page, free_order);
@@ -217,10 +267,14 @@ static unsigned long __init dom0_paging_
 static unsigned long __init compute_dom0_nr_pages(
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
-    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
-    unsigned long nr_pages, min_pages, max_pages;
+    nodeid_t node;
+    unsigned long avail = 0, nr_pages, min_pages, max_pages;
     bool_t need_paging;
 
+    for_each_node_mask ( node, dom0_nodes )
+        avail += avail_domheap_pages_region(node, 0, 0) +
+                 initial_images_nrpages(node);
+
     /* Reserve memory for further dom0 vcpu-struct allocations... */
     avail -= (d->max_vcpus - 1UL)
              << get_order_from_bytes(sizeof(struct vcpu));
@@ -1230,11 +1284,11 @@ int __init construct_dom0(
 
     printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
 
-    cpu = cpumask_first(cpupool0->cpu_valid);
+    cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
-        (void)alloc_vcpu(d, i, cpu);
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_vcpu(d, i, cpu);
     }
 
     /*
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
-unsigned long __init initial_images_nrpages(void)
+unsigned long __init initial_images_nrpages(nodeid_t node)
 {
+    unsigned long node_start = node_start_pfn(node);
+    unsigned long node_end = node_end_pfn(node);
     unsigned long nr;
     unsigned int i;
 
     for ( nr = i = 0; i < nr_initial_images; ++i )
-        nr += PFN_UP(initial_images[i].mod_end);
+    {
+        unsigned long start = initial_images[i].mod_start;
+        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
+
+        if ( end > node_start && node_end > start )
+            nr += min(node_end, end) - max(node_start, start);
+    }
 
     return nr;
 }
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
     struct domain *d)
 {
     unsigned int i, j, zone = 0, nodemask_retry = 0;
-    nodeid_t first_node, node = MEMF_get_node(memflags);
+    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
@@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
 
     if ( node == NUMA_NO_NODE )
     {
-        memflags &= ~MEMF_exact_node;
         if ( d != NULL )
         {
             node = next_node(d->last_alloc_node, nodemask);
@@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
                     goto found;
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
-        if ( memflags & MEMF_exact_node )
+        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
             goto not_found;
 
         /* Pick next node. */
@@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
         if ( node == first_node )
         {
             /* When we have tried all in nodemask, we fall back to others. */
-            if ( nodemask_retry++ )
+            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
                 goto not_found;
             nodes_andnot(nodemask, node_online_map, nodemask);
             first_node = node = first_node(nodemask);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -2,6 +2,7 @@
 #define __X86_SETUP_H_
 
 #include <xen/multiboot.h>
+#include <asm/numa.h>
 
 extern unsigned long xenheap_initial_phys_start;
 
@@ -32,7 +33,7 @@ int construct_dom0(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline);
 
-unsigned long initial_images_nrpages(void);
+unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 
 unsigned int dom0_max_vcpus(void);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
@ 2015-02-26 17:14   ` Dario Faggioli
  2015-02-27  8:46     ` Jan Beulich
  2015-03-03  9:59   ` Ian Campbell
  2015-03-05 16:11   ` Andrew Cooper
  2 siblings, 1 reply; 51+ messages in thread
From: Dario Faggioli @ 2015-02-26 17:14 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 3168 bytes --]

On Thu, 2015-02-26 at 13:52 +0000, Jan Beulich wrote:
> ... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
> "dom0_max_vcpus" ones.
> 
> Note that this gives meaning to MEMF_exact_node specified alone (i.e.
> implicitly combined with NUMA_NO_NODE): In such a case any node inside
> the domain's node mask is acceptable, but no other node. This changed
> behavior is (implicitly) being exposed through the memop hypercalls.
> 
> Note further that this change doesn't take care of moving the initrd
> image into memory matching Dom0's affinity when the initrd doesn't get
> copied (because of being part of the initial mapping) anyway.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

Just a couple of questions/comments.

> ---
> I'm uncertain whether range restricting the PXMs permitted for Dom0 is
> the right approach (matching what other NUMA code did until recently),
> or whether we would instead want to simply limit the number of PXMs we
> can handler there (i.e. using a static array instead of a static
> bitmap).
> 
FWIW, I think the approach taken in the patch is ok.

> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -540,6 +540,15 @@ any dom0 autoballooning feature present 
>  _xl.conf(5)_ man page or [Xen Best
>  Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
>  
> +### dom0\_nodes
> +
> +> `= <integer>[,...]`
> +
> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
> +and memory assigned to Dom0 will be adjusted to match the node
> +restrictions set up here. Note that the values to be specified here are
> +ACPI PXM ones, not Xen internal node numbers.
> +
Why use PXM ids? It might be me being much more used to work with NUMA
node ids, but wouldn't the other way round be more consistent (almost
everything the user interacts with after boot speak node ids) and easier
for the user to figure things out (e.g., with tools like numactl on
baremetal)?

> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c

> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
> +                                      unsigned int cpu)
> +{
> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
> +
> +    if ( v )
> +    {
> +        if ( !d->is_pinned )
> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
> +    }
> +
About this, for DomUs, now that we have soft affinity available, what we
do is set only soft affinity to match the NUMA placement. I think I see
and agree why we want to be 'more strict' in Dom0, but I felt like it
was worth to point out the difference in behaviour (should it be
documented somewhere?).

Regards,
Dario

BTW, mostly out of curiosity, I've had a few strange issues/conflicts in
applying this on top of staging, in order to test it... Was it me doing
something very stupid, or was this based on something different?

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-26 17:14   ` Dario Faggioli
@ 2015-02-27  8:46     ` Jan Beulich
  2015-02-27 10:04       ` Dario Faggioli
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-02-27  8:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 26.02.15 at 18:14, <dario.faggioli@citrix.com> wrote:
> On Thu, 2015-02-26 at 13:52 +0000, Jan Beulich wrote:
>> +### dom0\_nodes
>> +
>> +> `= <integer>[,...]`
>> +
>> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
>> +and memory assigned to Dom0 will be adjusted to match the node
>> +restrictions set up here. Note that the values to be specified here are
>> +ACPI PXM ones, not Xen internal node numbers.
>> +
> Why use PXM ids? It might be me being much more used to work with NUMA
> node ids, but wouldn't the other way round be more consistent (almost
> everything the user interacts with after boot speak node ids) and easier
> for the user to figure things out (e.g., with tools like numactl on
> baremetal)?

This way behavior doesn't change if internally in the hypervisor we
need to change the mapping from PXMs to node IDs.

>> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
>> +                                      unsigned int cpu)
>> +{
>> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
>> +
>> +    if ( v )
>> +    {
>> +        if ( !d->is_pinned )
>> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
>> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
>> +    }
>> +
> About this, for DomUs, now that we have soft affinity available, what we
> do is set only soft affinity to match the NUMA placement. I think I see
> and agree why we want to be 'more strict' in Dom0, but I felt like it
> was worth to point out the difference in behaviour (should it be
> documented somewhere?).

I'm simply adjusting what sched_init_vcpu() did, which is alter
hard affinity conditionally on is_pinned and soft affinity
unconditionally.

> BTW, mostly out of curiosity, I've had a few strange issues/conflicts in
> applying this on top of staging, in order to test it... Was it me doing
> something very stupid, or was this based on something different?

Apart from the one patch named in the cover letter there shouldn't
be any other dependencies. Without you naming the issues you
encountered, I can't tell.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-27  8:46     ` Jan Beulich
@ 2015-02-27 10:04       ` Dario Faggioli
  2015-02-27 10:50         ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 10:04 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 5415 bytes --]

On Fri, 2015-02-27 at 08:46 +0000, Jan Beulich wrote:
> >>> On 26.02.15 at 18:14, <dario.faggioli@citrix.com> wrote:
> > On Thu, 2015-02-26 at 13:52 +0000, Jan Beulich wrote:
> >> +### dom0\_nodes
> >> +
> >> +> `= <integer>[,...]`
> >> +
> >> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
> >> +and memory assigned to Dom0 will be adjusted to match the node
> >> +restrictions set up here. Note that the values to be specified here are
> >> +ACPI PXM ones, not Xen internal node numbers.
> >> +
> > Why use PXM ids? It might be me being much more used to work with NUMA
> > node ids, but wouldn't the other way round be more consistent (almost
> > everything the user interacts with after boot speak node ids) and easier
> > for the user to figure things out (e.g., with tools like numactl on
> > baremetal)?
> 
> This way behavior doesn't change if internally in the hypervisor we
> need to change the mapping from PXMs to node IDs.
> 
Ok, I see the value of this. I'm still a bit concerned about the fact
that everything else "speak" NUMA node, but it's probably just me being
much more used to that than to PXMs. :-)

> >> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
> >> +                                      unsigned int cpu)
> >> +{
> >> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
> >> +
> >> +    if ( v )
> >> +    {
> >> +        if ( !d->is_pinned )
> >> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
> >> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
> >> +    }
> >> +
> > About this, for DomUs, now that we have soft affinity available, what we
> > do is set only soft affinity to match the NUMA placement. I think I see
> > and agree why we want to be 'more strict' in Dom0, but I felt like it
> > was worth to point out the difference in behaviour (should it be
> > documented somewhere?).
> 
> I'm simply adjusting what sched_init_vcpu() did, which is alter
> hard affinity conditionally on is_pinned and soft affinity
> unconditionally.
> 
Ok, I understand the idea behing this better now, thanks.

So, with the following boot command line (i.e., with 'dom0_vcpus_pin'):
com1=115200,8n1 dom0_vcpus_pin dom0_nodes=1 sched=credit noreboot dom0_mem=512M,max:512M dom0_max_vcpus=4 console=com1

I get this:
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    8   -b-       4.4  8 / 8-15
Domain-0                             0     1    9   -b-       4.0  9 / 8-15
Domain-0                             0     2   10   r--       4.1  10 / 8-15
Domain-0                             0     3   11   -b-       3.6  11 / 8-15

With the following boot command line (i.e., without 'dom0_vcpus_pin'):
com1=115200,8n1 dom0_nodes=1 sched=credit noreboot dom0_mem=512M,max:512M dom0_max_vcpus=4 console=com1

I get this:
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
Domain-0                             0     0    8   r--       4.3  8-15 / 8-15
Domain-0                             0     1   12   -b-       2.9  8-15 / 8-15
Domain-0                             0     2   10   r--       3.5  8-15 / 8-15
Domain-0                             0     3   14   -b-       2.9  8-15 / 8-15

Setting soft affinity as a superset of (in the former case) or equal to
(in the latter) hard affinity is just pure overhead, when in the
scheduler.

In fact, if the scheduler sees that soft affinity is defined, it will go
through the load balancing/vcpu placement logic twice, the first time
using the soft affinity mask, the second using the hard affinity one.
Actually, the first time it uses 'soft & hard', which in these cases is
exactly equal to hard, and that's why I'm calling this pure overhead.

I probably should add checks in the scheduler to identify such
situations as "no need to consider soft affinity". I thought about this
before, but didn't do that because it's a more cpumask_foo() fiddling in
a few hot paths... but of course I can check for the relationship
between hard and soft affinity masks upfront, cache the result in a
bool_t, and use _that_ in hot paths... what do you think?

All this being said, I still would avoid putting the system in a
configuration where soft is superset or equal to hard, at the very least
not automatically, as I think it can appear confusing to the user (the
user himself can, of course, do that after boot, for Dom0 or DomUs, but
that's another story, I think). So I'm now thinking whether it wouldn't
be better to, in this patch, leave soft affinity alone completely.

Then, if we want to make it possible to tweak soft affinity, we can
allow for something like "dom0_nodes=soft:1,3" and, in that case, alter
soft affinity only.

> > BTW, mostly out of curiosity, I've had a few strange issues/conflicts in
> > applying this on top of staging, in order to test it... Was it me doing
> > something very stupid, or was this based on something different?
> 
> Apart from the one patch named in the cover letter there shouldn't
> be any other dependencies. Without you naming the issues you
> encountered, I can't tell.
> 
I see. Never mind then, maybe I messed up with my various branches...
Sorry for bothering with this. :-)

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-27 10:04       ` Dario Faggioli
@ 2015-02-27 10:50         ` Jan Beulich
  2015-02-27 14:54           ` Dario Faggioli
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-02-27 10:50 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), IanCampbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 11:04, <dario.faggioli@citrix.com> wrote:
> On Fri, 2015-02-27 at 08:46 +0000, Jan Beulich wrote:
>> >>> On 26.02.15 at 18:14, <dario.faggioli@citrix.com> wrote:
>> > On Thu, 2015-02-26 at 13:52 +0000, Jan Beulich wrote:
>> >> +### dom0\_nodes
>> >> +
>> >> +> `= <integer>[,...]`
>> >> +
>> >> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
>> >> +and memory assigned to Dom0 will be adjusted to match the node
>> >> +restrictions set up here. Note that the values to be specified here are
>> >> +ACPI PXM ones, not Xen internal node numbers.
>> >> +
>> > Why use PXM ids? It might be me being much more used to work with NUMA
>> > node ids, but wouldn't the other way round be more consistent (almost
>> > everything the user interacts with after boot speak node ids) and easier
>> > for the user to figure things out (e.g., with tools like numactl on
>> > baremetal)?
>> 
>> This way behavior doesn't change if internally in the hypervisor we
>> need to change the mapping from PXMs to node IDs.
>> 
> Ok, I see the value of this. I'm still a bit concerned about the fact
> that everything else "speak" NUMA node, but it's probably just me being
> much more used to that than to PXMs. :-)

With "everything else" I suppose you mean the tool stack? There
shouldn't be any node IDs kept across reboots there. Yet the
consistent behavior to be achieved here is particularly for multiple
boots.

>> >> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
>> >> +                                      unsigned int cpu)
>> >> +{
>> >> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
>> >> +
>> >> +    if ( v )
>> >> +    {
>> >> +        if ( !d->is_pinned )
>> >> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
>> >> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
>> >> +    }
>> >> +
>> > About this, for DomUs, now that we have soft affinity available, what we
>> > do is set only soft affinity to match the NUMA placement. I think I see
>> > and agree why we want to be 'more strict' in Dom0, but I felt like it
>> > was worth to point out the difference in behaviour (should it be
>> > documented somewhere?).
>> 
>> I'm simply adjusting what sched_init_vcpu() did, which is alter
>> hard affinity conditionally on is_pinned and soft affinity
>> unconditionally.
>> 
> Ok, I understand the idea behing this better now, thanks.
> [...]
> Setting soft affinity as a superset of (in the former case) or equal to
> (in the latter) hard affinity is just pure overhead, when in the
> scheduler.

The why does sched_init_vcpu() do what it does? If you want to
alter that, I'm fine with altering it here.

> In fact, if the scheduler sees that soft affinity is defined, it will go
> through the load balancing/vcpu placement logic twice, the first time
> using the soft affinity mask, the second using the hard affinity one.
> Actually, the first time it uses 'soft & hard', which in these cases is
> exactly equal to hard, and that's why I'm calling this pure overhead.
> 
> I probably should add checks in the scheduler to identify such
> situations as "no need to consider soft affinity". I thought about this
> before, but didn't do that because it's a more cpumask_foo() fiddling in
> a few hot paths... but of course I can check for the relationship
> between hard and soft affinity masks upfront, cache the result in a
> bool_t, and use _that_ in hot paths... what do you think?

Avoiding the fiddling in hot paths is surely desirable. But it would
indeed seem even better to avoid the inefficiency in the first place
(i.e. when storing affinities).

> All this being said, I still would avoid putting the system in a
> configuration where soft is superset or equal to hard, at the very least
> not automatically, as I think it can appear confusing to the user (the
> user himself can, of course, do that after boot, for Dom0 or DomUs, but
> that's another story, I think). So I'm now thinking whether it wouldn't
> be better to, in this patch, leave soft affinity alone completely.
> 
> Then, if we want to make it possible to tweak soft affinity, we can
> allow for something like "dom0_nodes=soft:1,3" and, in that case, alter
> soft affinity only.

Hmm, not sure. And I keep being confused whether soft means
"allow" and hard means "prefer" or the other way around. In any
event, again, with sched_init_vcpu() setting up things so that
soft is a superset of hard (and most likely they're equal), I don't
see why the same done here would be more of a problem.

>> > BTW, mostly out of curiosity, I've had a few strange issues/conflicts in
>> > applying this on top of staging, in order to test it... Was it me doing
>> > something very stupid, or was this based on something different?
>> 
>> Apart from the one patch named in the cover letter there shouldn't
>> be any other dependencies. Without you naming the issues you
>> encountered, I can't tell.
>> 
> I see. Never mind then, maybe I messed up with my various branches...
> Sorry for bothering with this. :-)

No reason to be sorry - I'm more than happy if inconsistencies get
pointed out before trying to commit anything.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-27 10:50         ` Jan Beulich
@ 2015-02-27 14:54           ` Dario Faggioli
  2015-02-27 15:04             ` Jan Beulich
  2015-03-03 10:51             ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 14:54 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 3015 bytes --]

On Fri, 2015-02-27 at 10:50 +0000, Jan Beulich wrote:
> >>> On 27.02.15 at 11:04, <dario.faggioli@citrix.com> wrote:
> > On Fri, 2015-02-27 at 08:46 +0000, Jan Beulich wrote:

> >> This way behavior doesn't change if internally in the hypervisor we
> >> need to change the mapping from PXMs to node IDs.
> >> 
> > Ok, I see the value of this. I'm still a bit concerned about the fact
> > that everything else "speak" NUMA node, but it's probably just me being
> > much more used to that than to PXMs. :-)
> 
> With "everything else" I suppose you mean the tool stack? There
> shouldn't be any node IDs kept across reboots there. Yet the
> consistent behavior to be achieved here is particularly for multiple
> boots.
> 
Sure. I was more thinking to inconsistency "in the user mind", as he'll
have to deal with PXM when configuring Dom0, and with node IDs after
boot... but again, maybe it's only me.

> >> I'm simply adjusting what sched_init_vcpu() did, which is alter
> >> hard affinity conditionally on is_pinned and soft affinity
> >> unconditionally.
> >> 
> > Ok, I understand the idea behing this better now, thanks.
> > [...]
> > Setting soft affinity as a superset of (in the former case) or equal to
> > (in the latter) hard affinity is just pure overhead, when in the
> > scheduler.
> 
> The why does sched_init_vcpu() do what it does? If you want to
> alter that, I'm fine with altering it here.
> 
It does that, but, in there, soft affinity is unconditionally set to
'all bits set'. Then, in the scheduler, if we find out that the the soft
affinity mask is fully set, we just skip the soft affinity balancing
step.

The idea is that, whether the mask is full because no one touched this
default, or because it has been manually set like that, there is nothing
to do at the soft affinity balancing level.

So, you actually are right: rather that not touch soft affinity, as I
said in the previous email, I think we should set hard affinity
conditionally to is_pinned, as in the patch, and then unconditionally
set soft affinity to all, as in sched_init_vcpu().

> > Then, if we want to make it possible to tweak soft affinity, we can
> > allow for something like "dom0_nodes=soft:1,3" and, in that case, alter
> > soft affinity only.
> 
> Hmm, not sure. And I keep being confused whether soft means
> "allow" and hard means "prefer" or the other way around. 
>
"hard" means allow (or not allow)
"soft" means prefer

> In any
> event, again, with sched_init_vcpu() setting up things so that
> soft is a superset of hard (and most likely they're equal), I don't
> see why the same done here would be more of a problem.
> 
Indeed, sorry, my bad. When talking about soft being superset, I forgot
to mention the sort of special casing we are granting to the case when
soft mask is all set.

Using cpumask_setall here, as done in sched_init_vcpu(), would avoid
incurring in the pointless soft affinity balancing overhead.

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-27 14:54           ` Dario Faggioli
@ 2015-02-27 15:04             ` Jan Beulich
  2015-03-03 10:51             ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-27 15:04 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), IanCampbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 15:54, <dario.faggioli@citrix.com> wrote:
> On Fri, 2015-02-27 at 10:50 +0000, Jan Beulich wrote:
>> >>> On 27.02.15 at 11:04, <dario.faggioli@citrix.com> wrote:
>> > On Fri, 2015-02-27 at 08:46 +0000, Jan Beulich wrote:
>> >> I'm simply adjusting what sched_init_vcpu() did, which is alter
>> >> hard affinity conditionally on is_pinned and soft affinity
>> >> unconditionally.
>> >> 
>> > Ok, I understand the idea behing this better now, thanks.
>> > [...]
>> > Setting soft affinity as a superset of (in the former case) or equal to
>> > (in the latter) hard affinity is just pure overhead, when in the
>> > scheduler.
>> 
>> The why does sched_init_vcpu() do what it does? If you want to
>> alter that, I'm fine with altering it here.
>> 
> It does that, but, in there, soft affinity is unconditionally set to
> 'all bits set'. Then, in the scheduler, if we find out that the the soft
> affinity mask is fully set, we just skip the soft affinity balancing
> step.
> 
> The idea is that, whether the mask is full because no one touched this
> default, or because it has been manually set like that, there is nothing
> to do at the soft affinity balancing level.
> 
> So, you actually are right: rather that not touch soft affinity, as I
> said in the previous email, I think we should set hard affinity
> conditionally to is_pinned, as in the patch, and then unconditionally
> set soft affinity to all, as in sched_init_vcpu().

I.e. effectively not touching it anyway (because just before it
got set to "all" by sched_init_vcpu()). I guess instead of
removing the line, I'll put it in a comment.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-27 14:54           ` Dario Faggioli
  2015-02-27 15:04             ` Jan Beulich
@ 2015-03-03 10:51             ` Jan Beulich
  2015-03-04 10:18               ` Dario Faggioli
  2015-03-06  9:11               ` Jan Beulich
  1 sibling, 2 replies; 51+ messages in thread
From: Jan Beulich @ 2015-03-03 10:51 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), IanCampbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 15:54, <dario.faggioli@citrix.com> wrote:
> On Fri, 2015-02-27 at 10:50 +0000, Jan Beulich wrote:
>> >>> On 27.02.15 at 11:04, <dario.faggioli@citrix.com> wrote:
>> > On Fri, 2015-02-27 at 08:46 +0000, Jan Beulich wrote:
>> >> I'm simply adjusting what sched_init_vcpu() did, which is alter
>> >> hard affinity conditionally on is_pinned and soft affinity
>> >> unconditionally.
>> >> 
>> > Ok, I understand the idea behing this better now, thanks.
>> > [...]
>> > Setting soft affinity as a superset of (in the former case) or equal to
>> > (in the latter) hard affinity is just pure overhead, when in the
>> > scheduler.
>> 
>> The why does sched_init_vcpu() do what it does? If you want to
>> alter that, I'm fine with altering it here.
>> 
> It does that, but, in there, soft affinity is unconditionally set to
> 'all bits set'. Then, in the scheduler, if we find out that the the soft
> affinity mask is fully set, we just skip the soft affinity balancing
> step.
> 
> The idea is that, whether the mask is full because no one touched this
> default, or because it has been manually set like that, there is nothing
> to do at the soft affinity balancing level.

In that case I think __vcpu_has_soft_affinity() simply isn't general
enough: Along with checking whether all bits are set in the
soft affinity, it should also check whether soft is a subset of hard
(or the passed in second mask). And really it should imo also cover
the case where not all bits are set in the mask, but all those
corresponding to online CPUs (both of which ought to have the
same effect), i.e. whether online CPUs are a subset of soft:

static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
                                           const cpumask_t *mask)
{
    return !cpumask_subset(&online_cpu_map, vc->cpu_soft_affinity)
           && !cpumask_subset(vc->cpu_soft_affinity, vc->cpu_hard_affinity)
           && cpumask_intersects(vc->cpu_soft_affinity, mask);
}

That would then leave introducing a "relaxed (or "strict",
depending on what we'd like to be the default) mode in the patch
here, controlling whether ->cpu_hard_affinity gets overridden
(and we'd always override ->cpu_soft_affinity).

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-03 10:51             ` Jan Beulich
@ 2015-03-04 10:18               ` Dario Faggioli
  2015-03-06  9:11               ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-03-04 10:18 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1137 bytes --]

On Tue, 2015-03-03 at 10:51 +0000, Jan Beulich wrote:
> >>> On 27.02.15 at 15:54, <dario.faggioli@citrix.com> wrote:

> > The idea is that, whether the mask is full because no one touched this
> > default, or because it has been manually set like that, there is nothing
> > to do at the soft affinity balancing level.
> 
> In that case I think __vcpu_has_soft_affinity() simply isn't general
> enough: Along with checking whether all bits are set in the
> soft affinity, it should also check whether soft is a subset of hard
> (or the passed in second mask). And really it should imo also cover
> the case where not all bits are set in the mask, but all those
> corresponding to online CPUs (both of which ought to have the
> same effect)
>
I'm fine with this.

> That would then leave introducing a "relaxed (or "strict",
> depending on what we'd like to be the default) mode in the patch
> here, controlling whether ->cpu_hard_affinity gets overridden
> (and we'd always override ->cpu_soft_affinity).
> 
And with this too... I'll comment the code in the other email, the one
with the patch.

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-03 10:51             ` Jan Beulich
  2015-03-04 10:18               ` Dario Faggioli
@ 2015-03-06  9:11               ` Jan Beulich
  2015-03-06 10:46                 ` Dario Faggioli
  1 sibling, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-03-06  9:11 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), IanCampbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 03.03.15 at 11:51, <JBeulich@suse.com> wrote:
> That would then leave introducing a "relaxed (or "strict",
> depending on what we'd like to be the default) mode in the patch
> here, controlling whether ->cpu_hard_affinity gets overridden
> (and we'd always override ->cpu_soft_affinity).

Having implemented this "relaxed" addition (patch to be posted after
a few more tests), I find that with Dom0 being restricted to half of
the nodes of the test system, soft affinity set to that set, and hard
affinity left set to "all", many Dom0 vCPU-s nevertheless run on the
CPUs not in its soft affinity (and there's no other load on the system).
Is there a bug in that (credit) scheduler logic somewhere?

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06  9:11               ` Jan Beulich
@ 2015-03-06 10:46                 ` Dario Faggioli
  2015-03-06 11:33                   ` Dario Faggioli
  2015-03-06 11:49                   ` Jan Beulich
  0 siblings, 2 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-03-06 10:46 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1340 bytes --]

On Fri, 2015-03-06 at 09:11 +0000, Jan Beulich wrote:
> >>> On 03.03.15 at 11:51, <JBeulich@suse.com> wrote:
> > That would then leave introducing a "relaxed (or "strict",
> > depending on what we'd like to be the default) mode in the patch
> > here, controlling whether ->cpu_hard_affinity gets overridden
> > (and we'd always override ->cpu_soft_affinity).
> 
> Having implemented this "relaxed" addition (patch to be posted after
> a few more tests), I find that with Dom0 being restricted to half of
> the nodes of the test system, soft affinity set to that set, and hard
> affinity left set to "all", many Dom0 vCPU-s nevertheless run on the
> CPUs not in its soft affinity (and there's no other load on the system).
>
I saw the patch and just povided my Reviewed-by, as it looked good to me
(relaxed mode included).

> Is there a bug in that (credit) scheduler logic somewhere?
> 
There may be, of course, but nothing showed up during testing and
benchmarking the feature. It's true that I probably concentrate mostly
on DomU (especially while benchmarking), but it worked for me, and
numbers from benchmarks confirmed that.

Anyway, let me have a look!

One question, was there any other load in the system, especially on the
pCPUs with which Dom0's vCPUs have soft affinity?

Thanks and Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06 10:46                 ` Dario Faggioli
@ 2015-03-06 11:33                   ` Dario Faggioli
  2015-03-06 13:26                     ` Jan Beulich
  2015-03-06 11:49                   ` Jan Beulich
  1 sibling, 1 reply; 51+ messages in thread
From: Dario Faggioli @ 2015-03-06 11:33 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1885 bytes --]

On Fri, 2015-03-06 at 10:46 +0000, Dario Faggioli wrote:
> On Fri, 2015-03-06 at 09:11 +0000, Jan Beulich wrote:
> > >>> On 03.03.15 at 11:51, <JBeulich@suse.com> wrote:

> > Having implemented this "relaxed" addition (patch to be posted after
> > a few more tests), I find that with Dom0 being restricted to half of
> > the nodes of the test system, soft affinity set to that set, and hard
> > affinity left set to "all", many Dom0 vCPU-s nevertheless run on the
> > CPUs not in its soft affinity (and there's no other load on the system).
>
Oh, BTW, in my previous email I was asking about system load because,
while replying, I missed this: "there's no other load on the system"...
sorry! :-/

> > Is there a bug in that (credit) scheduler logic somewhere?
> > 
> There may be, of course, but nothing showed up during testing and
> benchmarking the feature. It's true that I probably concentrate mostly
> on DomU (especially while benchmarking), but it worked for me, and
> numbers from benchmarks confirmed that.
>
I'm testing soft affinity for Dom0 _without_ your patches, i.e., I'm
just setting soft affinity for Dom0's vCPUs after boot (hard affinity
set to "all") and looking at where they executes, both with and without
other load, and results look consistent to me.

I see Dom0 vCPUs executing almost only on pCPUs from their soft affinity
set, at least all the times that this is possible. If I generate other
vCPU load aimed at kicking them away from there, they do go away, but
they come back to such set as soon as the load disappears.

So, soft affinity per-se seems to be working for me.

I'll now apply your patch and see whether that changes thing (seems
unlikely, though).

If you want me to try replicate some specific testing scenario, feel
free to provide more details about it and I'll give it a go. :-)

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06 11:33                   ` Dario Faggioli
@ 2015-03-06 13:26                     ` Jan Beulich
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-03-06 13:26 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	IanJackson, xen-devel@lists.xenproject.org

>>> On 06.03.15 at 12:33, <raistlin.df@gmail.com> wrote:
> I'll now apply your patch and see whether that changes thing (seems
> unlikely, though).

So with the change done that George pointed out is needed via
his comment change request the anomaly is gone. I have to
admit though that I can't see why that inversion caused the
anomaly.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06 10:46                 ` Dario Faggioli
  2015-03-06 11:33                   ` Dario Faggioli
@ 2015-03-06 11:49                   ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-03-06 11:49 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), IanCampbell, Andrew Cooper, Tim (Xen.org),
	Ian Jackson, xen-devel@lists.xenproject.org

>>> On 06.03.15 at 11:46, <dario.faggioli@citrix.com> wrote:
> One question, was there any other load in the system, especially on the
> pCPUs with which Dom0's vCPUs have soft affinity?

As said, there was no other load on the system (no DomU even
present).

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
  2015-02-26 17:14   ` Dario Faggioli
@ 2015-03-03  9:59   ` Ian Campbell
  2015-03-05 16:11   ` Andrew Cooper
  2 siblings, 0 replies; 51+ messages in thread
From: Ian Campbell @ 2015-03-03  9:59 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Keir Fraser, Tim Deegan, Dario Faggioli, Ian Jackson,
	Andrew Cooper, xen-devel

On Thu, 2015-02-26 at 13:52 +0000, Jan Beulich wrote:
> ... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
> "dom0_max_vcpus" ones.
> 
> Note that this gives meaning to MEMF_exact_node specified alone (i.e.
> implicitly combined with NUMA_NO_NODE): In such a case any node inside
> the domain's node mask is acceptable, but no other node. This changed
> behavior is (implicitly) being exposed through the memop hypercalls.
> 
> Note further that this change doesn't take care of moving the initrd
> image into memory matching Dom0's affinity when the initrd doesn't get
> copied (because of being part of the initial mapping) anyway.
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

For the ARM/generic aspect:
Acked-by: Ian Campbell <ian.campbell@citrix.com>

> ---
> I'm uncertain whether range restricting the PXMs permitted for Dom0 is
> the right approach (matching what other NUMA code did until recently),
> or whether we would instead want to simply limit the number of PXMs we
> can handler there (i.e. using a static array instead of a static
> bitmap).
> 
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -540,6 +540,15 @@ any dom0 autoballooning feature present 
>  _xl.conf(5)_ man page or [Xen Best
>  Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
>  
> +### dom0\_nodes
> +
> +> `= <integer>[,...]`
> +
> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
> +and memory assigned to Dom0 will be adjusted to match the node
> +restrictions set up here. Note that the values to be specified here are
> +ACPI PXM ones, not Xen internal node numbers.
> +
>  ### dom0\_shadow
>  > `= <boolean>`
>  
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -100,11 +100,58 @@ static void __init parse_dom0_max_vcpus(
>  }
>  custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
>  
> +#define MAX_DOM0_PXM 255
> +static __initdata DECLARE_BITMAP(dom0_pxms, MAX_DOM0_PXM + 1);
> +
> +static void __init parse_dom0_nodes(const char *s)
> +{
> +    do {
> +        unsigned int pxm = simple_strtoul(s, &s, 0);
> +
> +        if ( pxm <= MAX_DOM0_PXM )
> +            __set_bit(pxm, dom0_pxms);
> +    } while ( *s++ == ',' );
> +}
> +custom_param("dom0_nodes", parse_dom0_nodes);
> +
> +static cpumask_t __initdata dom0_cpus;
> +
> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
> +                                      unsigned int cpu)
> +{
> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
> +
> +    if ( v )
> +    {
> +        if ( !d->is_pinned )
> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
> +    }
> +
> +    return v;
> +}
> +
> +static nodemask_t __initdata dom0_nodes;
> +
>  unsigned int __init dom0_max_vcpus(void)
>  {
> -    unsigned max_vcpus;
> +    unsigned int pxm, max_vcpus;
> +    nodeid_t node;
> +
> +    for ( pxm = 0; pxm <= MAX_DOM0_PXM; ++pxm )
> +        if ( test_bit(pxm, dom0_pxms) &&
> +             (node = pxm_to_node(pxm)) != NUMA_NO_NODE )
> +            node_set(node, dom0_nodes);
> +    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
> +    if ( nodes_empty(dom0_nodes) )
> +        dom0_nodes = node_online_map;
> +    for_each_node_mask ( node, dom0_nodes )
> +        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
> +    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
> +    if ( cpumask_empty(&dom0_cpus) )
> +        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
>  
> -    max_vcpus = num_cpupool_cpus(cpupool0);
> +    max_vcpus = cpumask_weight(&dom0_cpus);
>      if ( opt_dom0_max_vcpus_min > max_vcpus )
>          max_vcpus = opt_dom0_max_vcpus_min;
>      if ( opt_dom0_max_vcpus_max < max_vcpus )
> @@ -119,12 +166,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
>  {
>      unsigned int max_vcpus = dom0_max_vcpus();
>  
> +    dom0->node_affinity = dom0_nodes;
> +    dom0->auto_node_affinity = !!bitmap_empty(dom0_pxms, MAX_DOM0_PXM + 1);
> +
>      dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
>      if ( !dom0->vcpu )
>          return NULL;
>      dom0->max_vcpus = max_vcpus;
>  
> -    return alloc_vcpu(dom0, 0, 0);
> +    return setup_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
>  }
>  
>  #ifdef CONFIG_SHADOW_PAGING
> @@ -156,7 +206,7 @@ static struct page_info * __init alloc_c
>      struct domain *d, unsigned long max_pages)
>  {
>      static unsigned int __initdata last_order = MAX_ORDER;
> -    static unsigned int __initdata memflags = MEMF_no_dma;
> +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
>      struct page_info *page;
>      unsigned int order = get_order_from_pages(max_pages), free_order;
>  
> @@ -190,7 +240,7 @@ static struct page_info * __init alloc_c
>  
>          if ( d->tot_pages + (1 << order) > d->max_pages )
>              continue;
> -        pg2 = alloc_domheap_pages(d, order, 0);
> +        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
>          if ( pg2 > page )
>          {
>              free_domheap_pages(page, free_order);
> @@ -217,10 +267,14 @@ static unsigned long __init dom0_paging_
>  static unsigned long __init compute_dom0_nr_pages(
>      struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
>  {
> -    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
> -    unsigned long nr_pages, min_pages, max_pages;
> +    nodeid_t node;
> +    unsigned long avail = 0, nr_pages, min_pages, max_pages;
>      bool_t need_paging;
>  
> +    for_each_node_mask ( node, dom0_nodes )
> +        avail += avail_domheap_pages_region(node, 0, 0) +
> +                 initial_images_nrpages(node);
> +
>      /* Reserve memory for further dom0 vcpu-struct allocations... */
>      avail -= (d->max_vcpus - 1UL)
>               << get_order_from_bytes(sizeof(struct vcpu));
> @@ -1230,11 +1284,11 @@ int __init construct_dom0(
>  
>      printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
>  
> -    cpu = cpumask_first(cpupool0->cpu_valid);
> +    cpu = v->processor;
>      for ( i = 1; i < d->max_vcpus; i++ )
>      {
> -        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
> -        (void)alloc_vcpu(d, i, cpu);
> +        cpu = cpumask_cycle(cpu, &dom0_cpus);
> +        setup_vcpu(d, i, cpu);
>      }
>  
>      /*
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
>  static const module_t *__initdata initial_images;
>  static unsigned int __initdata nr_initial_images;
>  
> -unsigned long __init initial_images_nrpages(void)
> +unsigned long __init initial_images_nrpages(nodeid_t node)
>  {
> +    unsigned long node_start = node_start_pfn(node);
> +    unsigned long node_end = node_end_pfn(node);
>      unsigned long nr;
>      unsigned int i;
>  
>      for ( nr = i = 0; i < nr_initial_images; ++i )
> -        nr += PFN_UP(initial_images[i].mod_end);
> +    {
> +        unsigned long start = initial_images[i].mod_start;
> +        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
> +
> +        if ( end > node_start && node_end > start )
> +            nr += min(node_end, end) - max(node_start, start);
> +    }
>  
>      return nr;
>  }
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
>      struct domain *d)
>  {
>      unsigned int i, j, zone = 0, nodemask_retry = 0;
> -    nodeid_t first_node, node = MEMF_get_node(memflags);
> +    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
>      unsigned long request = 1UL << order;
>      struct page_info *pg;
>      nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
> @@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
>  
>      if ( node == NUMA_NO_NODE )
>      {
> -        memflags &= ~MEMF_exact_node;
>          if ( d != NULL )
>          {
>              node = next_node(d->last_alloc_node, nodemask);
> @@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
>                      goto found;
>          } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
>  
> -        if ( memflags & MEMF_exact_node )
> +        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
>              goto not_found;
>  
>          /* Pick next node. */
> @@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
>          if ( node == first_node )
>          {
>              /* When we have tried all in nodemask, we fall back to others. */
> -            if ( nodemask_retry++ )
> +            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
>                  goto not_found;
>              nodes_andnot(nodemask, node_online_map, nodemask);
>              first_node = node = first_node(nodemask);
> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -2,6 +2,7 @@
>  #define __X86_SETUP_H_
>  
>  #include <xen/multiboot.h>
> +#include <asm/numa.h>
>  
>  extern unsigned long xenheap_initial_phys_start;
>  
> @@ -32,7 +33,7 @@ int construct_dom0(
>      void *(*bootstrap_map)(const module_t *),
>      char *cmdline);
>  
> -unsigned long initial_images_nrpages(void);
> +unsigned long initial_images_nrpages(nodeid_t node);
>  void discard_initial_images(void);
>  
>  unsigned int dom0_max_vcpus(

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
  2015-02-26 17:14   ` Dario Faggioli
  2015-03-03  9:59   ` Ian Campbell
@ 2015-03-05 16:11   ` Andrew Cooper
  2015-03-05 16:43     ` Jan Beulich
  2 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 16:11 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Ian Campbell, Keir Fraser, Dario Faggioli, Ian Jackson,
	Tim Deegan

On 26/02/15 13:52, Jan Beulich wrote:
> ... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
> "dom0_max_vcpus" ones.
>
> Note that this gives meaning to MEMF_exact_node specified alone (i.e.
> implicitly combined with NUMA_NO_NODE): In such a case any node inside
> the domain's node mask is acceptable, but no other node. This changed
> behavior is (implicitly) being exposed through the memop hypercalls.
>
> Note further that this change doesn't take care of moving the initrd
> image into memory matching Dom0's affinity when the initrd doesn't get
> copied (because of being part of the initial mapping) anyway.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

A couple of comments, but Reviewed-by: Andrew Cooper
<andrew.cooper3@citrix.com>

> ---
> I'm uncertain whether range restricting the PXMs permitted for Dom0 is
> the right approach (matching what other NUMA code did until recently),
> or whether we would instead want to simply limit the number of PXMs we
> can handler there (i.e. using a static array instead of a static
> bitmap).

I am not quite sure what you mean by this.

>
> --- a/docs/misc/xen-command-line.markdown
> +++ b/docs/misc/xen-command-line.markdown
> @@ -540,6 +540,15 @@ any dom0 autoballooning feature present 
>  _xl.conf(5)_ man page or [Xen Best
>  Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
>  
> +### dom0\_nodes
> +
> +> `= <integer>[,...]`
> +
> +Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
> +and memory assigned to Dom0 will be adjusted to match the node
> +restrictions set up here. Note that the values to be specified here are
> +ACPI PXM ones, not Xen internal node numbers.
> +
>  ### dom0\_shadow
>  > `= <boolean>`
>  
> --- a/xen/arch/x86/domain_build.c
> +++ b/xen/arch/x86/domain_build.c
> @@ -100,11 +100,58 @@ static void __init parse_dom0_max_vcpus(
>  }
>  custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
>  
> +#define MAX_DOM0_PXM 255
> +static __initdata DECLARE_BITMAP(dom0_pxms, MAX_DOM0_PXM + 1);
> +
> +static void __init parse_dom0_nodes(const char *s)
> +{
> +    do {
> +        unsigned int pxm = simple_strtoul(s, &s, 0);
> +
> +        if ( pxm <= MAX_DOM0_PXM )
> +            __set_bit(pxm, dom0_pxms);
> +    } while ( *s++ == ',' );
> +}
> +custom_param("dom0_nodes", parse_dom0_nodes);
> +
> +static cpumask_t __initdata dom0_cpus;
> +
> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
> +                                      unsigned int cpu)

I would be tempted to name this setup_dom0_vcpu() to bring it in line
with other dom0 construction functions, and to make it clear that it is
not for use on general domains.

> +{
> +    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
> +
> +    if ( v )
> +    {
> +        if ( !d->is_pinned )
> +            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
> +        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
> +    }
> +
> +    return v;
> +}
> +
> +static nodemask_t __initdata dom0_nodes;
> +
>  unsigned int __init dom0_max_vcpus(void)
>  {
> -    unsigned max_vcpus;
> +    unsigned int pxm, max_vcpus;
> +    nodeid_t node;
> +
> +    for ( pxm = 0; pxm <= MAX_DOM0_PXM; ++pxm )
> +        if ( test_bit(pxm, dom0_pxms) &&
> +             (node = pxm_to_node(pxm)) != NUMA_NO_NODE )
> +            node_set(node, dom0_nodes);

for_each_set_bit() ?  It would simply the above loop slightly, and is
more efficient.

> +    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
> +    if ( nodes_empty(dom0_nodes) )
> +        dom0_nodes = node_online_map;
> +    for_each_node_mask ( node, dom0_nodes )
> +        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
> +    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
> +    if ( cpumask_empty(&dom0_cpus) )
> +        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
>  
> -    max_vcpus = num_cpupool_cpus(cpupool0);
> +    max_vcpus = cpumask_weight(&dom0_cpus);
>      if ( opt_dom0_max_vcpus_min > max_vcpus )
>          max_vcpus = opt_dom0_max_vcpus_min;
>      if ( opt_dom0_max_vcpus_max < max_vcpus )
> @@ -119,12 +166,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
>  {
>      unsigned int max_vcpus = dom0_max_vcpus();
>  
> +    dom0->node_affinity = dom0_nodes;
> +    dom0->auto_node_affinity = !!bitmap_empty(dom0_pxms, MAX_DOM0_PXM + 1);
> +
>      dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
>      if ( !dom0->vcpu )
>          return NULL;
>      dom0->max_vcpus = max_vcpus;
>  
> -    return alloc_vcpu(dom0, 0, 0);
> +    return setup_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
>  }
>  
>  #ifdef CONFIG_SHADOW_PAGING
> @@ -156,7 +206,7 @@ static struct page_info * __init alloc_c
>      struct domain *d, unsigned long max_pages)
>  {
>      static unsigned int __initdata last_order = MAX_ORDER;
> -    static unsigned int __initdata memflags = MEMF_no_dma;
> +    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
>      struct page_info *page;
>      unsigned int order = get_order_from_pages(max_pages), free_order;
>  
> @@ -190,7 +240,7 @@ static struct page_info * __init alloc_c
>  
>          if ( d->tot_pages + (1 << order) > d->max_pages )
>              continue;
> -        pg2 = alloc_domheap_pages(d, order, 0);
> +        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
>          if ( pg2 > page )
>          {
>              free_domheap_pages(page, free_order);
> @@ -217,10 +267,14 @@ static unsigned long __init dom0_paging_
>  static unsigned long __init compute_dom0_nr_pages(
>      struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
>  {
> -    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
> -    unsigned long nr_pages, min_pages, max_pages;
> +    nodeid_t node;
> +    unsigned long avail = 0, nr_pages, min_pages, max_pages;
>      bool_t need_paging;
>  
> +    for_each_node_mask ( node, dom0_nodes )
> +        avail += avail_domheap_pages_region(node, 0, 0) +
> +                 initial_images_nrpages(node);
> +
>      /* Reserve memory for further dom0 vcpu-struct allocations... */
>      avail -= (d->max_vcpus - 1UL)
>               << get_order_from_bytes(sizeof(struct vcpu));
> @@ -1230,11 +1284,11 @@ int __init construct_dom0(
>  
>      printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
>  
> -    cpu = cpumask_first(cpupool0->cpu_valid);
> +    cpu = v->processor;
>      for ( i = 1; i < d->max_vcpus; i++ )
>      {
> -        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
> -        (void)alloc_vcpu(d, i, cpu);
> +        cpu = cpumask_cycle(cpu, &dom0_cpus);
> +        setup_vcpu(d, i, cpu);

I know this is a preexisting bug, but you might want to fix it as you
are changing the affected codepath.

Construction of dom0 should fail if any alloc_vcpu() call fails (and now
setup_vcpu()).   Nothing currently catches a failure to allocate vcpu
1..$N, and this patch introduces a way for dom0 be more
memory-constrained than previously.

>      }
>  
>      /*
> --- a/xen/arch/x86/setup.c
> +++ b/xen/arch/x86/setup.c
> @@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
>  static const module_t *__initdata initial_images;
>  static unsigned int __initdata nr_initial_images;
>  
> -unsigned long __init initial_images_nrpages(void)
> +unsigned long __init initial_images_nrpages(nodeid_t node)
>  {
> +    unsigned long node_start = node_start_pfn(node);
> +    unsigned long node_end = node_end_pfn(node);
>      unsigned long nr;
>      unsigned int i;
>  
>      for ( nr = i = 0; i < nr_initial_images; ++i )
> -        nr += PFN_UP(initial_images[i].mod_end);
> +    {
> +        unsigned long start = initial_images[i].mod_start;
> +        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
> +
> +        if ( end > node_start && node_end > start )
> +            nr += min(node_end, end) - max(node_start, start);
> +    }
>  
>      return nr;
>  }
> --- a/xen/common/page_alloc.c
> +++ b/xen/common/page_alloc.c
> @@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
>      struct domain *d)
>  {
>      unsigned int i, j, zone = 0, nodemask_retry = 0;
> -    nodeid_t first_node, node = MEMF_get_node(memflags);
> +    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
>      unsigned long request = 1UL << order;
>      struct page_info *pg;
>      nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
> @@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
>  
>      if ( node == NUMA_NO_NODE )
>      {
> -        memflags &= ~MEMF_exact_node;
>          if ( d != NULL )
>          {
>              node = next_node(d->last_alloc_node, nodemask);
> @@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
>                      goto found;
>          } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
>  
> -        if ( memflags & MEMF_exact_node )
> +        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
>              goto not_found;
>  
>          /* Pick next node. */
> @@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
>          if ( node == first_node )
>          {
>              /* When we have tried all in nodemask, we fall back to others. */
> -            if ( nodemask_retry++ )
> +            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
>                  goto not_found;
>              nodes_andnot(nodemask, node_online_map, nodemask);
>              first_node = node = first_node(nodemask);
> --- a/xen/include/asm-x86/setup.h
> +++ b/xen/include/asm-x86/setup.h
> @@ -2,6 +2,7 @@
>  #define __X86_SETUP_H_
>  
>  #include <xen/multiboot.h>
> +#include <asm/numa.h>
>  
>  extern unsigned long xenheap_initial_phys_start;
>  
> @@ -32,7 +33,7 @@ int construct_dom0(
>      void *(*bootstrap_map)(const module_t *),
>      char *cmdline);
>  
> -unsigned long initial_images_nrpages(void);
> +unsigned long initial_images_nrpages(nodeid_t node);
>  void discard_initial_images(void);
>  
>  unsigned int dom0_max_vcpus(void);
>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-05 16:11   ` Andrew Cooper
@ 2015-03-05 16:43     ` Jan Beulich
  2015-03-05 17:27       ` Andrew Cooper
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-03-05 16:43 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: KeirFraser, Tim Deegan, Dario Faggioli, Ian Jackson, Ian Campbell,
	xen-devel

>>> On 05.03.15 at 17:11, <andrew.cooper3@citrix.com> wrote:
> On 26/02/15 13:52, Jan Beulich wrote:
>> ---
>> I'm uncertain whether range restricting the PXMs permitted for Dom0 is
>> the right approach (matching what other NUMA code did until recently),
>> or whether we would instead want to simply limit the number of PXMs we
>> can handler there (i.e. using a static array instead of a static
>> bitmap).
> 
> I am not quite sure what you mean by this.

Since we somehow need to store the information provided on the
command line, we have basically two options (taking into consideration
that command line parsing happens very early): Either we use a bitmap
(as done here) to track the PXMs provided (which puts an upper bound
on the PXM values) or we store each PXM in a (static) array, which puts
an upper bound on how many PXMs may be specified (and requires
more storage per PXM).

>> +static struct vcpu *__init setup_vcpu(struct domain *d, unsigned int vcpu_id,
>> +                                      unsigned int cpu)
> 
> I would be tempted to name this setup_dom0_vcpu() to bring it in line
> with other dom0 construction functions, and to make it clear that it is
> not for use on general domains.

Makes sense, will do.

>>  unsigned int __init dom0_max_vcpus(void)
>>  {
>> -    unsigned max_vcpus;
>> +    unsigned int pxm, max_vcpus;
>> +    nodeid_t node;
>> +
>> +    for ( pxm = 0; pxm <= MAX_DOM0_PXM; ++pxm )
>> +        if ( test_bit(pxm, dom0_pxms) &&
>> +             (node = pxm_to_node(pxm)) != NUMA_NO_NODE )
>> +            node_set(node, dom0_nodes);
> 
> for_each_set_bit() ?  It would simply the above loop slightly, and is
> more efficient.

Sure - how did I forget about this?

>> @@ -1230,11 +1284,11 @@ int __init construct_dom0(
>>  
>>      printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
>>  
>> -    cpu = cpumask_first(cpupool0->cpu_valid);
>> +    cpu = v->processor;
>>      for ( i = 1; i < d->max_vcpus; i++ )
>>      {
>> -        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
>> -        (void)alloc_vcpu(d, i, cpu);
>> +        cpu = cpumask_cycle(cpu, &dom0_cpus);
>> +        setup_vcpu(d, i, cpu);
> 
> I know this is a preexisting bug, but you might want to fix it as you
> are changing the affected codepath.
> 
> Construction of dom0 should fail if any alloc_vcpu() call fails (and now
> setup_vcpu()).   Nothing currently catches a failure to allocate vcpu
> 1..$N, and this patch introduces a way for dom0 be more
> memory-constrained than previously.

I'm not sure - starting Dom0 with (in the worst case) just one vCPU
would seem better to me than not starting it at all.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-05 16:43     ` Jan Beulich
@ 2015-03-05 17:27       ` Andrew Cooper
  2015-03-06  9:19         ` [PATCH 1/5 v2] " Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 17:27 UTC (permalink / raw)
  To: Jan Beulich
  Cc: KeirFraser, Tim Deegan, Dario Faggioli, Ian Jackson, Ian Campbell,
	xen-devel

On 05/03/15 16:43, Jan Beulich wrote:
>>>> On 05.03.15 at 17:11, <andrew.cooper3@citrix.com> wrote:
>> On 26/02/15 13:52, Jan Beulich wrote:
>>> ---
>>> I'm uncertain whether range restricting the PXMs permitted for Dom0 is
>>> the right approach (matching what other NUMA code did until recently),
>>> or whether we would instead want to simply limit the number of PXMs we
>>> can handler there (i.e. using a static array instead of a static
>>> bitmap).
>> I am not quite sure what you mean by this.
> Since we somehow need to store the information provided on the
> command line, we have basically two options (taking into consideration
> that command line parsing happens very early): Either we use a bitmap
> (as done here) to track the PXMs provided (which puts an upper bound
> on the PXM values) or we store each PXM in a (static) array, which puts
> an upper bound on how many PXMs may be specified (and requires
> more storage per PXM).

Oh I see.

In which case I suggest a static array with a bounded based on
NUMA_NODE_SHIFT.  It is __initdata so storage size is not a big deal,
and its upper bound will be the same as Xen will tolerate with distinct
NUMA nodes.

>>> @@ -1230,11 +1284,11 @@ int __init construct_dom0(
>>>  
>>>      printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
>>>  
>>> -    cpu = cpumask_first(cpupool0->cpu_valid);
>>> +    cpu = v->processor;
>>>      for ( i = 1; i < d->max_vcpus; i++ )
>>>      {
>>> -        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
>>> -        (void)alloc_vcpu(d, i, cpu);
>>> +        cpu = cpumask_cycle(cpu, &dom0_cpus);
>>> +        setup_vcpu(d, i, cpu);
>> I know this is a preexisting bug, but you might want to fix it as you
>> are changing the affected codepath.
>>
>> Construction of dom0 should fail if any alloc_vcpu() call fails (and now
>> setup_vcpu()).   Nothing currently catches a failure to allocate vcpu
>> 1..$N, and this patch introduces a way for dom0 be more
>> memory-constrained than previously.
> I'm not sure - starting Dom0 with (in the worst case) just one vCPU
> would seem better to me than not starting it at all.

Having d->max_vcpus set and any of d->vcpu[1 ... d->max_vcpus] as NULL
pointers will quickly cause all sorts of misery.

In the case that the failure is -ENOMEM, it is likely that something
else on the construction path will also blow up, so killing dom0
completely is not really an overreaction.  If however if the failure is
something else, booting really should be halted.

An alternative course of action would be to reduce d->max_vcpus down a
bit on failure, but then the order of operations will have to change so
no other construction code makes decisions based on an overly large
max_vcpus.

The max_vcpus hypercall specifically avoids letting this happen when
constructing domUs.

~Andrew

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/5 v2] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-05 17:27       ` Andrew Cooper
@ 2015-03-06  9:19         ` Jan Beulich
  2015-03-06 10:41           ` Dario Faggioli
  2015-03-06 16:05           ` Andrew Cooper
  0 siblings, 2 replies; 51+ messages in thread
From: Jan Beulich @ 2015-03-06  9:19 UTC (permalink / raw)
  To: xen-devel
  Cc: KeirFraser, Andrew Cooper, Dario Faggioli, Ian Jackson, TimDeegan,
	Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 9851 bytes --]

... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
"dom0_max_vcpus" ones.

Note that this gives meaning to MEMF_exact_node specified alone (i.e.
implicitly combined with NUMA_NO_NODE): In such a case any node inside
the domain's node mask is acceptable, but no other node. This changed
behavior is (implicitly) being exposed through the memop hypercalls.

Note further that this change doesn't take care of moving the initrd
image into memory matching Dom0's affinity when the initrd doesn't get
copied (because of being part of the initial mapping) anyway.

And note finally that this doesn't get us meaningfully closer to
handing vNUMA information to Dom0 (which will require the current
striping of allocations to become node-specific in order for the passed
on information to be meaningful).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Use MAX_NUMNODES sized array for storing PXMs. Implement relaxed
    mode. Minor other cleanup.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -540,6 +540,18 @@ any dom0 autoballooning feature present 
 _xl.conf(5)_ man page or [Xen Best
 Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
 
+### dom0\_nodes
+
+> `= List of [ <integer> | relaxed | strict ]`
+
+> Default: `strict`
+
+Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
+and memory assigned to Dom0 will be adjusted to match the node
+restrictions set up here. Note that the values to be specified here are
+ACPI PXM ones, not Xen internal node numbers. `relaxed` sets up vCPU
+affinities to prefer but be not limited to the specified node(s).
+
 ### dom0\_shadow
 > `= <boolean>`
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -100,11 +100,70 @@ static void __init parse_dom0_max_vcpus(
 }
 custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
 
+static __initdata unsigned int dom0_nr_pxms;
+static __initdata unsigned int dom0_pxms[MAX_NUMNODES] =
+    { [0 ... MAX_NUMNODES - 1] = ~0 };
+static __initdata bool_t dom0_affinity_relaxed;
+
+static void __init parse_dom0_nodes(const char *s)
+{
+    do {
+        if ( isdigit(*s) )
+            dom0_pxms[dom0_nr_pxms] = simple_strtoul(s, &s, 0);
+        else if ( !strncmp(s, "relaxed", 7) && (!s[7] || s[7] == ',') )
+        {
+            dom0_affinity_relaxed = 1;
+            s += 7;
+        }
+        else if ( !strncmp(s, "strict", 6) && (!s[6] || s[6] == ',') )
+        {
+            dom0_affinity_relaxed = 0;
+            s += 6;
+        }
+        else
+            break;
+    } while ( ++dom0_nr_pxms < ARRAY_SIZE(dom0_pxms) && *s++ == ',' );
+}
+custom_param("dom0_nodes", parse_dom0_nodes);
+
+static cpumask_t __initdata dom0_cpus;
+
+static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
+                                           unsigned int vcpu_id,
+                                           unsigned int cpu)
+{
+    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
+
+    if ( v )
+    {
+        if ( !d->is_pinned && !dom0_affinity_relaxed )
+            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
+        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
+    }
+
+    return v;
+}
+
+static nodemask_t __initdata dom0_nodes;
+
 unsigned int __init dom0_max_vcpus(void)
 {
-    unsigned max_vcpus;
+    unsigned int i, max_vcpus;
+    nodeid_t node;
+
+    for ( i = 0; i < dom0_nr_pxms; ++i )
+        if ( (node = pxm_to_node(dom0_pxms[i])) != NUMA_NO_NODE )
+            node_set(node, dom0_nodes);
+    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
+    if ( nodes_empty(dom0_nodes) )
+        dom0_nodes = node_online_map;
+    for_each_node_mask ( node, dom0_nodes )
+        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
+    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
+    if ( cpumask_empty(&dom0_cpus) )
+        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
 
-    max_vcpus = num_cpupool_cpus(cpupool0);
+    max_vcpus = cpumask_weight(&dom0_cpus);
     if ( opt_dom0_max_vcpus_min > max_vcpus )
         max_vcpus = opt_dom0_max_vcpus_min;
     if ( opt_dom0_max_vcpus_max < max_vcpus )
@@ -119,12 +178,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
 {
     unsigned int max_vcpus = dom0_max_vcpus();
 
+    dom0->node_affinity = dom0_nodes;
+    dom0->auto_node_affinity = !dom0_nr_pxms;
+
     dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
     if ( !dom0->vcpu )
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return alloc_vcpu(dom0, 0, 0);
+    return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -156,7 +218,7 @@ static struct page_info * __init alloc_c
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma;
+    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -190,7 +252,7 @@ static struct page_info * __init alloc_c
 
         if ( d->tot_pages + (1 << order) > d->max_pages )
             continue;
-        pg2 = alloc_domheap_pages(d, order, 0);
+        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
         if ( pg2 > page )
         {
             free_domheap_pages(page, free_order);
@@ -217,10 +279,14 @@ static unsigned long __init dom0_paging_
 static unsigned long __init compute_dom0_nr_pages(
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
-    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
-    unsigned long nr_pages, min_pages, max_pages;
+    nodeid_t node;
+    unsigned long avail = 0, nr_pages, min_pages, max_pages;
     bool_t need_paging;
 
+    for_each_node_mask ( node, dom0_nodes )
+        avail += avail_domheap_pages_region(node, 0, 0) +
+                 initial_images_nrpages(node);
+
     /* Reserve memory for further dom0 vcpu-struct allocations... */
     avail -= (d->max_vcpus - 1UL)
              << get_order_from_bytes(sizeof(struct vcpu));
@@ -1230,11 +1296,11 @@ int __init construct_dom0(
 
     printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
 
-    cpu = cpumask_first(cpupool0->cpu_valid);
+    cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
-        (void)alloc_vcpu(d, i, cpu);
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
     }
 
     /*
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
-unsigned long __init initial_images_nrpages(void)
+unsigned long __init initial_images_nrpages(nodeid_t node)
 {
+    unsigned long node_start = node_start_pfn(node);
+    unsigned long node_end = node_end_pfn(node);
     unsigned long nr;
     unsigned int i;
 
     for ( nr = i = 0; i < nr_initial_images; ++i )
-        nr += PFN_UP(initial_images[i].mod_end);
+    {
+        unsigned long start = initial_images[i].mod_start;
+        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
+
+        if ( end > node_start && node_end > start )
+            nr += min(node_end, end) - max(node_start, start);
+    }
 
     return nr;
 }
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
     struct domain *d)
 {
     unsigned int i, j, zone = 0, nodemask_retry = 0;
-    nodeid_t first_node, node = MEMF_get_node(memflags);
+    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
@@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
 
     if ( node == NUMA_NO_NODE )
     {
-        memflags &= ~MEMF_exact_node;
         if ( d != NULL )
         {
             node = next_node(d->last_alloc_node, nodemask);
@@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
                     goto found;
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
-        if ( memflags & MEMF_exact_node )
+        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
             goto not_found;
 
         /* Pick next node. */
@@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
         if ( node == first_node )
         {
             /* When we have tried all in nodemask, we fall back to others. */
-            if ( nodemask_retry++ )
+            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
                 goto not_found;
             nodes_andnot(nodemask, node_online_map, nodemask);
             first_node = node = first_node(nodemask);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -2,6 +2,7 @@
 #define __X86_SETUP_H_
 
 #include <xen/multiboot.h>
+#include <asm/numa.h>
 
 extern unsigned long xenheap_initial_phys_start;
 
@@ -32,7 +33,7 @@ int construct_dom0(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline);
 
-unsigned long initial_images_nrpages(void);
+unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 
 unsigned int dom0_max_vcpus(void);



[-- Attachment #2: x86-Dom0-nodes.patch --]
[-- Type: text/plain, Size: 9906 bytes --]

x86: allow specifying the NUMA nodes Dom0 should run on

... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
"dom0_max_vcpus" ones.

Note that this gives meaning to MEMF_exact_node specified alone (i.e.
implicitly combined with NUMA_NO_NODE): In such a case any node inside
the domain's node mask is acceptable, but no other node. This changed
behavior is (implicitly) being exposed through the memop hypercalls.

Note further that this change doesn't take care of moving the initrd
image into memory matching Dom0's affinity when the initrd doesn't get
copied (because of being part of the initial mapping) anyway.

And note finally that this doesn't get us meaningfully closer to
handing vNUMA information to Dom0 (which will require the current
striping of allocations to become node-specific in order for the passed
on information to be meaningful).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
v2: Use MAX_NUMNODES sized array for storing PXMs. Implement relaxed
    mode. Minor other cleanup.

--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -540,6 +540,18 @@ any dom0 autoballooning feature present 
 _xl.conf(5)_ man page or [Xen Best
 Practices](http://wiki.xen.org/wiki/Xen_Best_Practices#Xen_dom0_dedicated_memory_and_preventing_dom0_memory_ballooning).
 
+### dom0\_nodes
+
+> `= List of [ <integer> | relaxed | strict ]`
+
+> Default: `strict`
+
+Specify the NUMA nodes to place Dom0 on. Defaults for vCPU-s created
+and memory assigned to Dom0 will be adjusted to match the node
+restrictions set up here. Note that the values to be specified here are
+ACPI PXM ones, not Xen internal node numbers. `relaxed` sets up vCPU
+affinities to prefer but be not limited to the specified node(s).
+
 ### dom0\_shadow
 > `= <boolean>`
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -100,11 +100,70 @@ static void __init parse_dom0_max_vcpus(
 }
 custom_param("dom0_max_vcpus", parse_dom0_max_vcpus);
 
+static __initdata unsigned int dom0_nr_pxms;
+static __initdata unsigned int dom0_pxms[MAX_NUMNODES] =
+    { [0 ... MAX_NUMNODES - 1] = ~0 };
+static __initdata bool_t dom0_affinity_relaxed;
+
+static void __init parse_dom0_nodes(const char *s)
+{
+    do {
+        if ( isdigit(*s) )
+            dom0_pxms[dom0_nr_pxms] = simple_strtoul(s, &s, 0);
+        else if ( !strncmp(s, "relaxed", 7) && (!s[7] || s[7] == ',') )
+        {
+            dom0_affinity_relaxed = 1;
+            s += 7;
+        }
+        else if ( !strncmp(s, "strict", 6) && (!s[6] || s[6] == ',') )
+        {
+            dom0_affinity_relaxed = 0;
+            s += 6;
+        }
+        else
+            break;
+    } while ( ++dom0_nr_pxms < ARRAY_SIZE(dom0_pxms) && *s++ == ',' );
+}
+custom_param("dom0_nodes", parse_dom0_nodes);
+
+static cpumask_t __initdata dom0_cpus;
+
+static struct vcpu *__init setup_dom0_vcpu(struct domain *d,
+                                           unsigned int vcpu_id,
+                                           unsigned int cpu)
+{
+    struct vcpu *v = alloc_vcpu(d, vcpu_id, cpu);
+
+    if ( v )
+    {
+        if ( !d->is_pinned && !dom0_affinity_relaxed )
+            cpumask_copy(v->cpu_hard_affinity, &dom0_cpus);
+        cpumask_copy(v->cpu_soft_affinity, &dom0_cpus);
+    }
+
+    return v;
+}
+
+static nodemask_t __initdata dom0_nodes;
+
 unsigned int __init dom0_max_vcpus(void)
 {
-    unsigned max_vcpus;
+    unsigned int i, max_vcpus;
+    nodeid_t node;
+
+    for ( i = 0; i < dom0_nr_pxms; ++i )
+        if ( (node = pxm_to_node(dom0_pxms[i])) != NUMA_NO_NODE )
+            node_set(node, dom0_nodes);
+    nodes_and(dom0_nodes, dom0_nodes, node_online_map);
+    if ( nodes_empty(dom0_nodes) )
+        dom0_nodes = node_online_map;
+    for_each_node_mask ( node, dom0_nodes )
+        cpumask_or(&dom0_cpus, &dom0_cpus, &node_to_cpumask(node));
+    cpumask_and(&dom0_cpus, &dom0_cpus, cpupool0->cpu_valid);
+    if ( cpumask_empty(&dom0_cpus) )
+        cpumask_copy(&dom0_cpus, cpupool0->cpu_valid);
 
-    max_vcpus = num_cpupool_cpus(cpupool0);
+    max_vcpus = cpumask_weight(&dom0_cpus);
     if ( opt_dom0_max_vcpus_min > max_vcpus )
         max_vcpus = opt_dom0_max_vcpus_min;
     if ( opt_dom0_max_vcpus_max < max_vcpus )
@@ -119,12 +178,15 @@ struct vcpu *__init alloc_dom0_vcpu0(str
 {
     unsigned int max_vcpus = dom0_max_vcpus();
 
+    dom0->node_affinity = dom0_nodes;
+    dom0->auto_node_affinity = !dom0_nr_pxms;
+
     dom0->vcpu = xzalloc_array(struct vcpu *, max_vcpus);
     if ( !dom0->vcpu )
         return NULL;
     dom0->max_vcpus = max_vcpus;
 
-    return alloc_vcpu(dom0, 0, 0);
+    return setup_dom0_vcpu(dom0, 0, cpumask_first(&dom0_cpus));
 }
 
 #ifdef CONFIG_SHADOW_PAGING
@@ -156,7 +218,7 @@ static struct page_info * __init alloc_c
     struct domain *d, unsigned long max_pages)
 {
     static unsigned int __initdata last_order = MAX_ORDER;
-    static unsigned int __initdata memflags = MEMF_no_dma;
+    static unsigned int __initdata memflags = MEMF_no_dma|MEMF_exact_node;
     struct page_info *page;
     unsigned int order = get_order_from_pages(max_pages), free_order;
 
@@ -190,7 +252,7 @@ static struct page_info * __init alloc_c
 
         if ( d->tot_pages + (1 << order) > d->max_pages )
             continue;
-        pg2 = alloc_domheap_pages(d, order, 0);
+        pg2 = alloc_domheap_pages(d, order, MEMF_exact_node);
         if ( pg2 > page )
         {
             free_domheap_pages(page, free_order);
@@ -217,10 +279,14 @@ static unsigned long __init dom0_paging_
 static unsigned long __init compute_dom0_nr_pages(
     struct domain *d, struct elf_dom_parms *parms, unsigned long initrd_len)
 {
-    unsigned long avail = avail_domheap_pages() + initial_images_nrpages();
-    unsigned long nr_pages, min_pages, max_pages;
+    nodeid_t node;
+    unsigned long avail = 0, nr_pages, min_pages, max_pages;
     bool_t need_paging;
 
+    for_each_node_mask ( node, dom0_nodes )
+        avail += avail_domheap_pages_region(node, 0, 0) +
+                 initial_images_nrpages(node);
+
     /* Reserve memory for further dom0 vcpu-struct allocations... */
     avail -= (d->max_vcpus - 1UL)
              << get_order_from_bytes(sizeof(struct vcpu));
@@ -1230,11 +1296,11 @@ int __init construct_dom0(
 
     printk("Dom0 has maximum %u VCPUs\n", d->max_vcpus);
 
-    cpu = cpumask_first(cpupool0->cpu_valid);
+    cpu = v->processor;
     for ( i = 1; i < d->max_vcpus; i++ )
     {
-        cpu = cpumask_cycle(cpu, cpupool0->cpu_valid);
-        (void)alloc_vcpu(d, i, cpu);
+        cpu = cpumask_cycle(cpu, &dom0_cpus);
+        setup_dom0_vcpu(d, i, cpu);
     }
 
     /*
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -140,13 +140,21 @@ static void __init parse_acpi_param(char
 static const module_t *__initdata initial_images;
 static unsigned int __initdata nr_initial_images;
 
-unsigned long __init initial_images_nrpages(void)
+unsigned long __init initial_images_nrpages(nodeid_t node)
 {
+    unsigned long node_start = node_start_pfn(node);
+    unsigned long node_end = node_end_pfn(node);
     unsigned long nr;
     unsigned int i;
 
     for ( nr = i = 0; i < nr_initial_images; ++i )
-        nr += PFN_UP(initial_images[i].mod_end);
+    {
+        unsigned long start = initial_images[i].mod_start;
+        unsigned long end = start + PFN_UP(initial_images[i].mod_end);
+
+        if ( end > node_start && node_end > start )
+            nr += min(node_end, end) - max(node_start, start);
+    }
 
     return nr;
 }
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -581,7 +581,7 @@ static struct page_info *alloc_heap_page
     struct domain *d)
 {
     unsigned int i, j, zone = 0, nodemask_retry = 0;
-    nodeid_t first_node, node = MEMF_get_node(memflags);
+    nodeid_t first_node, node = MEMF_get_node(memflags), req_node = node;
     unsigned long request = 1UL << order;
     struct page_info *pg;
     nodemask_t nodemask = (d != NULL ) ? d->node_affinity : node_online_map;
@@ -593,7 +593,6 @@ static struct page_info *alloc_heap_page
 
     if ( node == NUMA_NO_NODE )
     {
-        memflags &= ~MEMF_exact_node;
         if ( d != NULL )
         {
             node = next_node(d->last_alloc_node, nodemask);
@@ -654,7 +653,7 @@ static struct page_info *alloc_heap_page
                     goto found;
         } while ( zone-- > zone_lo ); /* careful: unsigned zone may wrap */
 
-        if ( memflags & MEMF_exact_node )
+        if ( (memflags & MEMF_exact_node) && req_node != NUMA_NO_NODE )
             goto not_found;
 
         /* Pick next node. */
@@ -671,7 +670,7 @@ static struct page_info *alloc_heap_page
         if ( node == first_node )
         {
             /* When we have tried all in nodemask, we fall back to others. */
-            if ( nodemask_retry++ )
+            if ( (memflags & MEMF_exact_node) || nodemask_retry++ )
                 goto not_found;
             nodes_andnot(nodemask, node_online_map, nodemask);
             first_node = node = first_node(nodemask);
--- a/xen/include/asm-x86/setup.h
+++ b/xen/include/asm-x86/setup.h
@@ -2,6 +2,7 @@
 #define __X86_SETUP_H_
 
 #include <xen/multiboot.h>
+#include <asm/numa.h>
 
 extern unsigned long xenheap_initial_phys_start;
 
@@ -32,7 +33,7 @@ int construct_dom0(
     void *(*bootstrap_map)(const module_t *),
     char *cmdline);
 
-unsigned long initial_images_nrpages(void);
+unsigned long initial_images_nrpages(nodeid_t node);
 void discard_initial_images(void);
 
 unsigned int dom0_max_vcpus(void);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5 v2] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06  9:19         ` [PATCH 1/5 v2] " Jan Beulich
@ 2015-03-06 10:41           ` Dario Faggioli
  2015-03-06 16:05           ` Andrew Cooper
  1 sibling, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-03-06 10:41 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Tim (Xen.org),
	xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 1217 bytes --]

On Fri, 2015-03-06 at 09:19 +0000, Jan Beulich wrote:
> ... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
> "dom0_max_vcpus" ones.
> 
> Note that this gives meaning to MEMF_exact_node specified alone (i.e.
> implicitly combined with NUMA_NO_NODE): In such a case any node inside
> the domain's node mask is acceptable, but no other node. This changed
> behavior is (implicitly) being exposed through the memop hypercalls.
> 
> Note further that this change doesn't take care of moving the initrd
> image into memory matching Dom0's affinity when the initrd doesn't get
> copied (because of being part of the initial mapping) anyway.
> 
> And note finally that this doesn't get us meaningfully closer to
> handing vNUMA information to Dom0 (which will require the current
> striping of allocations to become node-specific in order for the passed
> on information to be meaningful).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>
> ---
> v2: Use MAX_NUMNODES sized array for storing PXMs. Implement relaxed
>     mode. Minor other cleanup.
> 
Reviewed-by: Dario Faggioli <dario.faggioli@cirix.com>

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/5 v2] x86: allow specifying the NUMA nodes Dom0 should run on
  2015-03-06  9:19         ` [PATCH 1/5 v2] " Jan Beulich
  2015-03-06 10:41           ` Dario Faggioli
@ 2015-03-06 16:05           ` Andrew Cooper
  1 sibling, 0 replies; 51+ messages in thread
From: Andrew Cooper @ 2015-03-06 16:05 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Ian Campbell, KeirFraser, Dario Faggioli, Ian Jackson, TimDeegan

On 06/03/2015 09:19, Jan Beulich wrote:
> ... by introducing a "dom0_nodes" option augmenting the "dom0_mem" and
> "dom0_max_vcpus" ones.
>
> Note that this gives meaning to MEMF_exact_node specified alone (i.e.
> implicitly combined with NUMA_NO_NODE): In such a case any node inside
> the domain's node mask is acceptable, but no other node. This changed
> behavior is (implicitly) being exposed through the memop hypercalls.
>
> Note further that this change doesn't take care of moving the initrd
> image into memory matching Dom0's affinity when the initrd doesn't get
> copied (because of being part of the initial mapping) anyway.
>
> And note finally that this doesn't get us meaningfully closer to
> handing vNUMA information to Dom0 (which will require the current
> striping of allocations to become node-specific in order for the passed
> on information to be meaningful).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
@ 2015-02-26 13:53 ` Jan Beulich
  2015-02-27 11:34   ` Dario Faggioli
                     ` (2 more replies)
  2015-02-26 13:54 ` [PATCH 3/5] x86: widen NUMA nodes to be allocated from Jan Beulich
                   ` (3 subsequent siblings)
  5 siblings, 3 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:53 UTC (permalink / raw)
  To: xen-devel
  Cc: Ian Campbell, Keir Fraser, Dario Faggioli, Ian Jackson,
	Tim Deegan

[-- Attachment #1: Type: text/plain, Size: 3064 bytes --]

... using struct domain as a container for passing the respective
affinity mask: Quite a number of allocations are domain specific, yet
not to be accounted for that domain. Introduce a flag suppressing the
accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
right away in common code (x86 and IOMMU code will get adjusted
subsequently).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1657,7 +1657,8 @@ gnttab_transfer(
             struct page_info *new_page;
             void *sp, *dp;
 
-            new_page = alloc_domheap_page(NULL, MEMF_bits(max_bitsize));
+            new_page = alloc_domheap_page(e, MEMF_no_owner |
+                                             MEMF_bits(max_bitsize));
             if ( new_page == NULL )
             {
                 gop.status = GNTST_address_too_big;
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -462,7 +462,8 @@ static long memory_exchange(XEN_GUEST_HA
         /* Allocate a chunk's worth of anonymous output pages. */
         for ( j = 0; j < (1UL << out_chunk_order); j++ )
         {
-            page = alloc_domheap_pages(NULL, exch.out.extent_order, memflags);
+            page = alloc_domheap_pages(d, exch.out.extent_order,
+                                       MEMF_no_owner | memflags);
             if ( unlikely(page == NULL) )
             {
                 rc = -ENOMEM;
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1685,10 +1685,14 @@ struct page_info *alloc_domheap_pages(
 
     ASSERT(!in_irq());
 
-    bits = domain_clamp_alloc_bitsize(d, bits ? : (BITS_PER_LONG+PAGE_SHIFT));
+    bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
+                                      bits ? : (BITS_PER_LONG+PAGE_SHIFT));
     if ( (zone_hi = min_t(unsigned int, bits_to_zone(bits), zone_hi)) == 0 )
         return NULL;
 
+    if ( memflags & MEMF_no_owner )
+        memflags |= MEMF_no_refcount;
+
     if ( dma_bitsize && ((dma_zone = bits_to_zone(dma_bitsize)) < zone_hi) )
         pg = alloc_heap_pages(dma_zone + 1, zone_hi, order, memflags, d);
 
@@ -1698,7 +1702,8 @@ struct page_info *alloc_domheap_pages(
                                   memflags, d)) == NULL)) )
          return NULL;
 
-    if ( (d != NULL) && assign_pages(d, pg, order, memflags) )
+    if ( d && !(memflags & MEMF_no_owner) &&
+         assign_pages(d, pg, order, memflags) )
     {
         free_heap_pages(pg, order);
         return NULL;
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -120,6 +120,8 @@ struct npfec {
 #define  MEMF_no_dma      (1U<<_MEMF_no_dma)
 #define _MEMF_exact_node  4
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
+#define _MEMF_no_owner    5
+#define  MEMF_no_owner    (1U<<_MEMF_no_owner)
 #define _MEMF_node        8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n)     ((((n) + 1) & MEMF_node_mask) << _MEMF_node)




[-- Attachment #2: heap-alloc-for-domain.patch --]
[-- Type: text/plain, Size: 3126 bytes --]

allow domain heap allocations to specify more than one NUMA node

... using struct domain as a container for passing the respective
affinity mask: Quite a number of allocations are domain specific, yet
not to be accounted for that domain. Introduce a flag suppressing the
accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
right away in common code (x86 and IOMMU code will get adjusted
subsequently).

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/common/grant_table.c
+++ b/xen/common/grant_table.c
@@ -1657,7 +1657,8 @@ gnttab_transfer(
             struct page_info *new_page;
             void *sp, *dp;
 
-            new_page = alloc_domheap_page(NULL, MEMF_bits(max_bitsize));
+            new_page = alloc_domheap_page(e, MEMF_no_owner |
+                                             MEMF_bits(max_bitsize));
             if ( new_page == NULL )
             {
                 gop.status = GNTST_address_too_big;
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -462,7 +462,8 @@ static long memory_exchange(XEN_GUEST_HA
         /* Allocate a chunk's worth of anonymous output pages. */
         for ( j = 0; j < (1UL << out_chunk_order); j++ )
         {
-            page = alloc_domheap_pages(NULL, exch.out.extent_order, memflags);
+            page = alloc_domheap_pages(d, exch.out.extent_order,
+                                       MEMF_no_owner | memflags);
             if ( unlikely(page == NULL) )
             {
                 rc = -ENOMEM;
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -1685,10 +1685,14 @@ struct page_info *alloc_domheap_pages(
 
     ASSERT(!in_irq());
 
-    bits = domain_clamp_alloc_bitsize(d, bits ? : (BITS_PER_LONG+PAGE_SHIFT));
+    bits = domain_clamp_alloc_bitsize(memflags & MEMF_no_owner ? NULL : d,
+                                      bits ? : (BITS_PER_LONG+PAGE_SHIFT));
     if ( (zone_hi = min_t(unsigned int, bits_to_zone(bits), zone_hi)) == 0 )
         return NULL;
 
+    if ( memflags & MEMF_no_owner )
+        memflags |= MEMF_no_refcount;
+
     if ( dma_bitsize && ((dma_zone = bits_to_zone(dma_bitsize)) < zone_hi) )
         pg = alloc_heap_pages(dma_zone + 1, zone_hi, order, memflags, d);
 
@@ -1698,7 +1702,8 @@ struct page_info *alloc_domheap_pages(
                                   memflags, d)) == NULL)) )
          return NULL;
 
-    if ( (d != NULL) && assign_pages(d, pg, order, memflags) )
+    if ( d && !(memflags & MEMF_no_owner) &&
+         assign_pages(d, pg, order, memflags) )
     {
         free_heap_pages(pg, order);
         return NULL;
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -120,6 +120,8 @@ struct npfec {
 #define  MEMF_no_dma      (1U<<_MEMF_no_dma)
 #define _MEMF_exact_node  4
 #define  MEMF_exact_node  (1U<<_MEMF_exact_node)
+#define _MEMF_no_owner    5
+#define  MEMF_no_owner    (1U<<_MEMF_no_owner)
 #define _MEMF_node        8
 #define  MEMF_node_mask   ((1U << (8 * sizeof(nodeid_t))) - 1)
 #define  MEMF_node(n)     ((((n) + 1) & MEMF_node_mask) << _MEMF_node)

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node
  2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
@ 2015-02-27 11:34   ` Dario Faggioli
  2015-03-02 17:12   ` Ian Campbell
  2015-03-05 16:18   ` Andrew Cooper
  2 siblings, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 11:34 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Ian Jackson, Keir (Xen.org), Tim (Xen.org), Ian Campbell,
	xen-devel@lists.xenproject.org


[-- Attachment #1.1: Type: text/plain, Size: 562 bytes --]

On Thu, 2015-02-26 at 13:53 +0000, Jan Beulich wrote:
> ... using struct domain as a container for passing the respective
> affinity mask: Quite a number of allocations are domain specific, yet
> not to be accounted for that domain. Introduce a flag suppressing the
> accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
> right away in common code (x86 and IOMMU code will get adjusted
> subsequently).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node
  2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
  2015-02-27 11:34   ` Dario Faggioli
@ 2015-03-02 17:12   ` Ian Campbell
  2015-03-03  7:59     ` Jan Beulich
  2015-03-05 16:18   ` Andrew Cooper
  2 siblings, 1 reply; 51+ messages in thread
From: Ian Campbell @ 2015-03-02 17:12 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Dario Faggioli, Keir Fraser, Ian Jackson, Tim Deegan

On Thu, 2015-02-26 at 13:53 +0000, Jan Beulich wrote:
> ... using struct domain as a container for passing the respective
> affinity mask: Quite a number of allocations are domain specific, yet
> not to be accounted for that domain. Introduce a flag suppressing the
> accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
> right away in common code (x86 and IOMMU code will get adjusted
> subsequently).
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Acked-by: Ian Campbell <ian.campbell@citrix.com>

Does this patch constitute all the "not just"(x86) from the initial
mail? I'll assume so unless I hear otherwise.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node
  2015-03-02 17:12   ` Ian Campbell
@ 2015-03-03  7:59     ` Jan Beulich
  0 siblings, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-03-03  7:59 UTC (permalink / raw)
  To: Ian Campbell
  Cc: xen-devel, Dario Faggioli, KeirFraser, Ian Jackson, Tim Deegan

>>> On 02.03.15 at 18:12, <ian.campbell@citrix.com> wrote:
> On Thu, 2015-02-26 at 13:53 +0000, Jan Beulich wrote:
>> ... using struct domain as a container for passing the respective
>> affinity mask: Quite a number of allocations are domain specific, yet
>> not to be accounted for that domain. Introduce a flag suppressing the
>> accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
>> right away in common code (x86 and IOMMU code will get adjusted
>> subsequently).
>> 
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
> Acked-by: Ian Campbell <ian.campbell@citrix.com>

Thanks.

> Does this patch constitute all the "not just"(x86) from the initial
> mail? I'll assume so unless I hear otherwise.

No, this part of patch 1

"Note that this gives meaning to MEMF_exact_node specified alone (i.e.
 implicitly combined with NUMA_NO_NODE): In such a case any node inside
 the domain's node mask is acceptable, but no other node. This changed
 behavior is (implicitly) being exposed through the memop hypercalls."

does too. (Patches 4 and 5 are only indirectly x86-specific, as the
IOMMU code touched there is used on x86 only. But that's of no
concern to you anyway.)

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node
  2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
  2015-02-27 11:34   ` Dario Faggioli
  2015-03-02 17:12   ` Ian Campbell
@ 2015-03-05 16:18   ` Andrew Cooper
  2 siblings, 0 replies; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 16:18 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Ian Campbell, Ian Jackson, Dario Faggioli, Keir Fraser,
	Tim Deegan

On 26/02/15 13:53, Jan Beulich wrote:
> ... using struct domain as a container for passing the respective
> affinity mask: Quite a number of allocations are domain specific, yet
> not to be accounted for that domain. Introduce a flag suppressing the
> accounting altogether (i.e. going beyond MEMF_no_refcount) and use it
> right away in common code (x86 and IOMMU code will get adjusted
> subsequently).
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
  2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
  2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
@ 2015-02-26 13:54 ` Jan Beulich
  2015-02-27 13:27   ` Dario Faggioli
  2015-03-05 16:39   ` Andrew Cooper
  2015-02-26 13:55 ` [PATCH 4/5] VT-d: " Jan Beulich
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:54 UTC (permalink / raw)
  To: xen-devel; +Cc: Andrew Cooper, Dario Faggioli, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 5664 bytes --]

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -285,7 +285,8 @@ struct vcpu_guest_context *alloc_vcpu_gu
 
     for ( i = 0; i < PFN_UP(sizeof(struct vcpu_guest_context)); ++i )
     {
-        struct page_info *pg = alloc_domheap_page(NULL, 0);
+        struct page_info *pg = alloc_domheap_page(current->domain,
+                                                  MEMF_no_owner);
 
         if ( unlikely(pg == NULL) )
         {
@@ -322,7 +323,7 @@ static int setup_compat_l4(struct vcpu *
     l4_pgentry_t *l4tab;
     int rc;
 
-    pg = alloc_domheap_page(NULL, MEMF_node(vcpu_to_node(v)));
+    pg = alloc_domheap_page(v->domain, MEMF_no_owner);
     if ( pg == NULL )
         return -ENOMEM;
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -1182,7 +1182,7 @@ int __init construct_dom0(
     }
     else
     {
-        page = alloc_domheap_page(NULL, 0);
+        page = alloc_domheap_page(d, MEMF_no_owner);
         if ( !page )
             panic("Not enough RAM for domain 0 PML4");
         page->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -150,7 +150,7 @@ long arch_do_domctl(
                 break;
             }
 
-            page = alloc_domheap_page(NULL, 0);
+            page = alloc_domheap_page(current->domain, MEMF_no_owner);
             if ( !page )
             {
                 ret = -ENOMEM;
--- a/xen/arch/x86/hvm/stdvga.c
+++ b/xen/arch/x86/hvm/stdvga.c
@@ -604,7 +604,7 @@ void stdvga_init(struct domain *d)
     
     for ( i = 0; i != ARRAY_SIZE(s->vram_page); i++ )
     {
-        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( pg == NULL )
             break;
         s->vram_page[i] = pg;
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1417,7 +1417,6 @@ HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, la
 int vlapic_init(struct vcpu *v)
 {
     struct vlapic *vlapic = vcpu_vlapic(v);
-    unsigned int memflags = MEMF_node(vcpu_to_node(v));
 
     HVM_DBG_LOG(DBG_LEVEL_VLAPIC, "%d", v->vcpu_id);
 
@@ -1431,7 +1430,7 @@ int vlapic_init(struct vcpu *v)
 
     if (vlapic->regs_page == NULL)
     {
-        vlapic->regs_page = alloc_domheap_page(NULL, memflags);
+        vlapic->regs_page = alloc_domheap_page(v->domain, MEMF_no_owner);
         if ( vlapic->regs_page == NULL )
         {
             dprintk(XENLOG_ERR, "alloc vlapic regs error: %d/%d\n",
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5879,7 +5879,6 @@ int create_perdomain_mapping(struct doma
     l3_pgentry_t *l3tab;
     l2_pgentry_t *l2tab;
     l1_pgentry_t *l1tab;
-    unsigned int memf = MEMF_node(domain_to_node(d));
     int rc = 0;
 
     ASSERT(va >= PERDOMAIN_VIRT_START &&
@@ -5887,7 +5886,7 @@ int create_perdomain_mapping(struct doma
 
     if ( !d->arch.perdomain_l3_pg )
     {
-        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( !pg )
             return -ENOMEM;
         l3tab = __map_domain_page(pg);
@@ -5908,7 +5907,7 @@ int create_perdomain_mapping(struct doma
 
     if ( !(l3e_get_flags(l3tab[l3_table_offset(va)]) & _PAGE_PRESENT) )
     {
-        pg = alloc_domheap_page(NULL, memf);
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( !pg )
         {
             unmap_domain_page(l3tab);
@@ -5937,7 +5936,7 @@ int create_perdomain_mapping(struct doma
         {
             if ( pl1tab && !IS_NIL(pl1tab) )
             {
-                l1tab = alloc_xenheap_pages(0, memf);
+                l1tab = alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
                 if ( !l1tab )
                 {
                     rc = -ENOMEM;
@@ -5949,7 +5948,7 @@ int create_perdomain_mapping(struct doma
             }
             else
             {
-                pg = alloc_domheap_page(NULL, memf);
+                pg = alloc_domheap_page(d, MEMF_no_owner);
                 if ( !pg )
                 {
                     rc = -ENOMEM;
@@ -5966,7 +5965,7 @@ int create_perdomain_mapping(struct doma
         if ( ppg &&
              !(l1e_get_flags(l1tab[l1_table_offset(va)]) & _PAGE_PRESENT) )
         {
-            pg = alloc_domheap_page(NULL, memf);
+            pg = alloc_domheap_page(d, MEMF_no_owner);
             if ( pg )
             {
                 clear_domain_page(page_to_mfn(pg));
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -331,7 +331,7 @@ hap_set_allocation(struct domain *d, uns
         if ( d->arch.paging.hap.total_pages < pages )
         {
             /* Need to allocate more memory from domheap */
-            pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+            pg = alloc_domheap_page(d, MEMF_no_owner);
             if ( pg == NULL )
             {
                 HAP_PRINTK("failed to allocate hap pages.\n");
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1677,7 +1677,7 @@ static unsigned int sh_set_allocation(st
         {
             /* Need to allocate more memory from domheap */
             sp = (struct page_info *)
-                alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+                alloc_domheap_page(d, MEMF_no_owner);
             if ( sp == NULL )
             {
                 SHADOW_PRINTK("failed to allocate shadow pages.\n");



[-- Attachment #2: x86-alloc-for-domain.patch --]
[-- Type: text/plain, Size: 5706 bytes --]

x86: widen NUMA nodes to be allocated from

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -285,7 +285,8 @@ struct vcpu_guest_context *alloc_vcpu_gu
 
     for ( i = 0; i < PFN_UP(sizeof(struct vcpu_guest_context)); ++i )
     {
-        struct page_info *pg = alloc_domheap_page(NULL, 0);
+        struct page_info *pg = alloc_domheap_page(current->domain,
+                                                  MEMF_no_owner);
 
         if ( unlikely(pg == NULL) )
         {
@@ -322,7 +323,7 @@ static int setup_compat_l4(struct vcpu *
     l4_pgentry_t *l4tab;
     int rc;
 
-    pg = alloc_domheap_page(NULL, MEMF_node(vcpu_to_node(v)));
+    pg = alloc_domheap_page(v->domain, MEMF_no_owner);
     if ( pg == NULL )
         return -ENOMEM;
 
--- a/xen/arch/x86/domain_build.c
+++ b/xen/arch/x86/domain_build.c
@@ -1182,7 +1182,7 @@ int __init construct_dom0(
     }
     else
     {
-        page = alloc_domheap_page(NULL, 0);
+        page = alloc_domheap_page(d, MEMF_no_owner);
         if ( !page )
             panic("Not enough RAM for domain 0 PML4");
         page->u.inuse.type_info = PGT_l4_page_table|PGT_validated|1;
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -150,7 +150,7 @@ long arch_do_domctl(
                 break;
             }
 
-            page = alloc_domheap_page(NULL, 0);
+            page = alloc_domheap_page(current->domain, MEMF_no_owner);
             if ( !page )
             {
                 ret = -ENOMEM;
--- a/xen/arch/x86/hvm/stdvga.c
+++ b/xen/arch/x86/hvm/stdvga.c
@@ -604,7 +604,7 @@ void stdvga_init(struct domain *d)
     
     for ( i = 0; i != ARRAY_SIZE(s->vram_page); i++ )
     {
-        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( pg == NULL )
             break;
         s->vram_page[i] = pg;
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1417,7 +1417,6 @@ HVM_REGISTER_SAVE_RESTORE(LAPIC_REGS, la
 int vlapic_init(struct vcpu *v)
 {
     struct vlapic *vlapic = vcpu_vlapic(v);
-    unsigned int memflags = MEMF_node(vcpu_to_node(v));
 
     HVM_DBG_LOG(DBG_LEVEL_VLAPIC, "%d", v->vcpu_id);
 
@@ -1431,7 +1430,7 @@ int vlapic_init(struct vcpu *v)
 
     if (vlapic->regs_page == NULL)
     {
-        vlapic->regs_page = alloc_domheap_page(NULL, memflags);
+        vlapic->regs_page = alloc_domheap_page(v->domain, MEMF_no_owner);
         if ( vlapic->regs_page == NULL )
         {
             dprintk(XENLOG_ERR, "alloc vlapic regs error: %d/%d\n",
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5879,7 +5879,6 @@ int create_perdomain_mapping(struct doma
     l3_pgentry_t *l3tab;
     l2_pgentry_t *l2tab;
     l1_pgentry_t *l1tab;
-    unsigned int memf = MEMF_node(domain_to_node(d));
     int rc = 0;
 
     ASSERT(va >= PERDOMAIN_VIRT_START &&
@@ -5887,7 +5886,7 @@ int create_perdomain_mapping(struct doma
 
     if ( !d->arch.perdomain_l3_pg )
     {
-        pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( !pg )
             return -ENOMEM;
         l3tab = __map_domain_page(pg);
@@ -5908,7 +5907,7 @@ int create_perdomain_mapping(struct doma
 
     if ( !(l3e_get_flags(l3tab[l3_table_offset(va)]) & _PAGE_PRESENT) )
     {
-        pg = alloc_domheap_page(NULL, memf);
+        pg = alloc_domheap_page(d, MEMF_no_owner);
         if ( !pg )
         {
             unmap_domain_page(l3tab);
@@ -5937,7 +5936,7 @@ int create_perdomain_mapping(struct doma
         {
             if ( pl1tab && !IS_NIL(pl1tab) )
             {
-                l1tab = alloc_xenheap_pages(0, memf);
+                l1tab = alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
                 if ( !l1tab )
                 {
                     rc = -ENOMEM;
@@ -5949,7 +5948,7 @@ int create_perdomain_mapping(struct doma
             }
             else
             {
-                pg = alloc_domheap_page(NULL, memf);
+                pg = alloc_domheap_page(d, MEMF_no_owner);
                 if ( !pg )
                 {
                     rc = -ENOMEM;
@@ -5966,7 +5965,7 @@ int create_perdomain_mapping(struct doma
         if ( ppg &&
              !(l1e_get_flags(l1tab[l1_table_offset(va)]) & _PAGE_PRESENT) )
         {
-            pg = alloc_domheap_page(NULL, memf);
+            pg = alloc_domheap_page(d, MEMF_no_owner);
             if ( pg )
             {
                 clear_domain_page(page_to_mfn(pg));
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -331,7 +331,7 @@ hap_set_allocation(struct domain *d, uns
         if ( d->arch.paging.hap.total_pages < pages )
         {
             /* Need to allocate more memory from domheap */
-            pg = alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+            pg = alloc_domheap_page(d, MEMF_no_owner);
             if ( pg == NULL )
             {
                 HAP_PRINTK("failed to allocate hap pages.\n");
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1677,7 +1677,7 @@ static unsigned int sh_set_allocation(st
         {
             /* Need to allocate more memory from domheap */
             sp = (struct page_info *)
-                alloc_domheap_page(NULL, MEMF_node(domain_to_node(d)));
+                alloc_domheap_page(d, MEMF_no_owner);
             if ( sp == NULL )
             {
                 SHADOW_PRINTK("failed to allocate shadow pages.\n");

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-26 13:54 ` [PATCH 3/5] x86: widen NUMA nodes to be allocated from Jan Beulich
@ 2015-02-27 13:27   ` Dario Faggioli
  2015-02-27 13:36     ` Jan Beulich
                       ` (2 more replies)
  2015-03-05 16:39   ` Andrew Cooper
  1 sibling, 3 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 13:27 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel@lists.xenproject.org

[-- Attachment #1.1: Type: text/plain, Size: 1090 bytes --]

On Thu, 2015-02-26 at 13:54 +0000, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> 
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>

One question (a genuine one, i.e., I'm really not sure what I'm saying
is correct).

After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
left with only one use, in xen/arch/arm/domain.c, besides of course
being used to implement domain_to_node() (still in
xen/include/xen/numa.h).

So, provided ARM people (and I'm Cc-ing them) can get rid of that, can
that macro be removed all together, and domain_to_node(d) be defined
after d->node_affinity... something like:

 #define domain_to_node(d) \
   ( nodes_equal(d->node_affinity, NODE_MASK_ALL) \
     ? NUMA_NO_NODE : first_node(d->node_affinity) ) 

I'm asking because I really don't like vcpu_to_node(). And I'm not
talking about how it is implemented (there probably are not much
alternatives), I'm saying I don't think it should exist, and I really
would see value in killing it. :-)

Thoughts?

Thanks and Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:27   ` Dario Faggioli
@ 2015-02-27 13:36     ` Jan Beulich
  2015-02-27 14:11       ` Dario Faggioli
  2015-02-27 13:38     ` Julien Grall
  2015-02-27 13:46     ` Ian Campbell
  2 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-02-27 13:36 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), Ian Campbell, AndrewCooper, Julien Grall,
	Stefano Stabellini, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 14:27, <dario.faggioli@citrix.com> wrote:
> One question (a genuine one, i.e., I'm really not sure what I'm saying
> is correct).
> 
> After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
> left with only one use, in xen/arch/arm/domain.c, besides of course
> being used to implement domain_to_node() (still in
> xen/include/xen/numa.h).
> 
> So, provided ARM people (and I'm Cc-ing them) can get rid of that, can
> that macro be removed all together, and domain_to_node(d) be defined
> after d->node_affinity... something like:
> 
>  #define domain_to_node(d) \
>    ( nodes_equal(d->node_affinity, NODE_MASK_ALL) \
>      ? NUMA_NO_NODE : first_node(d->node_affinity) ) 
> 
> I'm asking because I really don't like vcpu_to_node(). And I'm not
> talking about how it is implemented (there probably are not much
> alternatives), I'm saying I don't think it should exist, and I really
> would see value in killing it. :-)

I'm all for killing it. In fact I'd also like to see domain_to_node()
go away, as it's similarly bogus (no matter of the proposed
changed implementation) - neither a vCPU nor a domain have
a "focus" node or some such (some may happen to if their node
mask has just a single set bit, but that's nothing code should
depend on). (And btw, at the very least first_node() in your
proposal should become any_node().)

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:36     ` Jan Beulich
@ 2015-02-27 14:11       ` Dario Faggioli
  0 siblings, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 14:11 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel@lists.xenproject.org


[-- Attachment #1.1: Type: text/plain, Size: 1270 bytes --]

On Fri, 2015-02-27 at 13:36 +0000, Jan Beulich wrote:
> >>> On 27.02.15 at 14:27, <dario.faggioli@citrix.com> wrote:

> > I'm asking because I really don't like vcpu_to_node(). And I'm not
> > talking about how it is implemented (there probably are not much
> > alternatives), I'm saying I don't think it should exist, and I really
> > would see value in killing it. :-)
> 
> I'm all for killing it. In fact I'd also like to see domain_to_node()
> go away, as it's similarly bogus (no matter of the proposed
> changed implementation) - neither a vCPU nor a domain have
> a "focus" node or some such (some may happen to if their node
> mask has just a single set bit, but that's nothing code should
> depend on). 
>
I totally agree. I didn't go as far as far as suggesting that because,
if my grep-ing is not failing, it's still in use in two more places,
even with your series applied.

But yes, we really should make it possible to remove it too.

> (And btw, at the very least first_node() in your
> proposal should become any_node().)
> 
Except, there is no such function. But again, I agree, and if we get to
the point where we can kill vcpu_to_node() but need to keep
domain_to_node, we can of course implement it. :-)

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:27   ` Dario Faggioli
  2015-02-27 13:36     ` Jan Beulich
@ 2015-02-27 13:38     ` Julien Grall
  2015-02-27 13:55       ` Dario Faggioli
  2015-02-27 13:58       ` Jan Beulich
  2015-02-27 13:46     ` Ian Campbell
  2 siblings, 2 replies; 51+ messages in thread
From: Julien Grall @ 2015-02-27 13:38 UTC (permalink / raw)
  To: Dario Faggioli, JBeulich@suse.com
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Julien Grall,
	Stefano Stabellini, xen-devel@lists.xenproject.org

Hi Dario,

On 27/02/15 13:27, Dario Faggioli wrote:
> On Thu, 2015-02-26 at 13:54 +0000, Jan Beulich wrote:
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>>
> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> One question (a genuine one, i.e., I'm really not sure what I'm saying
> is correct).
> 
> After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
> left with only one use, in xen/arch/arm/domain.c, besides of course
> being used to implement domain_to_node() (still in
> xen/include/xen/numa.h).
> 
> So, provided ARM people (and I'm Cc-ing them) can get rid of that, can
> that macro be removed all together, and domain_to_node(d) be defined
> after d->node_affinity... something like:
> 
>  #define domain_to_node(d) \
>    ( nodes_equal(d->node_affinity, NODE_MASK_ALL) \
>      ? NUMA_NO_NODE : first_node(d->node_affinity) ) 
> 
> I'm asking because I really don't like vcpu_to_node(). And I'm not
> talking about how it is implemented (there probably are not much
> alternatives), I'm saying I don't think it should exist, and I really
> would see value in killing it. :-)
> 
> Thoughts?

Given the changes made by Jan on x86, I think we could replace
vcpu_to_node by MEMF_no_owner.

FWIW, we don't have any NUMA support on ARM currently.

Regards,


-- 
Julien Grall

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:38     ` Julien Grall
@ 2015-02-27 13:55       ` Dario Faggioli
  2015-02-27 13:58       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 13:55 UTC (permalink / raw)
  To: julien.grall@linaro.org
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Julien Grall,
	Stefano Stabellini, JBeulich@suse.com,
	xen-devel@lists.xenproject.org


[-- Attachment #1.1: Type: text/plain, Size: 991 bytes --]

On Fri, 2015-02-27 at 13:38 +0000, Julien Grall wrote:

> >> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> >>
> > Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> > 
> > One question (a genuine one, i.e., I'm really not sure what I'm saying
> > is correct).
> > 
> > After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
> > left with only one use, in xen/arch/arm/domain.c, besides of course
> > being used to implement domain_to_node() (still in
> > xen/include/xen/numa.h).
> > 
> > So, provided ARM people (and I'm Cc-ing them) can get rid of that, can
> > that macro be removed all together, and domain_to_node(d) be defined
> > after d->node_affinity... something like:

> Given the changes made by Jan on x86, I think we could replace
> vcpu_to_node by MEMF_no_owner.
> 
I expected this to be the case. Happy to hear it is! :-)

> FWIW, we don't have any NUMA support on ARM currently.
> 
I know.

Thanks and Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:38     ` Julien Grall
  2015-02-27 13:55       ` Dario Faggioli
@ 2015-02-27 13:58       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-27 13:58 UTC (permalink / raw)
  To: Julien Grall
  Cc: Keir (Xen.org), Ian Campbell, Andrew Cooper, Dario Faggioli,
	Julien Grall, Stefano Stabellini, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 14:38, <julien.grall@linaro.org> wrote:
> Given the changes made by Jan on x86, I think we could replace
> vcpu_to_node by MEMF_no_owner.

That would imply you switch from alloc_xenheap_pages() to
alloc_homheap_pages(). Yet if that's safe to do, I don't see
why it's not already the latter.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:27   ` Dario Faggioli
  2015-02-27 13:36     ` Jan Beulich
  2015-02-27 13:38     ` Julien Grall
@ 2015-02-27 13:46     ` Ian Campbell
  2015-02-27 14:00       ` Dario Faggioli
  2015-02-27 14:03       ` Jan Beulich
  2 siblings, 2 replies; 51+ messages in thread
From: Ian Campbell @ 2015-02-27 13:46 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Keir (Xen.org), Andrew Cooper, Julien Grall, Stefano Stabellini,
	JBeulich@suse.com, xen-devel@lists.xenproject.org

On Fri, 2015-02-27 at 13:27 +0000, Dario Faggioli wrote:
> On Thu, 2015-02-26 at 13:54 +0000, Jan Beulich wrote:
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > 
> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> One question (a genuine one, i.e., I'm really not sure what I'm saying
> is correct).
> 
> After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
> left with only one use, in xen/arch/arm/domain.c, besides of course
> being used to implement domain_to_node() (still in
> xen/include/xen/numa.h).
> 
> So, provided ARM people (and I'm Cc-ing them) can get rid of that,

Happy to do so if you have advise on what to replace it with, just 0?

We don't do NUMA yet on ARM so that would be fine, but eventually we'd
want the vcpu stack to be allocated in some sort of "sensible relative
to vcpu affinity" location...

>  can
> that macro be removed all together, and domain_to_node(d) be defined
> after d->node_affinity... something like:
> 
>  #define domain_to_node(d) \
>    ( nodes_equal(d->node_affinity, NODE_MASK_ALL) \
>      ? NUMA_NO_NODE : first_node(d->node_affinity) ) 
> 
> I'm asking because I really don't like vcpu_to_node(). And I'm not
> talking about how it is implemented (there probably are not much
> alternatives), I'm saying I don't think it should exist, and I really
> would see value in killing it. :-)
> 
> Thoughts?
> 
> Thanks and Regards,
> Dario

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:46     ` Ian Campbell
@ 2015-02-27 14:00       ` Dario Faggioli
  2015-02-27 14:03       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 14:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), Andrew Cooper, Julien Grall, Stefano Stabellini,
	JBeulich@suse.com, xen-devel@lists.xenproject.org


[-- Attachment #1.1: Type: text/plain, Size: 914 bytes --]

On Fri, 2015-02-27 at 13:46 +0000, Ian Campbell wrote:
> On Fri, 2015-02-27 at 13:27 +0000, Dario Faggioli wrote:

> > After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
> > left with only one use, in xen/arch/arm/domain.c, besides of course
> > being used to implement domain_to_node() (still in
> > xen/include/xen/numa.h).
> > 
> > So, provided ARM people (and I'm Cc-ing them) can get rid of that,
> 
> Happy to do so if you have advise on what to replace it with, just 0?
> 
As Julien says, with the MEMF_no_owner feature Jan is introducing in the
series.

> We don't do NUMA yet on ARM so that would be fine, but eventually we'd
> want the vcpu stack to be allocated in some sort of "sensible relative
> to vcpu affinity" location...
> 
Yes, and Jan's MEMF_no_owner, if it works on your arch too, as it seems
it could, will provide exactly that.

Regards,
Dario

[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-27 13:46     ` Ian Campbell
  2015-02-27 14:00       ` Dario Faggioli
@ 2015-02-27 14:03       ` Jan Beulich
  1 sibling, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-02-27 14:03 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), AndrewCooper, Dario Faggioli, Julien Grall,
	Stefano Stabellini, xen-devel@lists.xenproject.org

>>> On 27.02.15 at 14:46, <ian.campbell@citrix.com> wrote:
> On Fri, 2015-02-27 at 13:27 +0000, Dario Faggioli wrote:
>> On Thu, 2015-02-26 at 13:54 +0000, Jan Beulich wrote:
>> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
>> > 
>> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
>> 
>> One question (a genuine one, i.e., I'm really not sure what I'm saying
>> is correct).
>> 
>> After this series, vcpu_to_node() (defined in xen/include/xen/numa.h) is
>> left with only one use, in xen/arch/arm/domain.c, besides of course
>> being used to implement domain_to_node() (still in
>> xen/include/xen/numa.h).
>> 
>> So, provided ARM people (and I'm Cc-ing them) can get rid of that,
> 
> Happy to do so if you have advise on what to replace it with, just 0?
> 
> We don't do NUMA yet on ARM so that would be fine, but eventually we'd
> want the vcpu stack to be allocated in some sort of "sensible relative
> to vcpu affinity" location...

That would require vNUMA information to be set by the tool stack
already by that time. What we have currently (using the node of
the CPU the vCPU starts on) is pretty meaningless unless the vCPU
is meant to only ever run on a single node.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/5] x86: widen NUMA nodes to be allocated from
  2015-02-26 13:54 ` [PATCH 3/5] x86: widen NUMA nodes to be allocated from Jan Beulich
  2015-02-27 13:27   ` Dario Faggioli
@ 2015-03-05 16:39   ` Andrew Cooper
  1 sibling, 0 replies; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 16:39 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Dario Faggioli, Keir Fraser

On 26/02/15 13:54, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/5] VT-d: widen NUMA nodes to be allocated from
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
                   ` (2 preceding siblings ...)
  2015-02-26 13:54 ` [PATCH 3/5] x86: widen NUMA nodes to be allocated from Jan Beulich
@ 2015-02-26 13:55 ` Jan Beulich
  2015-03-05 17:08   ` Andrew Cooper
  2015-02-26 13:56 ` [PATCH 5/5] AMD IOMMU: " Jan Beulich
  2015-02-27 10:04 ` [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Dario Faggioli
  5 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:55 UTC (permalink / raw)
  To: xen-devel; +Cc: Yang Z Zhang, Dario Faggioli, Kevin Tian

[-- Attachment #1: Type: text/plain, Size: 5297 bytes --]

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -67,7 +67,8 @@ unsigned int get_cache_line_size(void);
 void cacheline_flush(char *);
 void flush_all_cache(void);
 
-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages);
+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *, unsigned int npages,
+                        struct domain *);
 void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -739,7 +739,8 @@ int enable_intremap(struct iommu *iommu,
     if ( ir_ctrl->iremap_maddr == 0 )
     {
         drhd = iommu_to_drhd(iommu);
-        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR);
+        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR,
+                                                    NULL);
         if ( ir_ctrl->iremap_maddr == 0 )
         {
             dprintk(XENLOG_WARNING VTDPREFIX,
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -185,7 +185,8 @@ void iommu_flush_cache_page(void *addr, 
 }
 
 /* Allocate page table, return its machine address */
-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages)
+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned int npages,
+                        struct domain *d)
 {
     struct acpi_rhsa_unit *rhsa;
     struct page_info *pg, *cur_pg;
@@ -195,10 +196,10 @@ u64 alloc_pgtable_maddr(struct acpi_drhd
 
     rhsa = drhd_to_rhsa(drhd);
     if ( rhsa )
-        node =  pxm_to_node(rhsa->proximity_domain);
+        node = pxm_to_node(rhsa->proximity_domain);
 
-    pg = alloc_domheap_pages(NULL, get_order_from_pages(npages),
-                             (node == NUMA_NO_NODE) ? 0 : MEMF_node(node));
+    pg = alloc_domheap_pages(d, get_order_from_pages(npages),
+                             MEMF_node(node) | MEMF_no_owner);
     if ( !pg )
         return 0;
 
@@ -223,7 +224,7 @@ void free_pgtable_maddr(u64 maddr)
 }
 
 /* context entry handling */
-static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus)
+static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus, struct domain *d)
 {
     struct acpi_drhd_unit *drhd;
     struct root_entry *root, *root_entries;
@@ -235,7 +236,7 @@ static u64 bus_to_context_maddr(struct i
     if ( !root_present(*root) )
     {
         drhd = iommu_to_drhd(iommu);
-        maddr = alloc_pgtable_maddr(drhd, 1);
+        maddr = alloc_pgtable_maddr(drhd, 1, d);
         if ( maddr == 0 )
         {
             unmap_vtd_domain_page(root_entries);
@@ -271,7 +272,9 @@ static u64 addr_to_dma_page_maddr(struct
          */
         pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
         drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( !alloc || ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1)) == 0) )
+        if ( !alloc ||
+             ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1,
+                                                        domain)) == 0) )
             goto out;
     }
 
@@ -289,7 +292,7 @@ static u64 addr_to_dma_page_maddr(struct
 
             pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
             drhd = acpi_find_matched_drhd_unit(pdev);
-            pte_maddr = alloc_pgtable_maddr(drhd, 1);
+            pte_maddr = alloc_pgtable_maddr(drhd, 1, domain);
             if ( !pte_maddr )
                 break;
 
@@ -1119,7 +1122,7 @@ int __init iommu_alloc(struct acpi_drhd_
     iommu->intel->drhd = drhd;
     drhd->iommu = iommu;
 
-    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1)) )
+    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1, NULL)) )
         return -ENOMEM;
 
     iommu->reg = ioremap(drhd->address, PAGE_SIZE);
@@ -1270,7 +1273,7 @@ int domain_context_mapping_one(
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
-    maddr = bus_to_context_maddr(iommu, bus);
+    maddr = bus_to_context_maddr(iommu, bus, domain);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
 
@@ -1495,7 +1498,7 @@ int domain_context_unmap_one(
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
 
-    maddr = bus_to_context_maddr(iommu, bus);
+    maddr = bus_to_context_maddr(iommu, bus, domain);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
 
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -374,7 +374,8 @@ int enable_qinval(struct iommu *iommu)
     if ( qi_ctrl->qinval_maddr == 0 )
     {
         drhd = iommu_to_drhd(iommu);
-        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR);
+        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR,
+                                                    NULL);
         if ( qi_ctrl->qinval_maddr == 0 )
         {
             dprintk(XENLOG_WARNING VTDPREFIX,



[-- Attachment #2: VT-d-alloc-for-domain.patch --]
[-- Type: text/plain, Size: 5340 bytes --]

VT-d: widen NUMA nodes to be allocated from

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -67,7 +67,8 @@ unsigned int get_cache_line_size(void);
 void cacheline_flush(char *);
 void flush_all_cache(void);
 
-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages);
+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *, unsigned int npages,
+                        struct domain *);
 void free_pgtable_maddr(u64 maddr);
 void *map_vtd_domain_page(u64 maddr);
 void unmap_vtd_domain_page(void *va);
--- a/xen/drivers/passthrough/vtd/intremap.c
+++ b/xen/drivers/passthrough/vtd/intremap.c
@@ -739,7 +739,8 @@ int enable_intremap(struct iommu *iommu,
     if ( ir_ctrl->iremap_maddr == 0 )
     {
         drhd = iommu_to_drhd(iommu);
-        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR);
+        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR,
+                                                    NULL);
         if ( ir_ctrl->iremap_maddr == 0 )
         {
             dprintk(XENLOG_WARNING VTDPREFIX,
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -185,7 +185,8 @@ void iommu_flush_cache_page(void *addr, 
 }
 
 /* Allocate page table, return its machine address */
-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages)
+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned int npages,
+                        struct domain *d)
 {
     struct acpi_rhsa_unit *rhsa;
     struct page_info *pg, *cur_pg;
@@ -195,10 +196,10 @@ u64 alloc_pgtable_maddr(struct acpi_drhd
 
     rhsa = drhd_to_rhsa(drhd);
     if ( rhsa )
-        node =  pxm_to_node(rhsa->proximity_domain);
+        node = pxm_to_node(rhsa->proximity_domain);
 
-    pg = alloc_domheap_pages(NULL, get_order_from_pages(npages),
-                             (node == NUMA_NO_NODE) ? 0 : MEMF_node(node));
+    pg = alloc_domheap_pages(d, get_order_from_pages(npages),
+                             MEMF_node(node) | MEMF_no_owner);
     if ( !pg )
         return 0;
 
@@ -223,7 +224,7 @@ void free_pgtable_maddr(u64 maddr)
 }
 
 /* context entry handling */
-static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus)
+static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus, struct domain *d)
 {
     struct acpi_drhd_unit *drhd;
     struct root_entry *root, *root_entries;
@@ -235,7 +236,7 @@ static u64 bus_to_context_maddr(struct i
     if ( !root_present(*root) )
     {
         drhd = iommu_to_drhd(iommu);
-        maddr = alloc_pgtable_maddr(drhd, 1);
+        maddr = alloc_pgtable_maddr(drhd, 1, d);
         if ( maddr == 0 )
         {
             unmap_vtd_domain_page(root_entries);
@@ -271,7 +272,9 @@ static u64 addr_to_dma_page_maddr(struct
          */
         pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
         drhd = acpi_find_matched_drhd_unit(pdev);
-        if ( !alloc || ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1)) == 0) )
+        if ( !alloc ||
+             ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1,
+                                                        domain)) == 0) )
             goto out;
     }
 
@@ -289,7 +292,7 @@ static u64 addr_to_dma_page_maddr(struct
 
             pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
             drhd = acpi_find_matched_drhd_unit(pdev);
-            pte_maddr = alloc_pgtable_maddr(drhd, 1);
+            pte_maddr = alloc_pgtable_maddr(drhd, 1, domain);
             if ( !pte_maddr )
                 break;
 
@@ -1119,7 +1122,7 @@ int __init iommu_alloc(struct acpi_drhd_
     iommu->intel->drhd = drhd;
     drhd->iommu = iommu;
 
-    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1)) )
+    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1, NULL)) )
         return -ENOMEM;
 
     iommu->reg = ioremap(drhd->address, PAGE_SIZE);
@@ -1270,7 +1273,7 @@ int domain_context_mapping_one(
 
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
-    maddr = bus_to_context_maddr(iommu, bus);
+    maddr = bus_to_context_maddr(iommu, bus, domain);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
 
@@ -1495,7 +1498,7 @@ int domain_context_unmap_one(
     ASSERT(spin_is_locked(&pcidevs_lock));
     spin_lock(&iommu->lock);
 
-    maddr = bus_to_context_maddr(iommu, bus);
+    maddr = bus_to_context_maddr(iommu, bus, domain);
     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
     context = &context_entries[devfn];
 
--- a/xen/drivers/passthrough/vtd/qinval.c
+++ b/xen/drivers/passthrough/vtd/qinval.c
@@ -374,7 +374,8 @@ int enable_qinval(struct iommu *iommu)
     if ( qi_ctrl->qinval_maddr == 0 )
     {
         drhd = iommu_to_drhd(iommu);
-        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR);
+        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR,
+                                                    NULL);
         if ( qi_ctrl->qinval_maddr == 0 )
         {
             dprintk(XENLOG_WARNING VTDPREFIX,

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] VT-d: widen NUMA nodes to be allocated from
  2015-02-26 13:55 ` [PATCH 4/5] VT-d: " Jan Beulich
@ 2015-03-05 17:08   ` Andrew Cooper
  2015-03-09  3:07     ` Tian, Kevin
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 17:08 UTC (permalink / raw)
  To: Jan Beulich, xen-devel; +Cc: Yang Z Zhang, Dario Faggioli, Kevin Tian


[-- Attachment #1.1: Type: text/plain, Size: 6079 bytes --]

On 26/02/15 13:55, Jan Beulich wrote:
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

I believe that this change is incorrect.

>
> --- a/xen/drivers/passthrough/vtd/extern.h
> +++ b/xen/drivers/passthrough/vtd/extern.h
> @@ -67,7 +67,8 @@ unsigned int get_cache_line_size(void);
>  void cacheline_flush(char *);
>  void flush_all_cache(void);
>  
> -u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages);
> +u64 alloc_pgtable_maddr(struct acpi_drhd_unit *, unsigned int npages,
> +                        struct domain *);
>  void free_pgtable_maddr(u64 maddr);
>  void *map_vtd_domain_page(u64 maddr);
>  void unmap_vtd_domain_page(void *va);
> --- a/xen/drivers/passthrough/vtd/intremap.c
> +++ b/xen/drivers/passthrough/vtd/intremap.c
> @@ -739,7 +739,8 @@ int enable_intremap(struct iommu *iommu,
>      if ( ir_ctrl->iremap_maddr == 0 )
>      {
>          drhd = iommu_to_drhd(iommu);
> -        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR);
> +        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR,
> +                                                    NULL);
>          if ( ir_ctrl->iremap_maddr == 0 )
>          {
>              dprintk(XENLOG_WARNING VTDPREFIX,
> --- a/xen/drivers/passthrough/vtd/iommu.c
> +++ b/xen/drivers/passthrough/vtd/iommu.c
> @@ -185,7 +185,8 @@ void iommu_flush_cache_page(void *addr, 
>  }
>  
>  /* Allocate page table, return its machine address */
> -u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages)
> +u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned int npages,
> +                        struct domain *d)
>  {
>      struct acpi_rhsa_unit *rhsa;
>      struct page_info *pg, *cur_pg;
> @@ -195,10 +196,10 @@ u64 alloc_pgtable_maddr(struct acpi_drhd
>  
>      rhsa = drhd_to_rhsa(drhd);
>      if ( rhsa )
> -        node =  pxm_to_node(rhsa->proximity_domain);
> +        node = pxm_to_node(rhsa->proximity_domain);
>  
> -    pg = alloc_domheap_pages(NULL, get_order_from_pages(npages),
> -                             (node == NUMA_NO_NODE) ? 0 : MEMF_node(node));
> +    pg = alloc_domheap_pages(d, get_order_from_pages(npages),
> +                             MEMF_node(node) | MEMF_no_owner);
>      if ( !pg )
>          return 0;
>  
> @@ -223,7 +224,7 @@ void free_pgtable_maddr(u64 maddr)
>  }
>  
>  /* context entry handling */
> -static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus)
> +static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus, struct domain *d)
>  {
>      struct acpi_drhd_unit *drhd;
>      struct root_entry *root, *root_entries;
> @@ -235,7 +236,7 @@ static u64 bus_to_context_maddr(struct i
>      if ( !root_present(*root) )
>      {
>          drhd = iommu_to_drhd(iommu);
> -        maddr = alloc_pgtable_maddr(drhd, 1);
> +        maddr = alloc_pgtable_maddr(drhd, 1, d);
>          if ( maddr == 0 )
>          {
>              unmap_vtd_domain_page(root_entries);
> @@ -271,7 +272,9 @@ static u64 addr_to_dma_page_maddr(struct
>           */
>          pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
>          drhd = acpi_find_matched_drhd_unit(pdev);
> -        if ( !alloc || ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1)) == 0) )
> +        if ( !alloc ||
> +             ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1,
> +                                                        domain)) == 0) )

This allocation should be based on the proximity information of the
iommu, not of the requesting domain.

>              goto out;
>      }
>  
> @@ -289,7 +292,7 @@ static u64 addr_to_dma_page_maddr(struct
>  
>              pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);
>              drhd = acpi_find_matched_drhd_unit(pdev);
> -            pte_maddr = alloc_pgtable_maddr(drhd, 1);
> +            pte_maddr = alloc_pgtable_maddr(drhd, 1, domain);

As should this.

I don't see a need to extend the prototype of alloc_pgtable_maddr(). 
The drhd parameter contains all relevant information concerning proximity.

~Andrew

>              if ( !pte_maddr )
>                  break;
>  
> @@ -1119,7 +1122,7 @@ int __init iommu_alloc(struct acpi_drhd_
>      iommu->intel->drhd = drhd;
>      drhd->iommu = iommu;
>  
> -    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1)) )
> +    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1, NULL)) )
>          return -ENOMEM;
>  
>      iommu->reg = ioremap(drhd->address, PAGE_SIZE);
> @@ -1270,7 +1273,7 @@ int domain_context_mapping_one(
>  
>      ASSERT(spin_is_locked(&pcidevs_lock));
>      spin_lock(&iommu->lock);
> -    maddr = bus_to_context_maddr(iommu, bus);
> +    maddr = bus_to_context_maddr(iommu, bus, domain);
>      context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
>      context = &context_entries[devfn];
>  
> @@ -1495,7 +1498,7 @@ int domain_context_unmap_one(
>      ASSERT(spin_is_locked(&pcidevs_lock));
>      spin_lock(&iommu->lock);
>  
> -    maddr = bus_to_context_maddr(iommu, bus);
> +    maddr = bus_to_context_maddr(iommu, bus, domain);
>      context_entries = (struct context_entry *)map_vtd_domain_page(maddr);
>      context = &context_entries[devfn];
>  
> --- a/xen/drivers/passthrough/vtd/qinval.c
> +++ b/xen/drivers/passthrough/vtd/qinval.c
> @@ -374,7 +374,8 @@ int enable_qinval(struct iommu *iommu)
>      if ( qi_ctrl->qinval_maddr == 0 )
>      {
>          drhd = iommu_to_drhd(iommu);
> -        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR);
> +        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR,
> +                                                    NULL);
>          if ( qi_ctrl->qinval_maddr == 0 )
>          {
>              dprintk(XENLOG_WARNING VTDPREFIX,
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 7232 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/5] VT-d: widen NUMA nodes to be allocated from
  2015-03-05 17:08   ` Andrew Cooper
@ 2015-03-09  3:07     ` Tian, Kevin
  0 siblings, 0 replies; 51+ messages in thread
From: Tian, Kevin @ 2015-03-09  3:07 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, xen-devel; +Cc: Zhang, Yang Z, Dario Faggioli


[-- Attachment #1.1: Type: text/plain, Size: 6549 bytes --]

I think Andrew's comment is correct. IO NUMA is different and IOMMU structures should be allocated based on proximity of IOMMU itself.

Thanks
Kevin

From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
Sent: Friday, March 06, 2015 1:09 AM
To: Jan Beulich; xen-devel
Cc: Zhang, Yang Z; Dario Faggioli; Tian, Kevin
Subject: Re: [Xen-devel] [PATCH 4/5] VT-d: widen NUMA nodes to be allocated from

On 26/02/15 13:55, Jan Beulich wrote:

Signed-off-by: Jan Beulich <jbeulich@suse.com><mailto:jbeulich@suse.com>

I believe that this change is incorrect.







--- a/xen/drivers/passthrough/vtd/extern.h

+++ b/xen/drivers/passthrough/vtd/extern.h

@@ -67,7 +67,8 @@ unsigned int get_cache_line_size(void);

 void cacheline_flush(char *);

 void flush_all_cache(void);



-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages);

+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *, unsigned int npages,

+                        struct domain *);

 void free_pgtable_maddr(u64 maddr);

 void *map_vtd_domain_page(u64 maddr);

 void unmap_vtd_domain_page(void *va);

--- a/xen/drivers/passthrough/vtd/intremap.c

+++ b/xen/drivers/passthrough/vtd/intremap.c

@@ -739,7 +739,8 @@ int enable_intremap(struct iommu *iommu,

     if ( ir_ctrl->iremap_maddr == 0 )

     {

         drhd = iommu_to_drhd(iommu);

-        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR);

+        ir_ctrl->iremap_maddr = alloc_pgtable_maddr(drhd, IREMAP_ARCH_PAGE_NR,

+                                                    NULL);

         if ( ir_ctrl->iremap_maddr == 0 )

         {

             dprintk(XENLOG_WARNING VTDPREFIX,

--- a/xen/drivers/passthrough/vtd/iommu.c

+++ b/xen/drivers/passthrough/vtd/iommu.c

@@ -185,7 +185,8 @@ void iommu_flush_cache_page(void *addr,

 }



 /* Allocate page table, return its machine address */

-u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages)

+u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned int npages,

+                        struct domain *d)

 {

     struct acpi_rhsa_unit *rhsa;

     struct page_info *pg, *cur_pg;

@@ -195,10 +196,10 @@ u64 alloc_pgtable_maddr(struct acpi_drhd



     rhsa = drhd_to_rhsa(drhd);

     if ( rhsa )

-        node =  pxm_to_node(rhsa->proximity_domain);

+        node = pxm_to_node(rhsa->proximity_domain);



-    pg = alloc_domheap_pages(NULL, get_order_from_pages(npages),

-                             (node == NUMA_NO_NODE) ? 0 : MEMF_node(node));

+    pg = alloc_domheap_pages(d, get_order_from_pages(npages),

+                             MEMF_node(node) | MEMF_no_owner);

     if ( !pg )

         return 0;



@@ -223,7 +224,7 @@ void free_pgtable_maddr(u64 maddr)

 }



 /* context entry handling */

-static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus)

+static u64 bus_to_context_maddr(struct iommu *iommu, u8 bus, struct domain *d)

 {

     struct acpi_drhd_unit *drhd;

     struct root_entry *root, *root_entries;

@@ -235,7 +236,7 @@ static u64 bus_to_context_maddr(struct i

     if ( !root_present(*root) )

     {

         drhd = iommu_to_drhd(iommu);

-        maddr = alloc_pgtable_maddr(drhd, 1);

+        maddr = alloc_pgtable_maddr(drhd, 1, d);

         if ( maddr == 0 )

         {

             unmap_vtd_domain_page(root_entries);

@@ -271,7 +272,9 @@ static u64 addr_to_dma_page_maddr(struct

          */

         pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);

         drhd = acpi_find_matched_drhd_unit(pdev);

-        if ( !alloc || ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1)) == 0) )

+        if ( !alloc ||

+             ((hd->arch.pgd_maddr = alloc_pgtable_maddr(drhd, 1,

+                                                        domain)) == 0) )

This allocation should be based on the proximity information of the iommu, not of the requesting domain.





             goto out;

     }



@@ -289,7 +292,7 @@ static u64 addr_to_dma_page_maddr(struct



             pdev = pci_get_pdev_by_domain(domain, -1, -1, -1);

             drhd = acpi_find_matched_drhd_unit(pdev);

-            pte_maddr = alloc_pgtable_maddr(drhd, 1);

+            pte_maddr = alloc_pgtable_maddr(drhd, 1, domain);

As should this.

I don't see a need to extend the prototype of alloc_pgtable_maddr().  The drhd parameter contains all relevant information concerning proximity.

~Andrew





             if ( !pte_maddr )

                 break;



@@ -1119,7 +1122,7 @@ int __init iommu_alloc(struct acpi_drhd_

     iommu->intel->drhd = drhd;

     drhd->iommu = iommu;



-    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1)) )

+    if ( !(iommu->root_maddr = alloc_pgtable_maddr(drhd, 1, NULL)) )

         return -ENOMEM;



     iommu->reg = ioremap(drhd->address, PAGE_SIZE);

@@ -1270,7 +1273,7 @@ int domain_context_mapping_one(



     ASSERT(spin_is_locked(&pcidevs_lock));

     spin_lock(&iommu->lock);

-    maddr = bus_to_context_maddr(iommu, bus);

+    maddr = bus_to_context_maddr(iommu, bus, domain);

     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);

     context = &context_entries[devfn];



@@ -1495,7 +1498,7 @@ int domain_context_unmap_one(

     ASSERT(spin_is_locked(&pcidevs_lock));

     spin_lock(&iommu->lock);



-    maddr = bus_to_context_maddr(iommu, bus);

+    maddr = bus_to_context_maddr(iommu, bus, domain);

     context_entries = (struct context_entry *)map_vtd_domain_page(maddr);

     context = &context_entries[devfn];



--- a/xen/drivers/passthrough/vtd/qinval.c

+++ b/xen/drivers/passthrough/vtd/qinval.c

@@ -374,7 +374,8 @@ int enable_qinval(struct iommu *iommu)

     if ( qi_ctrl->qinval_maddr == 0 )

     {

         drhd = iommu_to_drhd(iommu);

-        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR);

+        qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR,

+                                                    NULL);

         if ( qi_ctrl->qinval_maddr == 0 )

         {

             dprintk(XENLOG_WARNING VTDPREFIX,








_______________________________________________

Xen-devel mailing list

Xen-devel@lists.xen.org<mailto:Xen-devel@lists.xen.org>

http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 20984 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
                   ` (3 preceding siblings ...)
  2015-02-26 13:55 ` [PATCH 4/5] VT-d: " Jan Beulich
@ 2015-02-26 13:56 ` Jan Beulich
  2015-03-05 17:30   ` Andrew Cooper
  2015-02-27 10:04 ` [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Dario Faggioli
  5 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-02-26 13:56 UTC (permalink / raw)
  To: xen-devel; +Cc: Dario Faggioli, Aravind Gopalakrishnan, suravee.suthikulpanit

[-- Attachment #1: Type: text/plain, Size: 3464 bytes --]

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -488,7 +488,7 @@ static int iommu_pde_from_gfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = alloc_amd_iommu_pgtable();
+            table = alloc_amd_iommu_pgtable(d);
             if ( table == NULL )
             {
                 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -516,7 +516,7 @@ static int iommu_pde_from_gfn(struct dom
         {
             if ( next_table_mfn == 0 )
             {
-                table = alloc_amd_iommu_pgtable();
+                table = alloc_amd_iommu_pgtable(d);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -567,7 +567,7 @@ static int update_paging_mode(struct dom
     {
         /* Allocate and install a new root table.
          * Only upper I/O page table grows, no need to fix next level bits */
-        new_root = alloc_amd_iommu_pgtable();
+        new_root = alloc_amd_iommu_pgtable(d);
         if ( new_root == NULL )
         {
             AMD_IOMMU_DEBUG("%s Cannot allocate I/O page table\n",
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -225,13 +225,15 @@ int __init amd_iov_detect(void)
     return scan_pci_devices();
 }
 
-static int allocate_domain_resources(struct hvm_iommu *hd)
+static int allocate_domain_resources(struct domain *d)
 {
+    struct hvm_iommu *hd = domain_hvm_iommu(d);
+
     /* allocate root table */
     spin_lock(&hd->arch.mapping_lock);
     if ( !hd->arch.root_table )
     {
-        hd->arch.root_table = alloc_amd_iommu_pgtable();
+        hd->arch.root_table = alloc_amd_iommu_pgtable(d);
         if ( !hd->arch.root_table )
         {
             spin_unlock(&hd->arch.mapping_lock);
@@ -263,7 +265,7 @@ static int amd_iommu_domain_init(struct 
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     /* allocate page directroy */
-    if ( allocate_domain_resources(hd) != 0 )
+    if ( allocate_domain_resources(d) != 0 )
     {
         if ( hd->arch.root_table )
             free_domheap_page(hd->arch.root_table);
@@ -383,7 +385,7 @@ static int reassign_device(struct domain
     /* IO page tables might be destroyed after pci-detach the last device
      * In this case, we have to re-allocate root table for next pci-attach.*/
     if ( t->arch.root_table == NULL )
-        allocate_domain_resources(t);
+        allocate_domain_resources(target);
 
     amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
     return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
 }
 
-static inline struct page_info* alloc_amd_iommu_pgtable(void)
+static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
 {
     struct page_info *pg;
     void *vaddr;
 
-    pg = alloc_domheap_page(NULL, 0);
+    pg = alloc_domheap_page(d, MEMF_no_owner);
     if ( pg == NULL )
         return 0;
     vaddr = __map_domain_page(pg);




[-- Attachment #2: AMD-IOMMU-alloc-for-domain.patch --]
[-- Type: text/plain, Size: 3510 bytes --]

AMD IOMMU: widen NUMA nodes to be allocated from

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/drivers/passthrough/amd/iommu_map.c
+++ b/xen/drivers/passthrough/amd/iommu_map.c
@@ -488,7 +488,7 @@ static int iommu_pde_from_gfn(struct dom
             mfn = next_table_mfn;
 
             /* allocate lower level page table */
-            table = alloc_amd_iommu_pgtable();
+            table = alloc_amd_iommu_pgtable(d);
             if ( table == NULL )
             {
                 AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -516,7 +516,7 @@ static int iommu_pde_from_gfn(struct dom
         {
             if ( next_table_mfn == 0 )
             {
-                table = alloc_amd_iommu_pgtable();
+                table = alloc_amd_iommu_pgtable(d);
                 if ( table == NULL )
                 {
                     AMD_IOMMU_DEBUG("Cannot allocate I/O page table\n");
@@ -567,7 +567,7 @@ static int update_paging_mode(struct dom
     {
         /* Allocate and install a new root table.
          * Only upper I/O page table grows, no need to fix next level bits */
-        new_root = alloc_amd_iommu_pgtable();
+        new_root = alloc_amd_iommu_pgtable(d);
         if ( new_root == NULL )
         {
             AMD_IOMMU_DEBUG("%s Cannot allocate I/O page table\n",
--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c
+++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c
@@ -225,13 +225,15 @@ int __init amd_iov_detect(void)
     return scan_pci_devices();
 }
 
-static int allocate_domain_resources(struct hvm_iommu *hd)
+static int allocate_domain_resources(struct domain *d)
 {
+    struct hvm_iommu *hd = domain_hvm_iommu(d);
+
     /* allocate root table */
     spin_lock(&hd->arch.mapping_lock);
     if ( !hd->arch.root_table )
     {
-        hd->arch.root_table = alloc_amd_iommu_pgtable();
+        hd->arch.root_table = alloc_amd_iommu_pgtable(d);
         if ( !hd->arch.root_table )
         {
             spin_unlock(&hd->arch.mapping_lock);
@@ -263,7 +265,7 @@ static int amd_iommu_domain_init(struct 
     struct hvm_iommu *hd = domain_hvm_iommu(d);
 
     /* allocate page directroy */
-    if ( allocate_domain_resources(hd) != 0 )
+    if ( allocate_domain_resources(d) != 0 )
     {
         if ( hd->arch.root_table )
             free_domheap_page(hd->arch.root_table);
@@ -383,7 +385,7 @@ static int reassign_device(struct domain
     /* IO page tables might be destroyed after pci-detach the last device
      * In this case, we have to re-allocate root table for next pci-attach.*/
     if ( t->arch.root_table == NULL )
-        allocate_domain_resources(t);
+        allocate_domain_resources(target);
 
     amd_iommu_setup_domain_device(target, iommu, devfn, pdev);
     AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n",
--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
+++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
@@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
     return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
 }
 
-static inline struct page_info* alloc_amd_iommu_pgtable(void)
+static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
 {
     struct page_info *pg;
     void *vaddr;
 
-    pg = alloc_domheap_page(NULL, 0);
+    pg = alloc_domheap_page(d, MEMF_no_owner);
     if ( pg == NULL )
         return 0;
     vaddr = __map_domain_page(pg);

[-- Attachment #3: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-02-26 13:56 ` [PATCH 5/5] AMD IOMMU: " Jan Beulich
@ 2015-03-05 17:30   ` Andrew Cooper
  2015-03-06  7:50     ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-05 17:30 UTC (permalink / raw)
  To: Jan Beulich, xen-devel
  Cc: Dario Faggioli, Aravind Gopalakrishnan, suravee.suthikulpanit


[-- Attachment #1.1: Type: text/plain, Size: 964 bytes --]

On 26/02/15 13:56, Jan Beulich wrote:
> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>      return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
>  }
>  
> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
> +static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
>  {
>      struct page_info *pg;
>      void *vaddr;
>  
> -    pg = alloc_domheap_page(NULL, 0);
> +    pg = alloc_domheap_page(d, MEMF_no_owner);

Same comment as with the VT-d side of things.  This should be based on
the proximity information of the IOMMU, not of the owning domain.

~Andrew

>      if ( pg == NULL )
>          return 0;
>      vaddr = __map_domain_page(pg);
>
>
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel


[-- Attachment #1.2: Type: text/html, Size: 1779 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-05 17:30   ` Andrew Cooper
@ 2015-03-06  7:50     ` Jan Beulich
  2015-03-06 12:15       ` Andrew Cooper
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-03-06  7:50 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan,
	suravee.suthikulpanit

>>> On 05.03.15 at 18:30, <andrew.cooper3@citrix.com> wrote:
> On 26/02/15 13:56, Jan Beulich wrote:
>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>      return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
>>  }
>>  
>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
>>  {
>>      struct page_info *pg;
>>      void *vaddr;
>>  
>> -    pg = alloc_domheap_page(NULL, 0);
>> +    pg = alloc_domheap_page(d, MEMF_no_owner);
> 
> Same comment as with the VT-d side of things.  This should be based on
> the proximity information of the IOMMU, not of the owning domain.

I think I buy this argument on the VT-d side (under the assumption
that there's going to be at least one IOMMU per node), but I'm not
sure here: The most modern AMD box I have has just a single
IOMMU for 4 nodes it reports.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-06  7:50     ` Jan Beulich
@ 2015-03-06 12:15       ` Andrew Cooper
  2015-03-09 15:42         ` Suravee Suthikulanit
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-06 12:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan,
	suravee.suthikulpanit

On 06/03/2015 07:50, Jan Beulich wrote:
>>>> On 05.03.15 at 18:30, <andrew.cooper3@citrix.com> wrote:
>> On 26/02/15 13:56, Jan Beulich wrote:
>>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>>      return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
>>>  }
>>>  
>>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
>>>  {
>>>      struct page_info *pg;
>>>      void *vaddr;
>>>  
>>> -    pg = alloc_domheap_page(NULL, 0);
>>> +    pg = alloc_domheap_page(d, MEMF_no_owner);
>> Same comment as with the VT-d side of things.  This should be based on
>> the proximity information of the IOMMU, not of the owning domain.
> I think I buy this argument on the VT-d side (under the assumption
> that there's going to be at least one IOMMU per node), but I'm not
> sure here: The most modern AMD box I have has just a single
> IOMMU for 4 nodes it reports.

It is not possible for an IOMMU to cover multiple NUMA nodes worth of
IO, because of the position it has to sit relative to the IO root ports
and QPI/HT links.

In AMD systems, the IOMMUs lives in the northbridges, meaning one per
numa node (as it is the northbridges which contain the hypertransport links)

The BIOS/firmware will only report IOMMUs from northbridges which have
IO connected to their IO hypertransport link (most systems in the wild
have all IO hanging off one or two Numa nodes).  On the other hand, I
have an AMD system with 8 IOMMUs in use.

In Intel systems, there is one IOMMU for each socket (to cover the
on-chip root ports and GPU if applicable) and one IOMMU in the IOH/PCH
(depending on generation) to cover the legacy IO.

In all cases, the IOMMUs are local to a single NUMA node, and would
benefit from having the control pages and pagetables allocated in local RAM.

~Andrew

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-06 12:15       ` Andrew Cooper
@ 2015-03-09 15:42         ` Suravee Suthikulanit
  2015-03-09 17:26           ` Andrew Cooper
  0 siblings, 1 reply; 51+ messages in thread
From: Suravee Suthikulanit @ 2015-03-09 15:42 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan

On 3/6/2015 6:15 AM, Andrew Cooper wrote:
> On 06/03/2015 07:50, Jan Beulich wrote:
>>>>> On 05.03.15 at 18:30, <andrew.cooper3@citrix.com> wrote:
>>> On 26/02/15 13:56, Jan Beulich wrote:
>>>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>>>       return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT;
>>>>   }
>>>>
>>>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>>>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct domain *d)
>>>>   {
>>>>       struct page_info *pg;
>>>>       void *vaddr;
>>>>
>>>> -    pg = alloc_domheap_page(NULL, 0);
>>>> +    pg = alloc_domheap_page(d, MEMF_no_owner);
>>> Same comment as with the VT-d side of things.  This should be based on
>>> the proximity information of the IOMMU, not of the owning domain.
>> I think I buy this argument on the VT-d side (under the assumption
>> that there's going to be at least one IOMMU per node), but I'm not
>> sure here: The most modern AMD box I have has just a single
>> IOMMU for 4 nodes it reports.
>
> It is not possible for an IOMMU to cover multiple NUMA nodes worth of
> IO, because of the position it has to sit relative to the IO root ports
> and QPI/HT links.
>
> In AMD systems, the IOMMUs lives in the northbridges, meaning one per
> numa node (as it is the northbridges which contain the hypertransport links)
>
> The BIOS/firmware will only report IOMMUs from northbridges which have
> IO connected to their IO hypertransport link (most systems in the wild
> have all IO hanging off one or two Numa nodes).  On the other hand, I
> have an AMD system with 8 IOMMUs in use.


Actually, a single IOMMU could handle multiple nodes. For example, in 
scenario of a multi-chip-module (MCM) setup, there could be at least 2-4 
nodes sharing one IOMMU depending on how the platform vendor configuring 
the system. In the server platforms, IOMMU is in AMD northbridge 
chipsets (e.g. SR56xx). This website has an example of such system 
configuration 
(http://www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html).

For AMD IOMMU, the IVRS table specifies the PCI bus/device ranges to be 
handled by each IOMMU. This is probably should be considered here.

>
> In Intel systems, there is one IOMMU for each socket (to cover the
> on-chip root ports and GPU if applicable) and one IOMMU in the IOH/PCH
> (depending on generation) to cover the legacy IO.
>
>
> In all cases, the IOMMUs are local to a single NUMA node, and would
> benefit from having the control pages and pagetables allocated in local RAM.
>

As state above, this is not the case for AMD IOMMU.

Thanks,

Suravee
> ~Andrew
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-09 15:42         ` Suravee Suthikulanit
@ 2015-03-09 17:26           ` Andrew Cooper
  2015-03-09 19:02             ` Suravee Suthikulanit
  0 siblings, 1 reply; 51+ messages in thread
From: Andrew Cooper @ 2015-03-09 17:26 UTC (permalink / raw)
  To: Suravee Suthikulanit, Jan Beulich
  Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan

On 09/03/15 15:42, Suravee Suthikulanit wrote:
> On 3/6/2015 6:15 AM, Andrew Cooper wrote:
>> On 06/03/2015 07:50, Jan Beulich wrote:
>>>>>> On 05.03.15 at 18:30, <andrew.cooper3@citrix.com> wrote:
>>>> On 26/02/15 13:56, Jan Beulich wrote:
>>>>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>>>>       return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >>
>>>>> PAGE_SHIFT;
>>>>>   }
>>>>>
>>>>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>>>>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct
>>>>> domain *d)
>>>>>   {
>>>>>       struct page_info *pg;
>>>>>       void *vaddr;
>>>>>
>>>>> -    pg = alloc_domheap_page(NULL, 0);
>>>>> +    pg = alloc_domheap_page(d, MEMF_no_owner);
>>>> Same comment as with the VT-d side of things.  This should be based on
>>>> the proximity information of the IOMMU, not of the owning domain.
>>> I think I buy this argument on the VT-d side (under the assumption
>>> that there's going to be at least one IOMMU per node), but I'm not
>>> sure here: The most modern AMD box I have has just a single
>>> IOMMU for 4 nodes it reports.
>>
>> It is not possible for an IOMMU to cover multiple NUMA nodes worth of
>> IO, because of the position it has to sit relative to the IO root ports
>> and QPI/HT links.
>>
>> In AMD systems, the IOMMUs lives in the northbridges, meaning one per
>> numa node (as it is the northbridges which contain the hypertransport
>> links)
>>
>> The BIOS/firmware will only report IOMMUs from northbridges which have
>> IO connected to their IO hypertransport link (most systems in the wild
>> have all IO hanging off one or two Numa nodes).  On the other hand, I
>> have an AMD system with 8 IOMMUs in use.
>
>
> Actually, a single IOMMU could handle multiple nodes. For example, in
> scenario of a multi-chip-module (MCM) setup, there could be at least
> 2-4 nodes sharing one IOMMU depending on how the platform vendor
> configuring the system. In the server platforms, IOMMU is in AMD
> northbridge chipsets (e.g. SR56xx). This website has an example of
> such system configuration
> (http://www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html).

Ok - I was basing my example on the last layout I had the manual for,
which I believe was Bulldozer.

However, my point still stands that there is an IOMMU between any IO and
RAM.  An individual IOMMU will always benefit from having its
iopagetables on the local numa node, rather than the numa node(s) which
the domain owning the device is running on.

>
> For AMD IOMMU, the IVRS table specifies the PCI bus/device ranges to
> be handled by each IOMMU. This is probably should be considered here.

Presumably a PCI transaction must never get onto the HT bus without
having already undergone translation, or there can be no guarantee that
it would be routed via the IOMMU?  Or are you saying that there are
cases where a transaction will enter the HT bus, route sideways to an
IOMMU, undergo translation, then route back onto the HT bus to the
target RAM/processor?

~Andrew

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-09 17:26           ` Andrew Cooper
@ 2015-03-09 19:02             ` Suravee Suthikulanit
  2015-03-10  7:35               ` Jan Beulich
  0 siblings, 1 reply; 51+ messages in thread
From: Suravee Suthikulanit @ 2015-03-09 19:02 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich
  Cc: xen-devel, Dario Faggioli, Aravind Gopalakrishnan

On 3/9/2015 12:26 PM, Andrew Cooper wrote:
> On 09/03/15 15:42, Suravee Suthikulanit wrote:
>> On 3/6/2015 6:15 AM, Andrew Cooper wrote:
>>> On 06/03/2015 07:50, Jan Beulich wrote:
>>>>>>> On 05.03.15 at 18:30, <andrew.cooper3@citrix.com> wrote:
>>>>> On 26/02/15 13:56, Jan Beulich wrote:
>>>>>> --- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>>> +++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h
>>>>>> @@ -158,12 +158,12 @@ static inline unsigned long region_to_pa
>>>>>>        return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >>
>>>>>> PAGE_SHIFT;
>>>>>>    }
>>>>>>
>>>>>> -static inline struct page_info* alloc_amd_iommu_pgtable(void)
>>>>>> +static inline struct page_info *alloc_amd_iommu_pgtable(struct
>>>>>> domain *d)
>>>>>>    {
>>>>>>        struct page_info *pg;
>>>>>>        void *vaddr;
>>>>>>
>>>>>> -    pg = alloc_domheap_page(NULL, 0);
>>>>>> +    pg = alloc_domheap_page(d, MEMF_no_owner);
>>>>> Same comment as with the VT-d side of things.  This should be based on
>>>>> the proximity information of the IOMMU, not of the owning domain.
>>>> I think I buy this argument on the VT-d side (under the assumption
>>>> that there's going to be at least one IOMMU per node), but I'm not
>>>> sure here: The most modern AMD box I have has just a single
>>>> IOMMU for 4 nodes it reports.
>>>
>>> It is not possible for an IOMMU to cover multiple NUMA nodes worth of
>>> IO, because of the position it has to sit relative to the IO root ports
>>> and QPI/HT links.
>>>
>>> In AMD systems, the IOMMUs lives in the northbridges, meaning one per
>>> numa node (as it is the northbridges which contain the hypertransport
>>> links)
>>>
>>> The BIOS/firmware will only report IOMMUs from northbridges which have
>>> IO connected to their IO hypertransport link (most systems in the wild
>>> have all IO hanging off one or two Numa nodes).  On the other hand, I
>>> have an AMD system with 8 IOMMUs in use.
>>
>>
>> Actually, a single IOMMU could handle multiple nodes. For example, in
>> scenario of a multi-chip-module (MCM) setup, there could be at least
>> 2-4 nodes sharing one IOMMU depending on how the platform vendor
>> configuring the system. In the server platforms, IOMMU is in AMD
>> northbridge chipsets (e.g. SR56xx). This website has an example of
>> such system configuration
>> (http://www.qdpma.com/systemarchitecture/SystemArchitecture_Opteron.html).
>
> Ok - I was basing my example on the last layout I had the manual for,
> which I believe was Bulldozer.
>
> However, my point still stands that there is an IOMMU between any IO and
> RAM.  An individual IOMMU will always benefit from having its
> iopagetables on the local numa node, rather than the numa node(s) which
> the domain owning the device is running on.
>

I agree that having the IO page tables on the NUMA node that is closest 
to the IOMMU would be beneficial.  However, I am not sure at the moment 
that this information could be easily determined. I think ACPI _PXM for 
devices should be able to provide this information, but this is optional 
and often not available.

>>
>> For AMD IOMMU, the IVRS table specifies the PCI bus/device ranges to
>> be handled by each IOMMU. This is probably should be considered here.
>
> Presumably a PCI transaction must never get onto the HT bus without
> having already undergone translation, or there can be no guarantee that
> it would be routed via the IOMMU?  Or are you saying that there are
> cases where a transaction will enter the HT bus, route sideways to an
> IOMMU, undergo translation, then route back onto the HT bus to the
> target RAM/processor?
>
> ~Andrew
>

IOMMU sits between PCI devices (downstream) and HT (uptream), all DMA 
transactions from downstream must go through IOMMU. On the other hand, 
the I/O page translation is handled by IOMMU, and it is a separate 
traffic than the downstream device DMA transactions.

Suravee

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-09 19:02             ` Suravee Suthikulanit
@ 2015-03-10  7:35               ` Jan Beulich
  2015-03-10 13:55                 ` Boris Ostrovsky
  0 siblings, 1 reply; 51+ messages in thread
From: Jan Beulich @ 2015-03-10  7:35 UTC (permalink / raw)
  To: Suravee Suthikulanit
  Cc: Andrew Cooper, Dario Faggioli, Aravind Gopalakrishnan, xen-devel

>>> On 09.03.15 at 20:02, <suravee.suthikulpanit@amd.com> wrote:
> I agree that having the IO page tables on the NUMA node that is closest 
> to the IOMMU would be beneficial.

And I already withdrew this patch and the corresponding VT-d one.

> However, I am not sure at the moment 
> that this information could be easily determined. I think ACPI _PXM for 
> devices should be able to provide this information, but this is optional 
> and often not available.

And even if it was available, it would be too late at least for Dom0's
allocations (as it requires Dom0's interpreter to dig out this detail).
The best we could do in that case would be to try to replace the
existing tables. Or assume Dom0 is being placed suitably by the
dom0_nodes= option. Or add yet another option.

Jan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/5] AMD IOMMU: widen NUMA nodes to be allocated from
  2015-03-10  7:35               ` Jan Beulich
@ 2015-03-10 13:55                 ` Boris Ostrovsky
  0 siblings, 0 replies; 51+ messages in thread
From: Boris Ostrovsky @ 2015-03-10 13:55 UTC (permalink / raw)
  To: Jan Beulich, Suravee Suthikulanit
  Cc: Andrew Cooper, Dario Faggioli, Aravind Gopalakrishnan, xen-devel

On 03/10/2015 03:35 AM, Jan Beulich wrote:
>>>> On 09.03.15 at 20:02, <suravee.suthikulpanit@amd.com> wrote:
>> I agree that having the IO page tables on the NUMA node that is closest
>> to the IOMMU would be beneficial.
> And I already withdrew this patch and the corresponding VT-d one.
>
>> However, I am not sure at the moment
>> that this information could be easily determined. I think ACPI _PXM for
>> devices should be able to provide this information, but this is optional
>> and often not available.
> And even if it was available, it would be too late at least for Dom0's
> allocations (as it requires Dom0's interpreter to dig out this detail).
> The best we could do in that case would be to try to replace the
> existing tables. Or assume Dom0 is being placed suitably by the
> dom0_nodes= option. Or add yet another option.

There is a nodeID register on each northbridge (D1<x>F0x60). You would 
have to figure out how to map it to _PXMs though.

-boris

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments
  2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
                   ` (4 preceding siblings ...)
  2015-02-26 13:56 ` [PATCH 5/5] AMD IOMMU: " Jan Beulich
@ 2015-02-27 10:04 ` Dario Faggioli
  5 siblings, 0 replies; 51+ messages in thread
From: Dario Faggioli @ 2015-02-27 10:04 UTC (permalink / raw)
  To: JBeulich@suse.com
  Cc: Wei Liu, Keir (Xen.org), Ian Campbell, Andrew Cooper,
	Tim (Xen.org), xen-devel@lists.xenproject.org, Ian Jackson


[-- Attachment #1.1: Type: text/plain, Size: 627 bytes --]

[adding Wei, as he may be interested, for his vNUMA work]

On Thu, 2015-02-26 at 13:44 +0000, Jan Beulich wrote:
> 1: x86: allow specifying the NUMA nodes Dom0 should run on
> 2: allow domain heap allocations to specify more than one NUMA node
> 3: x86: widen NUMA nodes to be allocated from
> 4: VT-d: widen NUMA nodes to be allocated from
> 5: AMD IOMMU: widen NUMA nodes to be allocated from
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> ---
> To apply cleanly his depends on "x86/Dom0: account for shadow/HAP allocation"
> (http://lists.xenproject.org/archives/html/xen-devel/2015-02/msg03111.html).


[-- Attachment #1.2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

[-- Attachment #2: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2015-03-10 13:57 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-02-26 13:44 [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Jan Beulich
2015-02-26 13:52 ` [PATCH 1/5] x86: allow specifying the NUMA nodes Dom0 should run on Jan Beulich
2015-02-26 17:14   ` Dario Faggioli
2015-02-27  8:46     ` Jan Beulich
2015-02-27 10:04       ` Dario Faggioli
2015-02-27 10:50         ` Jan Beulich
2015-02-27 14:54           ` Dario Faggioli
2015-02-27 15:04             ` Jan Beulich
2015-03-03 10:51             ` Jan Beulich
2015-03-04 10:18               ` Dario Faggioli
2015-03-06  9:11               ` Jan Beulich
2015-03-06 10:46                 ` Dario Faggioli
2015-03-06 11:33                   ` Dario Faggioli
2015-03-06 13:26                     ` Jan Beulich
2015-03-06 11:49                   ` Jan Beulich
2015-03-03  9:59   ` Ian Campbell
2015-03-05 16:11   ` Andrew Cooper
2015-03-05 16:43     ` Jan Beulich
2015-03-05 17:27       ` Andrew Cooper
2015-03-06  9:19         ` [PATCH 1/5 v2] " Jan Beulich
2015-03-06 10:41           ` Dario Faggioli
2015-03-06 16:05           ` Andrew Cooper
2015-02-26 13:53 ` [PATCH 2/5] allow domain heap allocations to specify more than one NUMA node Jan Beulich
2015-02-27 11:34   ` Dario Faggioli
2015-03-02 17:12   ` Ian Campbell
2015-03-03  7:59     ` Jan Beulich
2015-03-05 16:18   ` Andrew Cooper
2015-02-26 13:54 ` [PATCH 3/5] x86: widen NUMA nodes to be allocated from Jan Beulich
2015-02-27 13:27   ` Dario Faggioli
2015-02-27 13:36     ` Jan Beulich
2015-02-27 14:11       ` Dario Faggioli
2015-02-27 13:38     ` Julien Grall
2015-02-27 13:55       ` Dario Faggioli
2015-02-27 13:58       ` Jan Beulich
2015-02-27 13:46     ` Ian Campbell
2015-02-27 14:00       ` Dario Faggioli
2015-02-27 14:03       ` Jan Beulich
2015-03-05 16:39   ` Andrew Cooper
2015-02-26 13:55 ` [PATCH 4/5] VT-d: " Jan Beulich
2015-03-05 17:08   ` Andrew Cooper
2015-03-09  3:07     ` Tian, Kevin
2015-02-26 13:56 ` [PATCH 5/5] AMD IOMMU: " Jan Beulich
2015-03-05 17:30   ` Andrew Cooper
2015-03-06  7:50     ` Jan Beulich
2015-03-06 12:15       ` Andrew Cooper
2015-03-09 15:42         ` Suravee Suthikulanit
2015-03-09 17:26           ` Andrew Cooper
2015-03-09 19:02             ` Suravee Suthikulanit
2015-03-10  7:35               ` Jan Beulich
2015-03-10 13:55                 ` Boris Ostrovsky
2015-02-27 10:04 ` [PATCH 0/5] (not just)x86/Dom0: NUMA related adjustments Dario Faggioli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.