* [PATCH v2 0/2] xen: vnuma introduction for pv guest
@ 2013-11-18 20:25 Elena Ufimtseva
  2013-11-18 20:25 ` [PATCH v2 1/2] xen: vnuma support for PV guests running as domU Elena Ufimtseva
  ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Elena Ufimtseva @ 2013-11-18 20:25 UTC (permalink / raw)
  To: xen-devel
  Cc: konrad.wilk, boris.ostrovsky, david.vrabel, tglx, mingo, hpa, x86,
	akpm, tangchen, wency, ian.campbell, stefano.stabellini,
	mukesh.rathor, linux-kernel, Elena Ufimtseva

Xen vnuma introduction.

The patchset introduces vnuma to paravirtualized Xen guests running as
domU. A Xen subop hypercall is used to retrieve the vnuma topology
information. Based on the topology retrieved from Xen, the NUMA number
of nodes, memory ranges, distance table and cpumask are set. If
initialization is incorrect, a 'dummy' node is set and the nodemask is
unset. The vNUMA topology is constructed by the Xen toolstack. The Xen
patchset is available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.

Example dmesg of a vnuma-enabled PV domain:

[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152

numactl output:

root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

Current patchset is available at
https://git.gitorious.org/xenvnuma/linuxvnuma.git:v3

Xen patchset is available at:
https://git.gitorious.org/xenvnuma/xenvnuma.git:v3

TODO:
* dom0, pvh and hvm vnuma support;
* multiple memory ranges per node support;
* benchmarking;

Elena Ufimtseva (2):
  xen: vnuma support for PV guests running as domU
  xen: enable vnuma for PV guest

 arch/x86/include/asm/xen/vnuma.h |   12 ++++
 arch/x86/mm/numa.c               |    3 +
 arch/x86/xen/Makefile            |    2 +-
 arch/x86/xen/setup.c             |    6 +-
 arch/x86/xen/vnuma.c             |  127 ++++++++++++++++++++++++++++++++++++++
 include/xen/interface/memory.h   |   44 +++++++++++++
 6 files changed, 192 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/include/asm/xen/vnuma.h
 create mode 100644 arch/x86/xen/vnuma.c

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 15+ messages in thread
* [PATCH v2 1/2] xen: vnuma support for PV guests running as domU 2013-11-18 20:25 [PATCH v2 0/2] xen: vnuma introduction for pv guest Elena Ufimtseva @ 2013-11-18 20:25 ` Elena Ufimtseva 2013-11-18 21:14 ` H. Peter Anvin 2013-11-19 7:15 ` [Xen-devel] " Dario Faggioli 2013-11-18 20:25 ` [PATCH v2 2/2] xen: enable vnuma for PV guest Elena Ufimtseva 2013-11-19 15:38 ` [PATCH v2 0/2] xen: vnuma introduction for pv guest Konrad Rzeszutek Wilk 2 siblings, 2 replies; 15+ messages in thread From: Elena Ufimtseva @ 2013-11-18 20:25 UTC (permalink / raw) To: xen-devel Cc: konrad.wilk, boris.ostrovsky, david.vrabel, tglx, mingo, hpa, x86, akpm, tangchen, wency, ian.campbell, stefano.stabellini, mukesh.rathor, linux-kernel, Elena Ufimtseva Issues Xen hypercall subop XENMEM_get_vnumainfo and sets the NUMA topology, otherwise sets dummy NUMA node and prevents numa_init from calling other numa initializators as they dont work with pv guests. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- arch/x86/include/asm/xen/vnuma.h | 12 ++++ arch/x86/mm/numa.c | 3 + arch/x86/xen/Makefile | 2 +- arch/x86/xen/vnuma.c | 127 ++++++++++++++++++++++++++++++++++++++ include/xen/interface/memory.h | 44 +++++++++++++ 5 files changed, 187 insertions(+), 1 deletion(-) create mode 100644 arch/x86/include/asm/xen/vnuma.h create mode 100644 arch/x86/xen/vnuma.c diff --git a/arch/x86/include/asm/xen/vnuma.h b/arch/x86/include/asm/xen/vnuma.h new file mode 100644 index 0000000..aee4e92 --- /dev/null +++ b/arch/x86/include/asm/xen/vnuma.h @@ -0,0 +1,12 @@ +#ifndef _ASM_X86_VNUMA_H +#define _ASM_X86_VNUMA_H + +#ifdef CONFIG_XEN +bool xen_vnuma_supported(void); +int xen_numa_init(void); +#else +static inline bool xen_vnuma_supported(void) { return false; }; +static inline int xen_numa_init(void) { return -1; }; +#endif + +#endif /* _ASM_X86_VNUMA_H */ diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index 24aec58..99efa1b 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -17,6 +17,7 @@ #include <asm/dma.h> #include <asm/acpi.h> #include <asm/amd_nb.h> +#include "asm/xen/vnuma.h" #include "numa_internal.h" @@ -632,6 +633,8 @@ static int __init dummy_numa_init(void) void __init x86_numa_init(void) { if (!numa_off) { + if (!numa_init(xen_numa_init)) + return; #ifdef CONFIG_X86_NUMAQ if (!numa_init(numaq_numa_init)) return; diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile index 96ab2c0..de9deab 100644 --- a/arch/x86/xen/Makefile +++ b/arch/x86/xen/Makefile @@ -13,7 +13,7 @@ CFLAGS_mmu.o := $(nostackp) obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \ time.o xen-asm.o xen-asm_$(BITS).o \ grant-table.o suspend.o platform-pci-unplug.o \ - p2m.o + p2m.o vnuma.o obj-$(CONFIG_EVENT_TRACING) += trace.o diff --git a/arch/x86/xen/vnuma.c b/arch/x86/xen/vnuma.c new file mode 100644 index 0000000..bce4523 --- /dev/null +++ b/arch/x86/xen/vnuma.c @@ -0,0 +1,127 @@ +#include <linux/err.h> +#include <linux/memblock.h> +#include <xen/interface/xen.h> +#include <xen/interface/memory.h> +#include <asm/xen/interface.h> +#include <asm/xen/hypercall.h> +#include <asm/xen/vnuma.h> + +#ifdef CONFIG_NUMA + +/* Checks if hypercall is supported */ +bool xen_vnuma_supported() +{ + return HYPERVISOR_memory_op(XENMEM_get_vnuma_info, NULL) == -ENOSYS ? 
false : true; +} + +/* + * Called from numa_init if numa_off = 0; + * we set numa_off = 0 if xen_vnuma_supported() + * returns true and its a domU; + */ +int __init xen_numa_init(void) +{ + int rc; + unsigned int i, j, nr_nodes, cpu, idx, pcpus; + u64 physm, physd, physc; + unsigned int *vdistance, *cpu_to_node; + unsigned long mem_size, dist_size, cpu_to_node_size; + struct vmemrange *vblock; + + struct vnuma_topology_info numa_topo = { + .domid = DOMID_SELF, + .__pad = 0 + }; + rc = -EINVAL; + physm = physd = physc = 0; + + /* For now only PV guests are supported */ + if (!xen_pv_domain()) + return rc; + + pcpus = num_possible_cpus(); + + mem_size = pcpus * sizeof(struct vmemrange); + dist_size = pcpus * pcpus * sizeof(*numa_topo.distance); + cpu_to_node_size = pcpus * sizeof(*numa_topo.cpu_to_node); + + physm = memblock_alloc(mem_size, PAGE_SIZE); + vblock = __va(physm); + + physd = memblock_alloc(dist_size, PAGE_SIZE); + vdistance = __va(physd); + + physc = memblock_alloc(cpu_to_node_size, PAGE_SIZE); + cpu_to_node = __va(physc); + + if (!physm || !physc || !physd) + goto out; + + set_xen_guest_handle(numa_topo.nr_nodes, &nr_nodes); + set_xen_guest_handle(numa_topo.memrange, vblock); + set_xen_guest_handle(numa_topo.distance, vdistance); + set_xen_guest_handle(numa_topo.cpu_to_node, cpu_to_node); + + rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &numa_topo); + + if (rc < 0) + goto out; + nr_nodes = *numa_topo.nr_nodes; + if (nr_nodes == 0) { + goto out; + } + if (nr_nodes > num_possible_cpus()) { + pr_debug("vNUMA: Node without cpu is not supported in this version.\n"); + goto out; + } + + /* + * NUMA nodes memory ranges are in pfns, constructed and + * aligned based on e820 ram domain map. + */ + for (i = 0; i < nr_nodes; i++) { + if (numa_add_memblk(i, vblock[i].start, vblock[i].end)) + goto out; + node_set(i, numa_nodes_parsed); + } + + setup_nr_node_ids(); + /* Setting the cpu, apicid to node */ + for_each_cpu(cpu, cpu_possible_mask) { + set_apicid_to_node(cpu, cpu_to_node[cpu]); + numa_set_node(cpu, cpu_to_node[cpu]); + cpumask_set_cpu(cpu, node_to_cpumask_map[cpu_to_node[cpu]]); + } + + for (i = 0; i < nr_nodes; i++) { + for (j = 0; j < *numa_topo.nr_nodes; j++) { + idx = (j * nr_nodes) + i; + numa_set_distance(i, j, *(vdistance + idx)); + } + } + + rc = 0; +out: + if (physm) + memblock_free(__pa(physm), mem_size); + if (physd) + memblock_free(__pa(physd), dist_size); + if (physc) + memblock_free(__pa(physc), cpu_to_node_size); + /* + * Set a dummy node and return success. This prevents calling any + * hardware-specific initializers which do not work in a PV guest. + * Taken from dummy_numa_init code. 
+ */ + if (rc != 0) { + for (i = 0; i < MAX_LOCAL_APIC; i++) + set_apicid_to_node(i, NUMA_NO_NODE); + nodes_clear(numa_nodes_parsed); + nodes_clear(node_possible_map); + nodes_clear(node_online_map); + node_set(0, numa_nodes_parsed); + numa_add_memblk(0, 0, PFN_PHYS(max_pfn)); + } + return 0; +} +#endif diff --git a/include/xen/interface/memory.h b/include/xen/interface/memory.h index 2ecfe4f..b61482c 100644 --- a/include/xen/interface/memory.h +++ b/include/xen/interface/memory.h @@ -263,4 +263,48 @@ struct xen_remove_from_physmap { }; DEFINE_GUEST_HANDLE_STRUCT(xen_remove_from_physmap); +/* vNUMA structures */ +struct vmemrange { + uint64_t start, end; + /* reserved */ + uint64_t _padm; +}; +DEFINE_GUEST_HANDLE_STRUCT(vmemrange); + +struct vnuma_topology_info { + /* OUT */ + domid_t domid; + uint32_t __pad; + /* IN */ + /* number of virtual numa nodes */ + union { + GUEST_HANDLE(uint) nr_nodes; + uint64_t _padn; + }; + /* distance table */ + union { + GUEST_HANDLE(uint) distance; + uint64_t _padd; + }; + /* cpu mapping to vnodes */ + union { + GUEST_HANDLE(uint) cpu_to_node; + uint64_t _padc; + }; + /* + * memory areas constructed by Xen, start and end + * of the ranges are specific to domain e820 map. + * Xen toolstack constructs these ranges for domain + * when building it. + */ + union { + GUEST_HANDLE(vmemrange) memrange; + uint64_t _padm; + }; +}; +typedef struct vnuma_topology_info vnuma_topology_info_t; +DEFINE_GUEST_HANDLE_STRUCT(vnuma_topology_info); + +#define XENMEM_get_vnuma_info 25 + #endif /* __XEN_PUBLIC_MEMORY_H__ */ -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/2] xen: vnuma support for PV guests running as domU
  2013-11-18 20:25 ` [PATCH v2 1/2] xen: vnuma support for PV guests running as domU Elena Ufimtseva
@ 2013-11-18 21:14   ` H. Peter Anvin
  2013-11-18 21:28     ` Elena Ufimtseva
  2013-11-18 22:13     ` Joe Perches
  1 sibling, 2 replies; 15+ messages in thread
From: H. Peter Anvin @ 2013-11-18 21:14 UTC (permalink / raw)
  To: Elena Ufimtseva, xen-devel
  Cc: konrad.wilk, boris.ostrovsky, david.vrabel, tglx, mingo, x86, akpm,
	tangchen, wency, ian.campbell, stefano.stabellini, mukesh.rathor,
	linux-kernel, Joe Perches

On 11/18/2013 12:25 PM, Elena Ufimtseva wrote:
> +/* Checks if hypercall is supported */
> +bool xen_vnuma_supported()

This isn't C++...

http://lwn.net/Articles/487493/

There are several more things in this patchset that get flagged by
checkpatch, but apparently this rather common (and rather serious)
problem is still not being detected, even though a patch was submitted
almost two years ago:

https://lkml.org/lkml/2012/3/16/510

	-hpa

^ permalink raw reply	[flat|nested] 15+ messages in thread
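For context on what is being flagged here: in C, unlike C++, an empty
parameter list in a definition means "unspecified arguments" rather than
"no arguments", so a function that takes nothing should say so with an
explicit (void). Applied to the function quoted above, the requested
change is simply this sketch:

	/* Checks if the vnuma hypercall is supported */
	bool xen_vnuma_supported(void)	/* explicit (void): the C way to say "no arguments" */
	{
		return HYPERVISOR_memory_op(XENMEM_get_vnuma_info, NULL) == -ENOSYS ?
			false : true;
	}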
* Re: [PATCH v2 1/2] xen: vnuma support for PV guests running as domU 2013-11-18 21:14 ` H. Peter Anvin @ 2013-11-18 21:28 ` Elena Ufimtseva 2013-11-18 22:13 ` Joe Perches 1 sibling, 0 replies; 15+ messages in thread From: Elena Ufimtseva @ 2013-11-18 21:28 UTC (permalink / raw) To: H. Peter Anvin Cc: xen-devel, Konrad Rzeszutek Wilk, Boris Ostrovsky, David Vrabel, tglx, mingo, x86, akpm, tangchen, wency, Ian Campbell, Stefano Stabellini, mukesh.rathor, linux-kernel, Joe Perches On Mon, Nov 18, 2013 at 4:14 PM, H. Peter Anvin <hpa@zytor.com> wrote: > On 11/18/2013 12:25 PM, Elena Ufimtseva wrote: >> +/* Checks if hypercall is supported */ >> +bool xen_vnuma_supported() > > This isn't C++... > > http://lwn.net/Articles/487493/ > > There are several more things in this patchset that get flagged by > checkpatch, but apparently this rather common (and rather serious) > problem is still not being detected, even through a patch was submitted > almost two years ago: > > https://lkml.org/lkml/2012/3/16/510 Thank you Peter, good to know. Will resend these. > > -hpa > > -- Elena ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v2 1/2] xen: vnuma support for PV guests running as domU 2013-11-18 21:14 ` H. Peter Anvin 2013-11-18 21:28 ` Elena Ufimtseva @ 2013-11-18 22:13 ` Joe Perches 1 sibling, 0 replies; 15+ messages in thread From: Joe Perches @ 2013-11-18 22:13 UTC (permalink / raw) To: H. Peter Anvin Cc: Elena Ufimtseva, xen-devel, konrad.wilk, boris.ostrovsky, david.vrabel, tglx, mingo, x86, akpm, tangchen, wency, ian.campbell, stefano.stabellini, mukesh.rathor, linux-kernel On Mon, 2013-11-18 at 13:14 -0800, H. Peter Anvin wrote: > On 11/18/2013 12:25 PM, Elena Ufimtseva wrote: > > +/* Checks if hypercall is supported */ > > +bool xen_vnuma_supported() > > This isn't C++... > http://lwn.net/Articles/487493/ > > There are several more things in this patchset that get flagged by > checkpatch, but apparently this rather common (and rather serious) > problem is still not being detected, even through a patch was submitted > almost two years ago: > > https://lkml.org/lkml/2012/3/16/510 I gave notes to the patch and no follow up was done. https://lkml.org/lkml/2012/3/16/514 ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] [PATCH v2 1/2] xen: vnuma support for PV guests running as domU 2013-11-18 20:25 ` [PATCH v2 1/2] xen: vnuma support for PV guests running as domU Elena Ufimtseva 2013-11-18 21:14 ` H. Peter Anvin @ 2013-11-19 7:15 ` Dario Faggioli 1 sibling, 0 replies; 15+ messages in thread From: Dario Faggioli @ 2013-11-19 7:15 UTC (permalink / raw) To: Elena Ufimtseva Cc: xen-devel, akpm, wency, x86, linux-kernel, tangchen, mingo, david.vrabel, hpa, boris.ostrovsky, tglx, stefano.stabellini, ian.campbell [-- Attachment #1: Type: text/plain, Size: 1944 bytes --] On lun, 2013-11-18 at 15:25 -0500, Elena Ufimtseva wrote: > Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> > diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile > index 96ab2c0..de9deab 100644 > --- a/arch/x86/xen/Makefile > +++ b/arch/x86/xen/Makefile > @@ -13,7 +13,7 @@ CFLAGS_mmu.o := $(nostackp) > obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \ > time.o xen-asm.o xen-asm_$(BITS).o \ > grant-table.o suspend.o platform-pci-unplug.o \ > - p2m.o > + p2m.o vnuma.o > > obj-$(CONFIG_EVENT_TRACING) += trace.o I think David said something about this during last round (going fetchin'-cuttin'-pastin' it): " obj-$(CONFIG_NUMA) += vnuma.o Then you can remove the #ifdef CONFIG_NUMA from xen/vnuma.c " > diff --git a/arch/x86/xen/vnuma.c b/arch/x86/xen/vnuma.c > +/* > + * Called from numa_init if numa_off = 0; ^ if numa_off = 1 ? > + * we set numa_off = 0 if xen_vnuma_supported() > + * returns true and its a domU; > + */ > +int __init xen_numa_init(void) > +{ > + if (nr_nodes > num_possible_cpus()) { > + pr_debug("vNUMA: Node without cpu is not supported in this version.\n"); > + goto out; > + } > + This is a super-minor thing, but I wouldn't say "in this version". It makes people think that there will be a later version where that will be supported, which we don't know. :-) > + /* > + * Set a dummy node and return success. This prevents calling any > + * hardware-specific initializers which do not work in a PV guest. > + * Taken from dummy_numa_init code. > + */ > This is a lot better... Thanks! :-) Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v2 2/2] xen: enable vnuma for PV guest 2013-11-18 20:25 [PATCH v2 0/2] xen: vnuma introduction for pv guest Elena Ufimtseva 2013-11-18 20:25 ` [PATCH v2 1/2] xen: vnuma support for PV guests running as domU Elena Ufimtseva @ 2013-11-18 20:25 ` Elena Ufimtseva 2013-11-19 15:38 ` [PATCH v2 0/2] xen: vnuma introduction for pv guest Konrad Rzeszutek Wilk 2 siblings, 0 replies; 15+ messages in thread From: Elena Ufimtseva @ 2013-11-18 20:25 UTC (permalink / raw) To: xen-devel Cc: konrad.wilk, boris.ostrovsky, david.vrabel, tglx, mingo, hpa, x86, akpm, tangchen, wency, ian.campbell, stefano.stabellini, mukesh.rathor, linux-kernel, Elena Ufimtseva Enables numa if vnuma topology hypercall is supported and it is domU. Signed-off-by: Elena Ufimtseva <ufimtseva@gmail.com> --- arch/x86/xen/setup.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c index 68c054f..0aab799 100644 --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -20,6 +20,7 @@ #include <asm/numa.h> #include <asm/xen/hypervisor.h> #include <asm/xen/hypercall.h> +#include <asm/xen/vnuma.h> #include <xen/xen.h> #include <xen/page.h> @@ -598,6 +599,9 @@ void __init xen_arch_setup(void) WARN_ON(xen_set_default_idle()); fiddle_vdso(); #ifdef CONFIG_NUMA - numa_off = 1; + if (!xen_initial_domain() && xen_vnuma_supported()) + numa_off = 0; + else + numa_off = 1; #endif } -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH v2 0/2] xen: vnuma introduction for pv guest
  2013-11-18 20:25 [PATCH v2 0/2] xen: vnuma introduction for pv guest Elena Ufimtseva
  2013-11-18 20:25 ` [PATCH v2 1/2] xen: vnuma support for PV guests running as domU Elena Ufimtseva
  2013-11-18 20:25 ` [PATCH v2 2/2] xen: enable vnuma for PV guest Elena Ufimtseva
@ 2013-11-19 15:38 ` Konrad Rzeszutek Wilk
  2013-11-19 18:29   ` [Xen-devel] " Dario Faggioli
  2 siblings, 1 reply; 15+ messages in thread
From: Konrad Rzeszutek Wilk @ 2013-11-19 15:38 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: xen-devel, boris.ostrovsky, david.vrabel, tglx, mingo, hpa, x86,
	akpm, tangchen, wency, ian.campbell, stefano.stabellini,
	mukesh.rathor, linux-kernel

On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
> Xen vnuma introduction.
>
> The patchset introduces vnuma to paravirtualized Xen guests
> running as domU.
> A Xen subop hypercall is used to retrieve the vnuma topology information.
> Based on the topology retrieved from Xen, the NUMA number of nodes,
> memory ranges, distance table and cpumask are set.
> If initialization is incorrect, a 'dummy' node is set and the
> nodemask is unset.
> The vNUMA topology is constructed by the Xen toolstack. The Xen patchset is
> available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.

Yeey!

One question - I know you had questions about the
PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
be harvested for AutoNUMA balancing.

And that the hypercall to set such a PTE entry disallows the
PROT_GLOBAL (it strips it off)? That means that when the
Linux page system kicks in (as it has ~PAGE_PRESENT) the
Linux page handler won't see the PROT_GLOBAL (as it has
been filtered out). Which means that the AutoNUMA code won't
kick in.

(see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)

Was that problem ever answered?

^ permalink raw reply	[flat|nested] 15+ messages in thread
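To make the concern concrete: the guest marks a present page for a NUMA
hinting fault by clearing _PAGE_PRESENT while leaving a software marker
bit set (on x86 that marker shares the bit position of the global bit,
which is why PROT_GLOBAL comes up above). A rough sketch of the failure
mode being described, using placeholder names rather than the real
kernel or Xen identifiers:

	/*
	 * Illustrative sketch only: NUMA_MARKER and the helpers below are
	 * placeholders, not actual kernel or Xen identifiers.
	 */

	/* 1. The scanner marks a present PTE for a NUMA hinting fault. */
	pteval_t want = (pte_val(*ptep) & ~_PAGE_PRESENT) | NUMA_MARKER;
	xen_update_pte(ptep, want);		/* PV guests update PTEs through Xen */

	/* 2. If Xen strips the marker while validating the new entry ...  */
	pteval_t got = want & ~NUMA_MARKER;	/* ... this is what gets installed  */

	/* 3. The fault handler treats this as a NUMA hint only if the marker
	 *    survived; with it filtered out, AutoNUMA never kicks in.       */
	if (!(got & _PAGE_PRESENT) && (got & NUMA_MARKER))
		handle_numa_hinting_fault();	/* never reached */
	else
		handle_ordinary_fault();	/* looks like a genuine not-present fault */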
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest
  2013-11-19 15:38 ` [PATCH v2 0/2] xen: vnuma introduction for pv guest Konrad Rzeszutek Wilk
@ 2013-11-19 18:29   ` Dario Faggioli
  2013-12-04  0:35     ` Elena Ufimtseva
  0 siblings, 1 reply; 15+ messages in thread
From: Dario Faggioli @ 2013-11-19 18:29 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Elena Ufimtseva, akpm, wency, stefano.stabellini, x86, linux-kernel,
	tangchen, mingo, david.vrabel, hpa, xen-devel, boris.ostrovsky,
	tglx, ian.campbell

[-- Attachment #1: Type: text/plain, Size: 2845 bytes --]

On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote:
> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
> > The patchset introduces vnuma to paravirtualized Xen guests
> > running as domU.
> > A Xen subop hypercall is used to retrieve the vnuma topology information.
> > Based on the topology retrieved from Xen, the NUMA number of nodes,
> > memory ranges, distance table and cpumask are set.
> > If initialization is incorrect, a 'dummy' node is set and the
> > nodemask is unset.
> > The vNUMA topology is constructed by the Xen toolstack. The Xen patchset is
> > available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.
>
> Yeey!
>
:-)

> One question - I know you had questions about the
> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
> be harvested for AutoNUMA balancing.
>
> And that the hypercall to set such a PTE entry disallows the
> PROT_GLOBAL (it strips it off)? That means that when the
> Linux page system kicks in (as it has ~PAGE_PRESENT) the
> Linux page handler won't see the PROT_GLOBAL (as it has
> been filtered out). Which means that the AutoNUMA code won't
> kick in.
>
> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)
>
> Was that problem ever answered?
>
I think the issue is a twofold one.

If I remember correctly (Elena, please, correct me if I'm wrong) Elena
was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest.
That's what pushed her to investigate the issue, and led to what you're
summing up above.

However, it appears the crash was due to something completely unrelated
to Xen and vNUMA, was affecting baremetal too, and got fixed, which
means the crash is now gone.

It remains to be seen (I think) whether that also means that AutoNUMA
works. In fact, chatting about this in Edinburgh, Elena managed to
convince me pretty badly that we should --as part of the vNUMA support--
do something about this, in order to make it work. At that time I
thought we should be doing something to avoid the system going ka-boom,
but as I said, even now that it does not crash anymore, she was so
persuasive that I now find it quite hard to believe that we really don't
need to do anything. :-P

I guess, as soon as we get the chance, we should see if this actually
works, i.e., in addition to seeing the proper topology and not crashing,
verify that AutoNUMA in the guest is actually doing its job.

What do you think? Again, Elena, please chime in and explain how things
are, if I got something wrong. :-)

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest
  2013-11-19 18:29 ` [Xen-devel] " Dario Faggioli
@ 2013-12-04  0:35     ` Elena Ufimtseva
  2013-12-04  6:20       ` Elena Ufimtseva
  0 siblings, 1 reply; 15+ messages in thread
From: Elena Ufimtseva @ 2013-12-04 0:35 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86,
	linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin,
	xen-devel, Boris Ostrovsky, tglx, Ian Campbell

On Tue, Nov 19, 2013 at 1:29 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote:
>> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote:
>> > The patchset introduces vnuma to paravirtualized Xen guests
>> > running as domU.
>> > A Xen subop hypercall is used to retrieve the vnuma topology information.
>> > Based on the topology retrieved from Xen, the NUMA number of nodes,
>> > memory ranges, distance table and cpumask are set.
>> > If initialization is incorrect, a 'dummy' node is set and the
>> > nodemask is unset.
>> > The vNUMA topology is constructed by the Xen toolstack. The Xen patchset is
>> > available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3.
>>
>> Yeey!
>>
> :-)
>
>> One question - I know you had questions about the
>> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to
>> be harvested for AutoNUMA balancing.
>>
>> And that the hypercall to set such a PTE entry disallows the
>> PROT_GLOBAL (it strips it off)? That means that when the
>> Linux page system kicks in (as it has ~PAGE_PRESENT) the
>> Linux page handler won't see the PROT_GLOBAL (as it has
>> been filtered out). Which means that the AutoNUMA code won't
>> kick in.
>>
>> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317)
>>
>> Was that problem ever answered?
>>
> I think the issue is a twofold one.
>
> If I remember correctly (Elena, please, correct me if I'm wrong) Elena
> was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest.
> That's what pushed her to investigate the issue, and led to what you're
> summing up above.
>
> However, it appears the crash was due to something completely unrelated
> to Xen and vNUMA, was affecting baremetal too, and got fixed, which
> means the crash is now gone.
>
> It remains to be seen (I think) whether that also means that AutoNUMA
> works. In fact, chatting about this in Edinburgh, Elena managed to
> convince me pretty badly that we should --as part of the vNUMA support--
> do something about this, in order to make it work. At that time I
> thought we should be doing something to avoid the system going ka-boom,
> but as I said, even now that it does not crash anymore, she was so
> persuasive that I now find it quite hard to believe that we really don't
> need to do anything. :-P

Yes, you were right, Dario :) See at the end: PV guests do not crash,
but they do see user-space memory corruption.
Ok, so I will try to understand what went wrong again this weekend.
Meanwhile I am posting the patches for Xen.

> I guess, as soon as we get the chance, we should see if this actually
> works, i.e., in addition to seeing the proper topology and not crashing,
> verify that AutoNUMA in the guest is actually doing its job.
>
> What do you think? Again, Elena, please chime in and explain how things
> are, if I got something wrong. :-)
>
Oh guys, I feel really bad about not replying to these emails... Somehow
these replies all got deleted... weird.

Ok, about that automatic balancing. As of the last patch, automatic NUMA
balancing seemed to work, but after rebasing on top of 3.12-rc2 I see
similar issues. I will try to figure out which commits broke it and will
contact Ingo Molnar and Mel Gorman.

Konrad,
as for the PROT_GLOBAL flag, I will double-check once more to exclude
errors on my side. Last time I was able to have numa_balancing working
without any modifications on the hypervisor side. But again, I want to
double-check this; some experiments might only have appeared to be
good :)

> Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

-- 
Elena

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest 2013-12-04 0:35 ` Elena Ufimtseva @ 2013-12-04 6:20 ` Elena Ufimtseva 2013-12-05 1:13 ` Dario Faggioli 0 siblings, 1 reply; 15+ messages in thread From: Elena Ufimtseva @ 2013-12-04 6:20 UTC (permalink / raw) To: Dario Faggioli Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86, linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin, xen-devel, Boris Ostrovsky, tglx, Ian Campbell On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > On Tue, Nov 19, 2013 at 1:29 PM, Dario Faggioli > <dario.faggioli@citrix.com> wrote: >> On mar, 2013-11-19 at 10:38 -0500, Konrad Rzeszutek Wilk wrote: >>> On Mon, Nov 18, 2013 at 03:25:48PM -0500, Elena Ufimtseva wrote: >>> > The patchset introduces vnuma to paravirtualized Xen guests >>> > runnning as domU. >>> > Xen subop hypercall is used to retreive vnuma topology information. >>> > Bases on the retreived topology from Xen, NUMA number of nodes, >>> > memory ranges, distance table and cpumask is being set. >>> > If initialization is incorrect, sets 'dummy' node and unsets >>> > nodemask. >>> > vNUMA topology is constructed by Xen toolstack. Xen patchset is >>> > available at https://git.gitorious.org/xenvnuma/xenvnuma.git:v3. >>> >>> Yeey! >>> >> :-) >> >>> One question - I know you had questions about the >>> PROT_GLOBAL | ~PAGE_PRESENT being set on PTEs that are going to >>> be harvested for AutoNUMA balancing. >>> >>> And that the hypercall to set such PTE entry disallows the >>> PROT_GLOBAL (it stripts it off)? That means that when the >>> Linux page system kicks in (as it has ~PAGE_PRESENT) the >>> Linux pagehandler won't see the PROT_GLOBAL (as it has >>> been filtered out). Which means that the AutoNUMA code won't >>> kick in. >>> >>> (see http://article.gmane.org/gmane.comp.emulators.xen.devel/174317) >>> >>> Was that problem ever answered? >>> >> I think the issue is a twofold one. >> >> If I remember correctly (Elena, please, correct me if I'm wrong) Elena >> was seeing _crashes_ with both vNUMA and AutoNUMA enabled for the guest. >> That's what pushed her to investigate the issue, and led to what you're >> summing up above. >> >> However, it appears the crash was due to something completely unrelated >> to Xen and vNUMA, was affecting baremetal too, and got fixed, which >> means the crash is now gone. >> >> It remains to be seen (I think) whether that also means that AutoNUMA >> works. In fact, chatting about this in Edinburgh, Elena managed to >> convince me pretty badly that we should --as part of the vNUMA support-- >> do something about this, in order to make it work. At that time I >> thought we should be doing something to avoid the system to go ka-boom, >> but as I said, even now that it does not crash anymore, she was so >> persuasive that I now find it quite hard to believe that we really don't >> need to do anything. :-P > > Yes, you were right Dario :) See at the end. pv guests do not crash, > but they have user space memory corruption. > Ok, so I will try to understand what again had happened during this > weekend. > Meanwhile posting patches for Xen. > >> >> I guess, as soon as we get the chance, we should see if this actually >> works, i.e., in addition to seeing the proper topology and not crashing, >> verify that AutoNUMA in the guest is actually doing is job. >> >> What do you think? Again, Elena, please chime in and explain how things >> are, if I got something wrong. 
:-)
>>
> Oh guys, I feel really bad about not replying to these emails... Somehow
> these replies all got deleted... weird.
>
> Ok, about that automatic balancing. As of the last patch, automatic NUMA
> balancing seemed to work, but after rebasing on top of 3.12-rc2 I see
> similar issues. I will try to figure out which commits broke it and will
> contact Ingo Molnar and Mel Gorman.
>
> Konrad,
> as for the PROT_GLOBAL flag, I will double-check once more to exclude
> errors on my side. Last time I was able to have numa_balancing working
> without any modifications on the hypervisor side. But again, I want to
> double-check this; some experiments might only have appeared to be
> good :)
>
>> Regards,
>> Dario
>>
>> --
>> <<This happens because I choose it to happen!>> (Raistlin Majere)
>> -----------------------------------------------------------------
>> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
>> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>>

As of now I have patch v4 for reviewing. Not sure if it will be more
beneficial to post it for review or to look closer at the current problem.

The issue I am seeing right now is different from what was happening
before. The corruption happens on the change_prot_numa path:

[ 6638.021439] pfn 45e602, highest_memmap_pfn - 14ddd7
[ 6638.021444] BUG: Bad page map in process dd pte:800000045e602166 pmd:abf1a067
[ 6638.021449] addr:00007f4fda2d8000 vm_flags:00100073 anon_vma:ffff8800abf77b90 mapping: (null) index:7f4fda2d8
[ 6638.021457] CPU: 1 PID: 1033 Comm: dd Tainted: G B W 3.13.0-rc2+ #10
[ 6638.021462] 0000000000000000 00007f4fda2d8000 ffffffff813ca5b1 ffff88010d68deb8
[ 6638.021471] ffffffff810f2c88 00000000abf1a067 800000045e602166 0000000000000000
[ 6638.021482] 000000000045e602 ffff88010d68deb8 00007f4fda2d8000 800000045e602166
[ 6638.021492] Call Trace:
[ 6638.021497] [<ffffffff813ca5b1>] ? dump_stack+0x41/0x51
[ 6638.021503] [<ffffffff810f2c88>] ? print_bad_pte+0x19d/0x1c9
[ 6638.021509] [<ffffffff810f3aef>] ? vm_normal_page+0x94/0xb3
[ 6638.021519] [<ffffffff810fb788>] ? change_protection+0x35c/0x5a8
[ 6638.021527] [<ffffffff81107965>] ? change_prot_numa+0x13/0x24
[ 6638.021533] [<ffffffff81071697>] ? task_numa_work+0x1fb/0x299
[ 6638.021539] [<ffffffff8105ef54>] ? task_work_run+0x7b/0x8f
[ 6638.021545] [<ffffffff8100e658>] ? do_notify_resume+0x53/0x68
[ 6638.021552] [<ffffffff813d4432>] ? int_signal+0x12/0x17
[ 6638.021560] pfn 45d732, highest_memmap_pfn - 14ddd7
[ 6638.021565] BUG: Bad page map in process dd pte:800000045d732166 pmd:10d684067
[ 6638.021572] addr:00007fff7c143000 vm_flags:00100173 anon_vma:ffff8800abf77960 mapping: (null) index:7fffffffc
[ 6638.021582] CPU: 1 PID: 1033 Comm: dd Tainted: G B W 3.13.0-rc2+ #10
[ 6638.021587] 0000000000000000 00007fff7c143000 ffffffff813ca5b1 ffff8800abf339b0
[ 6638.021595] ffffffff810f2c88 000000010d684067 800000045d732166 0000000000000000
[ 6638.021603] 000000000045d732 ffff8800abf339b0 00007fff7c143000 800000045d732166

The code has changed since the last problem; I will work on this to see
where it comes from.

Elena
>
>
> --
> Elena

-- 
Elena

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest 2013-12-04 6:20 ` Elena Ufimtseva @ 2013-12-05 1:13 ` Dario Faggioli 2013-12-20 7:39 ` Elena Ufimtseva 0 siblings, 1 reply; 15+ messages in thread From: Dario Faggioli @ 2013-12-05 1:13 UTC (permalink / raw) To: Elena Ufimtseva Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86, linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin, xen-devel, Boris Ostrovsky, tglx, Ian Campbell [-- Attachment #1: Type: text/plain, Size: 3465 bytes --] On mer, 2013-12-04 at 01:20 -0500, Elena Ufimtseva wrote: > On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > > Oh guys, I feel really bad about not replying to these emails... Somehow these > > replies all got deleted.. wierd. > > No worries... You should see *my* backlog. :-P > > Ok, about that automatic balancing. At the moment of the last patch > > automatic numa balancing seem to > > work, but after rebasing on the top of 3.12-rc2 I see similar issues. > > I will try to figure out what commits broke and will contact Ingo > > Molnar and Mel Gorman. > > > As of now I have patch v4 for reviewing. Not sure if it will be > beneficial to post it for review > or look closer at the current problem. > You mean the Linux side? Perhaps stick somewhere a reference to the git tree/branch where it lives, but, before re-sending, let's wait for it to be as issue free as we can tell? > The issue I am seeing right now is defferent from what was happening before. > The corruption happens when on change_prot_numa way : > Ok, so, I think I need to step back a bit from the actual stack trace and look at the big picture. Please, Elena or anyone, correct me if I'm saying something wrong about how Linux's autonuma works and interacts with Xen. The way it worked when I last looked at it was sort of like this: - there was a kthread scanning all the pages, removing the PAGE_PRESENT bit from actually present pages, and adding a new special one (PAGE_NUMA or something like that); - when a page fault is triggered and the PAGE_NUMA flag is found, it figures out the page is actually there, so no swap or anything. However, it tracks from what node the access to that page came from, matches it with the node where the page actually is and collect some statistics about that; - at some point (and here I don't remember the exact logic, since it changed quite a few times) pages ranking badly in the stats above are moved from one node to another. Is this description still accurate? If yes, here's what I would (double) check, when running this in a PV guest on top of Xen: 1. the NUMA hinting page fault, are we getting and handling them correctly in the PV guest? Are the stats in the guest kernel being updated in a sensible way, i.e., do they make sense and properly relate to the virtual topology of the guest? At some point we thought it would have been necessary to intercept these faults and make sure the above is true with some help from the hypervisor... Is this the case? Why? Why not? 2. what happens when autonuma tries to move pages from one node to another? For us, that would mean in moving from one virtual node to another... Is there a need to do anything at all? I mean, is this, from our perspective, just copying the content of an MFN from node X into another MFN on node Y, or do we need to update some of our vnuma tracking data structures in Xen? If we have this figured out already, then I think we just chase bugs and repost the series. If not, well, I think we should. 
:-D Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
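For readers who prefer the description above in code shape, here is a very
rough sketch of that scan/fault/migrate cycle. It is simplified,
illustrative C rather than the actual task_numa_work()/do_numa_page()
implementation, and the stats/policy/migration helpers named below are
placeholders, not real kernel functions:

	/* Simplified sketch of automatic NUMA balancing; not the real kernel code. */

	/* 1. Periodic scanner: turn a present PTE into a NUMA-hinting PTE. */
	static void numa_scan_one_sketch(pte_t *ptep)
	{
		if (pte_present(*ptep))
			set_pte(ptep, pte_mknuma(*ptep));	/* clear present, set the hint bit */
	}

	/* 2. Hinting fault: the page is really there; just record who touched it. */
	static void numa_hinting_fault_sketch(pte_t *ptep)
	{
		struct page *page = pte_page(*ptep);
		int page_nid = page_to_nid(page);	/* node the page currently lives on */
		int cpu_nid  = numa_node_id();		/* node of the CPU that faulted     */

		record_numa_fault_stats(cpu_nid, page_nid);	/* placeholder for the accounting */
		set_pte(ptep, pte_mknonnuma(*ptep));		/* make the PTE present again     */

		/* 3. If the accounting says the page is badly placed, move it
		 *    (the real code does this via migrate_misplaced_page()). */
		if (page_badly_placed(page_nid, cpu_nid))	/* placeholder policy check   */
			migrate_page_to_node(page, cpu_nid);	/* placeholder migration call */
	}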
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest
  2013-12-05  1:13 ` Dario Faggioli
@ 2013-12-20  7:39       ` Elena Ufimtseva
  2013-12-20  7:48         ` Elena Ufimtseva
  0 siblings, 1 reply; 15+ messages in thread
From: Elena Ufimtseva @ 2013-12-20 7:39 UTC (permalink / raw)
  To: Dario Faggioli
  Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86,
	linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin,
	xen-devel, Boris Ostrovsky, tglx, Ian Campbell

On Wed, Dec 4, 2013 at 8:13 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> On mer, 2013-12-04 at 01:20 -0500, Elena Ufimtseva wrote:
>> On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>> > Oh guys, I feel really bad about not replying to these emails... Somehow
>> > these replies all got deleted... weird.
>> >
> No worries... You should see *my* backlog. :-P
>
>> > Ok, about that automatic balancing. As of the last patch, automatic NUMA
>> > balancing seemed to work, but after rebasing on top of 3.12-rc2 I see
>> > similar issues. I will try to figure out which commits broke it and will
>> > contact Ingo Molnar and Mel Gorman.
>> >
>> As of now I have patch v4 for reviewing. Not sure if it will be more
>> beneficial to post it for review or to look closer at the current problem.
>>
> You mean the Linux side? Perhaps stick somewhere a reference to the git
> tree/branch where it lives, but, before re-sending, let's wait for it to
> be as issue free as we can tell?
>
>> The issue I am seeing right now is different from what was happening
>> before. The corruption happens on the change_prot_numa path:
>>
> Ok, so, I think I need to step back a bit from the actual stack trace
> and look at the big picture. Please, Elena or anyone, correct me if I'm
> saying something wrong about how Linux's autonuma works and interacts
> with Xen.
>
> The way it worked when I last looked at it was sort of like this:
>  - there was a kthread scanning all the pages, removing the PAGE_PRESENT
>    bit from actually present pages, and adding a new special one
>    (PAGE_NUMA or something like that);
>  - when a page fault is triggered and the PAGE_NUMA flag is found, it
>    figures out the page is actually there, so no swap or anything.
>    However, it tracks from what node the access to that page came from,
>    matches it with the node where the page actually is and collect some
>    statistics about that;
>  - at some point (and here I don't remember the exact logic, since it
>    changed quite a few times) pages ranking badly in the stats above are
>    moved from one node to another.

Hello Dario, Konrad.

- Yes, there is a kernel worker that runs on each node, scans a portion
of the pages, and marks them _PROT_NONE while resetting _PAGE_PRESENT.
A page fault is then triggered and control returns to the Linux PV
kernel, which goes through handle_mm_fault and into the NUMA fault
handler if the faulting pmd/pte turns out to be a NUMA entry with the
present flag cleared.
About the stats, I will have to collect some sensible information.

> Is this description still accurate? If yes, here's what I would (double)
> check, when running this in a PV guest on top of Xen:
>
>  1. the NUMA hinting page fault, are we getting and handling them
>     correctly in the PV guest? Are the stats in the guest kernel being
>     updated in a sensible way, i.e., do they make sense and properly
>     relate to the virtual topology of the guest?
>     At some point we thought it would have been necessary to intercept
>     these faults and make sure the above is true with some help from the
>     hypervisor... Is this the case? Why? Why not?

The real help needed from the hypervisor is to allow the _PAGE_NUMA flag
on pte/pmd entries. I have done so in the hypervisor by using the same
_PAGE_NUMA bit and including it in the allowed bit mask. As this bit is
the same as _PAGE_GLOBAL in the hypervisor, that may induce some other
errors. So far I have not seen any, and I will double-check this.

>  2. what happens when autonuma tries to move pages from one node to
>     another? For us, that would mean in moving from one virtual node
>     to another... Is there a need to do anything at all? I mean, is
>     this, from our perspective, just copying the content of an MFN from
>     node X into another MFN on node Y, or do we need to update some of
>     our vnuma tracking data structures in Xen?
>
> If we have this figured out already, then I think we just chase bugs and
> repost the series. If not, well, I think we should. :-D
>
here is the best part :)

After a fresh look at NUMA autobalancing, applying recent patches,
talking a bit to riel (who now works on mm NUMA autobalancing), and
running some tests including dd, LTP, kernel compiles and my own tests,
autobalancing now works correctly with vnuma. Now I can see successfully
migrated pages in /proc/vmstat:

numa_pte_updates 39
numa_huge_pte_updates 0
numa_hint_faults 36
numa_hint_faults_local 23
numa_pages_migrated 4
pgmigrate_success 4
pgmigrate_fail 0

I will be running some tests with transparent huge pages, as migration
of those is still expected to fail. It is probably possible to track
down all the patches related to NUMA autobalancing and figure out why
balancing was not working before. Given the amount of work kernel folks
have spent recently fixing NUMA issues, and the significance of the
changes themselves, I might need a few more attempts to understand it.

I am going to test THP and, if that works, will follow up with patches.

Dario, what tools did you use to test NUMA on Xen? Maybe there is
something I can use as well? Here http://lwn.net/Articles/558593/ Mel
Gorman uses SpecJBB and a JVM; I thought I could run something similar.

> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>

-- 
Elena

^ permalink raw reply	[flat|nested] 15+ messages in thread
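As an aside, the counters quoted above are easy to keep an eye on from
inside the guest; a trivial userspace helper along these lines (plain C,
nothing Xen-specific about it) is enough to check whether hinting faults
and migrations keep happening during a test run:

	/* Print the automatic NUMA balancing counters from /proc/vmstat. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/vmstat", "r");
		char line[256];

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "numa_", 5) || !strncmp(line, "pgmigrate_", 10))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}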
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest 2013-12-20 7:39 ` Elena Ufimtseva @ 2013-12-20 7:48 ` Elena Ufimtseva 2013-12-20 15:38 ` Dario Faggioli 0 siblings, 1 reply; 15+ messages in thread From: Elena Ufimtseva @ 2013-12-20 7:48 UTC (permalink / raw) To: Dario Faggioli Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86, linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin, xen-devel, Boris Ostrovsky, tglx, Ian Campbell On Fri, Dec 20, 2013 at 2:39 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote: > On Wed, Dec 4, 2013 at 8:13 PM, Dario Faggioli > <dario.faggioli@citrix.com> wrote: >> On mer, 2013-12-04 at 01:20 -0500, Elena Ufimtseva wrote: >>> On Tue, Dec 3, 2013 at 7:35 PM, Elena Ufimtseva <ufimtseva@gmail.com> wrote: >>> > Oh guys, I feel really bad about not replying to these emails... Somehow these >>> > replies all got deleted.. wierd. >>> > >> No worries... You should see *my* backlog. :-P >> >>> > Ok, about that automatic balancing. At the moment of the last patch >>> > automatic numa balancing seem to >>> > work, but after rebasing on the top of 3.12-rc2 I see similar issues. >>> > I will try to figure out what commits broke and will contact Ingo >>> > Molnar and Mel Gorman. >>> > >>> As of now I have patch v4 for reviewing. Not sure if it will be >>> beneficial to post it for review >>> or look closer at the current problem. >>> >> You mean the Linux side? Perhaps stick somewhere a reference to the git >> tree/branch where it lives, but, before re-sending, let's wait for it to >> be as issue free as we can tell? >> >>> The issue I am seeing right now is defferent from what was happening before. >>> The corruption happens when on change_prot_numa way : >>> >> Ok, so, I think I need to step back a bit from the actual stack trace >> and look at the big picture. Please, Elena or anyone, correct me if I'm >> saying something wrong about how Linux's autonuma works and interacts >> with Xen. >> >> The way it worked when I last looked at it was sort of like this: >> - there was a kthread scanning all the pages, removing the PAGE_PRESENT >> bit from actually present pages, and adding a new special one >> (PAGE_NUMA or something like that); >> - when a page fault is triggered and the PAGE_NUMA flag is found, it >> figures out the page is actually there, so no swap or anything. >> However, it tracks from what node the access to that page came from, >> matches it with the node where the page actually is and collect some >> statistics about that; >> - at some point (and here I don't remember the exact logic, since it >> changed quite a few times) pages ranking badly in the stats above are >> moved from one node to another. > > Hello Dario, Konrad. > > - Yes, there is a kernel worker that runs on each node and scans some > pages stats and > marks them as _PROT_NONE and resets _PAGE_PRESENT. > The page fault at this moment is triggered and control is being > returned back to the linux pv kernel > to process with handle_mm_fault and page numa fault handler if > discovered if that was a numa pmd/pte with > present flag cleared. > About the stats, I will have to collect some sensible information. > >> >> Is this description still accurate? If yes, here's what I would (double) >> check, when running this in a PV guest on top of Xen: >> >> 1. the NUMA hinting page fault, are we getting and handling them >> correctly in the PV guest? 
Are the stats in the guest kernel being >> updated in a sensible way, i.e., do they make sense and properly >> relate to the virtual topology of the guest? >> At some point we thought it would have been necessary to intercept >> these faults and make sure the above is true with some help from the >> hypervisor... Is this the case? Why? Why not? > > The real healp needed from hypervisor is to allow _PAGE_NUMA flags on > pte/pmd entries. > I have done so in hypervisor by utilizing same _PAGE_NUMA bit and > including into the allowed bit mask. > As this bit is the same as PAGE_GLOBAL in hypervisor, that may induce > some other errors. So far I have not seen any > and I will double check on this. > >> >> 2. what happens when autonuma tries to move pages from one node to >> another? For us, that would mean in moving from one virtual node >> to another... Is there a need to do anything at all? I mean, is >> this, from our perspective, just copying the content of an MFN from >> node X into another MFN on node Y, or do we need to update some of >> our vnuma tracking data structures in Xen? >> >> If we have this figured out already, then I think we just chase bugs and >> repost the series. If not, well, I think we should. :-D >> > here is the best part :) > > After a fresh look at the numa autobalancing, applying recent patches, > talking some to riel who works now on mm numa autobalancing and > running some tests including dd, ltp, kernel compiling and my own > tests, autobalancing now is working > correctly with vnuma. Now I can see sucessfully migrated pages in /proc/vmstat: > > numa_pte_updates 39 > numa_huge_pte_updates 0 > numa_hint_faults 36 > numa_hint_faults_local 23 > numa_pages_migrated 4 > pgmigrate_success 4 > pgmigrate_fail 0 > > I will be running some tests with transparent huge pages as the > migration of such will be failing. > Probably it is possible to find all the patches related to numa > autobalancing and figure out possible reasons > of why previously balancing was not working. Giving the amount of work > kernel folks spent recently to fix > issues with numa and the significance of the changes itself, I might > need few more attempts to understand it. > > I am going to test THP and if that works will follow up with patches. > > Dario, what tools did you use to test NUMA on xen? Maybe there is > something I can use as well? > Here http://lwn.net/Articles/558593/ Mel Gorman uses specjbb and jvm, > I though I can run something similar. And of course, more details will follow... :) > >> Thanks and Regards, >> Dario >> >> -- >> <<This happens because I choose it to happen!>> (Raistlin Majere) >> ----------------------------------------------------------------- >> Dario Faggioli, Ph.D, http://about.me/dario.faggioli >> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) >> > > > > -- > Elena -- Elena ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [Xen-devel] [PATCH v2 0/2] xen: vnuma introduction for pv guest
  2013-12-20  7:48 ` Elena Ufimtseva
@ 2013-12-20 15:38           ` Dario Faggioli
  0 siblings, 0 replies; 15+ messages in thread
From: Dario Faggioli @ 2013-12-20 15:38 UTC (permalink / raw)
  To: Elena Ufimtseva
  Cc: Konrad Rzeszutek Wilk, akpm, wency, Stefano Stabellini, x86,
	linux-kernel, tangchen, mingo, David Vrabel, H. Peter Anvin,
	xen-devel, Boris Ostrovsky, tglx, Ian Campbell

[-- Attachment #1: Type: text/plain, Size: 1471 bytes --]

On ven, 2013-12-20 at 02:48 -0500, Elena Ufimtseva wrote:
> On Fri, Dec 20, 2013 at 2:39 AM, Elena Ufimtseva <ufimtseva@gmail.com> wrote:
>
> > Dario, what tools did you use to test NUMA on Xen? Maybe there is
> > something I can use as well? Here http://lwn.net/Articles/558593/ Mel
> > Gorman uses SpecJBB and a JVM; I thought I could run something similar.
>
> And of course, more details will follow... :)
>
Yeah, well, during early investigation, I also used basically just
SpecJBB. See here:

http://blog.xen.org/index.php/2012/04/26/numa-and-xen-part-1-introduction/
http://blog.xen.org/index.php/2012/05/16/numa-and-xen-part-ii-scheduling-and-placement/

For benchmarking our NUMA aware scheduling solution, in Xen, I used
SpecJBB, sysbench and LMbench:

http://blog.xen.org/index.php/2012/05/16/numa-and-xen-part-ii-scheduling-and-placement/

The trickiest part is having the benchmark(s) of choice running
concurrently in a bunch of VMs on the same host. I used some hand-crafted
scripts at the time, very specific to my own environment. I'm working on
abstracting that and putting together something easier to share and use
outside my attic. :-P

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread