* [PATCH v2 0/7] vNUMA introduction.
From: Elena Ufimtseva @ 2013-11-14 3:25 UTC
To: xen-devel
Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
dario.faggioli, lccycc123, ian.jackson, JBeulich, Elena Ufimtseva
vNUMA introduction
This series of patches introduces vNUMA topology awareness and
provides interfaces and data structures to enable vNUMA for
PV guests. There is a plan to extend this support to dom0 and
HVM domains.
vNUMA topology support must also be present in the PV guest kernel;
the corresponding patches should be applied.
Introduction
-------------
vNUMA topology is exposed to the PV guest to improve performance when running
workloads on NUMA machines.
The Xen vNUMA implementation provides a way to create vNUMA-enabled guests on
NUMA/UMA machines and to map the vNUMA topology onto the physical NUMA topology
in an optimal way.
Xen vNUMA support
The current set of patches introduces a subop hypercall that is available to
enlightened PV guests with the vNUMA patches applied.
The domain structure has been modified to store the per-domain vNUMA topology
for use by other vNUMA-aware subsystems (e.g. ballooning).
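As a rough illustration of the guest side, retrieving the topology through the
new subop could look like the sketch below. This is a hedged sketch only:
everything except HYPERVISOR_memory_op() and set_xen_guest_handle() is an
illustrative stand-in, not the exact ABI defined in
xen/include/public/vnuma.h of this series.

    /* Hedged sketch of an enlightened PV guest fetching its vNUMA topology
     * at boot. struct vnuma_topology_info, its fields, the buffers and
     * XENMEM_get_vnuma_info are illustrative names. */
    static int __init xen_get_vnuma_info(void)
    {
        struct vnuma_topology_info info = { .domid = DOMID_SELF };
        int rc;

        set_xen_guest_handle(info.vmemrange, vmemranges);      /* per-vnode memory ranges */
        set_xen_guest_handle(info.vdistance, distance_table);  /* NxN distance table      */
        set_xen_guest_handle(info.vcpu_to_vnode, vcpu_map);    /* vcpu -> vnode map       */

        rc = HYPERVISOR_memory_op(XENMEM_get_vnuma_info, &info);
        if (rc < 0)
            return rc;  /* no vNUMA info: keep the flat single-node topology */

        /* ...feed the ranges and distances into the kernel's NUMA setup
         * (numa_add_memblk()/numa_set_distance() on x86)... */
        return 0;
    }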
libxc
libxc provides interfaces to build PV guests with vNUMA support and, on NUMA
machines, performs the initial memory allocation on the physical NUMA nodes.
This is implemented by using the nodemap formed by automatic NUMA placement;
a sketch of the idea follows, and the details are in patch #3.
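A minimal sketch, assuming the allocation loop lives in libxc:
xc_domain_populate_physmap_exact() and XENMEMF_exact_node() are existing
libxc/Xen interfaces, but the function and array names below are made up, and
this is not the literal code of patch #3.

    #include <xenctrl.h>

    /* Place each vnode's memory on the physical NUMA node chosen by
     * automatic placement; vnode_to_pnode[] comes from the placement
     * nodemap. All names here are illustrative. */
    static int allocate_vnuma_memory(xc_interface *xch, uint32_t domid,
                                     unsigned int nr_vnodes,
                                     const unsigned int *vnode_to_pnode,
                                     const unsigned long *pages_per_vnode,
                                     const unsigned long *pfn_base,
                                     xen_pfn_t *p2m_host)
    {
        unsigned int i;
        int rc;

        for ( i = 0; i < nr_vnodes; i++ )
        {
            /* XENMEMF_exact_node() asks for pages strictly from this pnode. */
            rc = xc_domain_populate_physmap_exact(
                     xch, domid, pages_per_vnode[i], 0 /* extent order */,
                     XENMEMF_exact_node(vnode_to_pnode[i]),
                     &p2m_host[pfn_base[i]]);
            if ( rc )
                return rc;  /* could not allocate on the requested pnode */
        }

        return 0;
    }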
libxl
libxl provides a way to predefine the vNUMA topology in the VM config file:
the number of vnodes, the memory arrangement, the vcpu-to-vnode assignment,
and the distance map; an illustrative fragment is shown below.
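The fragment below uses the option names as they appear in example 2 further
down; the exact syntax and semantics are documented in patch #7
(docs/man/xl.cfg.pod.5), and the values here are only illustrative:

    vnodes = 2                    # number of vNUMA nodes
    vnumamem = [1024, 1024]       # memory per vnode, in MB
    vdistance = [10, 40, 40, 10]  # vnode-to-vnode distance table
    vnuma_vcpumap = [0, 0, 1, 1]  # vcpu -> vnode assignment
    vnuma_vnodemap = [0, 1]       # preferred vnode -> pnode placement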
PV guest
As of now, only PV guests can take advantage of the vNUMA functionality. The
vNUMA Linux patches should be applied and NUMA support should be compiled into
the kernel.
This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git
Linux patchset https://git.gitorious.org/xenvnuma/linuxvnuma.git
Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
1. Automatic vNUMA placement on a real NUMA machine:
VM config:
memory = 16384
vcpus = 4
name = "rcbig"
vnodes = 4
vnumamem = [10,10]
vnuma_distance = [10, 30, 10, 30]
vcpu_to_vnode = [0, 0, 1, 1]
Xen:
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 5 (total: 4194304):
(XEN) Node 0: 2097152
(XEN) Node 1: 2097152
(XEN) Domain has 4 vnodes
(XEN) vnode 0 - pnode 0 (4096) MB
(XEN) vnode 1 - pnode 0 (4096) MB
(XEN) vnode 2 - pnode 1 (4096) MB
(XEN) vnode 3 - pnode 1 (4096) MB
(XEN) Domain vcpu to vnode:
(XEN) 0 1 2 3
dmesg on pv guest:
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0xffffffff]
[ 0.000000] node 1: [mem 0x100000000-0x1ffffffff]
[ 0.000000] node 2: [mem 0x200000000-0x2ffffffff]
[ 0.000000] node 3: [mem 0x300000000-0x3ffffffff]
[ 0.000000] On node 0 totalpages: 1048479
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 14280 pages used for memmap
[ 0.000000] DMA32 zone: 1044480 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 2 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] On node 3 totalpages: 1048576
[ 0.000000] Normal zone: 14336 pages used for memmap
[ 0.000000] Normal zone: 1048576 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x400100000-0x4004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:4
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff8800ffc00000 s85376 r8192 d21120 u2097152
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u2097152 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 [1] 1 [2] 2 [3] 3
pv guest: numactl --hardware:
root@heatpipe:~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0
node 0 size: 4031 MB
node 0 free: 3997 MB
node 1 cpus: 1
node 1 size: 4039 MB
node 1 free: 4022 MB
node 2 cpus: 2
node 2 size: 4039 MB
node 2 free: 4023 MB
node 3 cpus: 3
node 3 size: 3975 MB
node 3 free: 3963 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
Comments:
None of the configuration options is correct, so default values were used.
Since the machine is a NUMA machine and no vcpu pinning is defined, the
automatic NUMA node selection mechanism is used, and you can see how the
vnodes were split across the physical nodes.
2. vNUMA-enabled guest, no default values, real NUMA machine
Config:
memory = 4096
vcpus = 4
name = "rc9"
vnodes = 2
vnumamem = [2048, 2048]
vdistance = [10, 40, 40, 10]
vnuma_vcpumap = [1, 0, 1, 0]
vnuma_vnodemap = [1, 0]
Xen:
(XEN) 'u' pressed -> dumping numa info (now-0xA86:BD6C8829)
(XEN) idx0 -> NODE0 start->0 size->4521984 free->131471
(XEN) phys_to_nid(0000000000001000) -> 0 should be 0
(XEN) idx1 -> NODE1 start->4521984 size->4194304 free->341610
(XEN) phys_to_nid(0000000450001000) -> 1 should be 1
(XEN) CPU0 -> NODE0
(XEN) CPU1 -> NODE0
(XEN) CPU2 -> NODE0
(XEN) CPU3 -> NODE0
(XEN) CPU4 -> NODE1
(XEN) CPU5 -> NODE1
(XEN) CPU6 -> NODE1
(XEN) CPU7 -> NODE1
(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 2569511):
(XEN) Node 0: 1416166
(XEN) Node 1: 1153345
(XEN) Domain 6 (total: 1048576):
(XEN) Node 0: 524288
(XEN) Node 1: 524288
(XEN) Domain has 2 vnodes
(XEN) vnode 0 - pnode 1 (2048) MB
(XEN) vnode 1 - pnode 0 (2048) MB
(XEN) Domain vcpu to vnode:
(XEN) 1 0 1 0
pv guest dmesg:
[ 0.000000] NUMA: Initialized distance table, cnt=2
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] NODE_DATA [mem 0x7ffd9000-0x7fffffff]
[ 0.000000] Initmem setup node 1 [mem 0x80000000-0xffffffff]
[ 0.000000] NODE_DATA [mem 0xff7f8000-0xff81efff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009ffff]
[ 0.000000] node 0: [mem 0x00100000-0x7fffffff]
[ 0.000000] node 1: [mem 0x80000000-0xffffffff]
[ 0.000000] On node 0 totalpages: 524191
[ 0.000000] DMA zone: 56 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3999 pages, LIFO batch:0
[ 0.000000] DMA32 zone: 7112 pages used for memmap
[ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31
[ 0.000000] On node 1 totalpages: 524288
[ 0.000000] DMA32 zone: 7168 pages used for memmap
[ 0.000000] DMA32 zone: 524288 pages, LIFO batch:31
[ 0.000000] SFI: Simple Firmware Interface v0.81 http://simplefirmware.org
[ 0.000000] smpboot: Allowing 4 CPUs, 0 hotplug CPUs
[ 0.000000] No local APIC present
[ 0.000000] APIC: disable apic facility
[ 0.000000] APIC: switched to apic NOOP
[ 0.000000] nr_irqs_gsi: 16
[ 0.000000] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[ 0.000000] e820: cannot find a gap in the 32bit address range
[ 0.000000] e820: PCI devices with unassigned 32bit BARs may break!
[ 0.000000] e820: [mem 0x100100000-0x1004fffff] available for PCI devices
[ 0.000000] Booting paravirtualized kernel on Xen
[ 0.000000] Xen version: 4.4-unstable (preserve-AD)
[ 0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:4 nr_node_ids:2
[ 0.000000] PERCPU: Embedded 28 pages/cpu @ffff88007fc00000 s85376 r8192 d21120 u1048576
[ 0.000000] pcpu-alloc: s85376 r8192 d21120 u1048576 alloc=1*2097152
[ 0.000000] pcpu-alloc: [0] 0 2 [1] 1 3
pv guest:
root@heatpipe:~# numactl --ha
available: 2 nodes (0-1)
node 0 cpus: 1 3
node 0 size: 2011 MB
node 0 free: 1975 MB
node 1 cpus: 0 2
node 1 size: 2003 MB
node 1 free: 1983 MB
node distances:
node 0 1
0: 10 40
1: 40 10
In this case every config option is correct, and the guest gets exactly the
vNUMA topology specified in the VM config file.
Notes:
* to enable vNUMA in the Linux kernel, the corresponding patch set should be
applied;
* the automatic NUMA balancing feature seems to be fixed in the Linux kernel:
https://lkml.org/lkml/2013/7/31/647
TODO:
* This version limits the vdistance config option to only two values: the
same-node distance and the other-node distance (see the illustration after
this list); this prevents oopses on the latest (3.13-rc1) Linux kernel with
non-symmetric distances;
* cpu siblings for the Linux machine and the Xen cpu trap should be detected
and a warning should be given; add a cpuid check if set in the VM config;
* benchmarking;
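To illustrate the two-value vdistance scheme: vdistance = [10, 40] on a
4-vnode guest would expand to the symmetric distance table

    10 40 40 40
    40 10 40 40
    40 40 10 40
    40 40 40 10

i.e. 10 on the diagonal (local access) and 40 between any two different
vnodes.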
Elena Ufimtseva (7):
xen: vNUMA support for guests.
libxc: Plumb Xen with vNUMA topology for domain.
libxc: vnodes allocation on NUMA nodes.
libxl: vNUMA supporting interface.
libxl: vNUMA configuration parser
xen: adds vNUMA info debug-key u
xl: docs for xl config vnuma options
docs/man/xl.cfg.pod.5 | 55 +++++++++
tools/libxc/xc_dom.h | 10 ++
tools/libxc/xc_dom_x86.c | 85 ++++++++++++--
tools/libxc/xc_domain.c | 61 ++++++++++
tools/libxc/xenctrl.h | 9 ++
tools/libxc/xg_private.h | 1 +
tools/libxl/libxl.c | 20 ++++
tools/libxl/libxl.h | 20 ++++
tools/libxl/libxl_arch.h | 8 ++
tools/libxl/libxl_dom.c | 189 ++++++++++++++++++++++++++++-
tools/libxl/libxl_internal.h | 3 +
tools/libxl/libxl_types.idl | 5 +-
tools/libxl/libxl_vnuma.h | 7 ++
tools/libxl/libxl_x86.c | 58 +++++++++
tools/libxl/xl_cmdimpl.c | 268 +++++++++++++++++++++++++++++++++++++++++-
xen/arch/x86/numa.c | 19 +++
xen/common/domain.c | 10 ++
xen/common/domctl.c | 82 +++++++++++++
xen/common/memory.c | 36 ++++++
xen/include/public/domctl.h | 24 ++++
xen/include/public/memory.h | 8 ++
xen/include/public/vnuma.h | 44 +++++++
xen/include/xen/domain.h | 10 ++
xen/include/xen/sched.h | 1 +
24 files changed, 1020 insertions(+), 13 deletions(-)
create mode 100644 tools/libxl/libxl_vnuma.h
create mode 100644 xen/include/public/vnuma.h
--
1.7.10.4
* Re: [PATCH v2 0/7] vNUMA introduction.
From: Dario Faggioli @ 2013-11-14 10:03 UTC
To: Elena Ufimtseva
Cc: keir, Ian.Campbell, stefano.stabellini, george.dunlap, msw,
lccycc123, ian.jackson, xen-devel, JBeulich
Hi again, Elena,
And thanks for the good work! :-)
On Wed, 2013-11-13 at 22:25 -0500, Elena Ufimtseva wrote:
> vNUMA introduction
>
> [...]
>
> This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git
> Linux patchset https://git.gitorious.org/xenvnuma/linuxvnuma.git
>
AhA! When replying to the Linux series, I said the linuxvnuma.git repo
seemed empty, but trying again now it's cloning something... I guess you
either fixed this or it was just me having bad timing. :-)
> Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
>
> 1. Automatic vNUMA placement on a real NUMA machine:
>
> VM config:
>
> memory = 16384
> vcpus = 4
> name = "rcbig"
> vnodes = 4
> vnumamem = [10,10]
> vnuma_distance = [10, 30, 10, 30]
> vcpu_to_vnode = [0, 0, 1, 1]
>
> [..]
>
> Comments:
> None of the configuration options is correct, so default values were used.
>
And (talking without having looked at the patches yet), you do print a
warning when this happens, right? :-)
> Notes:
> * to enable vNUMA in the Linux kernel, the corresponding patch set should be
> applied;
> * the automatic NUMA balancing feature seems to be fixed in the Linux kernel:
> https://lkml.org/lkml/2013/7/31/647
>
Mmm... I'm quite curious about this, since we talked extensively about
it in Edinburgh. Does this mean you're not having issues with the NUMA
hinting page fault any longer? Even without doing anything, either in
Xen or Linux?
So what was that? The URL above is just someone reporting a quite
general 'memory corruption issue', with nothing about what the cause was
or whether and how it has been fixed. But, even more important, was that
what was causing the problem you were seeing?
> TODO:
> * This version limits the vdistance config option to only two values: the
> same-node distance and the other-node distance (see the illustration after
> this list); this prevents oopses on the latest (3.13-rc1) Linux kernel with
> non-symmetric distances;
>
Ok, that's fine for now. We'll work on allowing the syntax we agreed
with IanJ during last round of review (and yes, with "We", I mean
"I" :-D).
> * cpu siblings for the Linux machine and the Xen cpu trap should be detected
> and a warning should be given; add a cpuid check if set in the VM config;
> * benchmarking;
>
That's a big one! I think I'll have something ready at least for
facilitating it soon.
Thanks again and Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
* Re: [PATCH v2 0/7] vNUMA introduction.
From: Elena Ufimtseva @ 2013-11-14 13:40 UTC
To: Dario Faggioli
Cc: Keir Fraser, Ian Campbell, Stefano Stabellini, George Dunlap,
Matt Wilson, Li Yechen, Ian Jackson, xen-devel@lists.xen.org,
Jan Beulich
On Thu, Nov 14, 2013 at 5:03 AM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Hi again, Elena,
Hello Dario )
>
> And thanks for the good work! :-)
>
> On Wed, 2013-11-13 at 22:25 -0500, Elena Ufimtseva wrote:
>> vNUMA introduction
>>
>> [...]
>>
>> This patchset can be pulled from https://git.gitorious.org/xenvnuma/xenvnuma.git
>> Linux patchset https://git.gitorious.org/xenvnuma/linuxvnuma.git
>>
> AhA! When replying to the Linux series, I said the linuxvnuma.git repo
> seemed empty, but trying again now it's cloning something... I guess you
> either fixed this or it was just me having bad timing. :-)
Yes, gitorious is not that reliable when it comes to bigger repositories.
>
>> Examples of booting a vNUMA-enabled PV Linux guest on a real NUMA machine:
>>
>> 1. Automatic vNUMA placement on a real NUMA machine:
>>
>> VM config:
>>
>> memory = 16384
>> vcpus = 4
>> name = "rcbig"
>> vnodes = 4
>> vnumamem = [10,10]
>> vnuma_distance = [10, 30, 10, 30]
>> vcpu_to_vnode = [0, 0, 1, 1]
>>
>> [..]
>>
>> Comments:
>> None of the configuration options is correct, so default values were used.
>>
> And (talking without having looked at the patches yet), you do print a
> warning when this happens, right? :-)
>
>> Notes:
>> * to enable vNUMA in the Linux kernel, the corresponding patch set should be
>> applied;
>> * the automatic NUMA balancing feature seems to be fixed in the Linux kernel:
>> https://lkml.org/lkml/2013/7/31/647
>>
> Mmm... I'm quite curious about this, since we talked extensively about
> it in Edinburgh. Does this mean you're not having issues with the NUMA
> hinting page fault any longer? Even without doing anything, either in
> Xen or Linux?
Correct! :) I have run multiple tests, including kernel compilation.
I have reverted all the NUMA-balancing-related code I introduced and checked
again. And now, with automatic NUMA balancing turned on, there is no such
issue. The only remaining concern is a potential oops on the Linux kernel
side when migrating huge pages (set_pmd_at is absent in pv_mmu_ops, but I
will see if I can catch it with testing).
>
> So what was that? The URL above is just someone reporting a quite
> general 'memory corruption issue', with nothing about what the cause was
> or whether and how it has been fixed. But, even more important, was that
> what was causing the problem you were seeing?
My apologies, the correct link is as follows:
https://lkml.org/lkml/2013/10/31/133
That's Ingo Molnar's work. I have not looked closely into the code yet
and will do it today, but at first glance things have changed quite a bit.
>
>> TODO:
>> * This version limits the vdistance config option to only two values: the
>> same-node distance and the other-node distance (see the illustration after
>> this list); this prevents oopses on the latest (3.13-rc1) Linux kernel with
>> non-symmetric distances;
>>
> Ok, that's fine for now. We'll work on allowing the syntax we agreed
> with IanJ during last round of review (and yes, with "We", I mean
> "I" :-D).
>
>> * cpu siblings for the Linux machine and the Xen cpu trap should be detected
>> and a warning should be given; add a cpuid check if set in the VM config;
>> * benchmarking;
>>
> That's a big one! I think I'll have something ready at least for
> facilitating it soon.
I would like to know how you do this :)
>
> Thanks again and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
--
Elena