* [PATCH 0/4] [HVM] NUMA support in HVM guests
@ 2007-08-13 10:01 Andre Przywara
  2007-09-07  8:42 ` Xu, Anthony
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Przywara @ 2007-08-13 10:01 UTC (permalink / raw)
  To: xen-devel

Hi,

These four patches make it possible to forward NUMA characteristics into
HVM guests. This works by allocating memory explicitly from different NUMA
nodes and creating an appropriate ACPI SRAT table which describes the
topology. It needs a guest kernel that uses the SRAT table to
discover the NUMA topology.
This lifts the current de-facto limitation of guests to one
NUMA node: one can use more memory and/or more VCPUs than are
available on one node.

	Patch 1/4: introduce numanodes=n config file option.
This states how many NUMA nodes the guest should see. The default is 0,
which turns off most parts of the code.
	Patch 2/4: introduce CPU affinity for the allocate_physmap call.
Currently the NUMA node to take the memory from is chosen by simply using
the currently scheduled CPU. This patch allows a CPU to be specified
explicitly and provides XENMEM_DEFAULT_CPU for the old behavior.
	Patch 3/4: allocate memory with NUMA in mind.
Actually look at the numanodes=n option to split the memory request up
into n parts and allocate them from different nodes. Also change the VCPUs'
affinity to match the nodes.
	Patch 4/4: inject the created SRAT table into the guest.
Create an SRAT table, fill it with the desired NUMA topology, and
inject it into the guest.

Applies against staging c/s #15719.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address): Wilschdorfer Landstr. 101, 01109 Dresden,
Germany
Register court Dresden: HRA 4896
General partner authorized to represent: AMD Saxony LLC (registered office
Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy


* RE: [PATCH 0/4] [HVM] NUMA support in HVM guests
  2007-08-13 10:01 [PATCH 0/4] [HVM] NUMA support in HVM guests Andre Przywara
@ 2007-09-07  8:42 ` Xu, Anthony
  2007-09-07 12:49   ` Andre Przywara
  0 siblings, 1 reply; 7+ messages in thread
From: Xu, Anthony @ 2007-09-07  8:42 UTC (permalink / raw)
  To: Andre Przywara, xen-devel

Hi Andre,


This is a good start for supporting guest NUMA.

I have some comments.


+    for (i=0;i<=dominfo.max_vcpu_id;i++)
+    {
+        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
+        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
+    }

This always starts from node 0, which may make node 0 very busy while other nodes have little work.
It would be nice to start pinning from the most lightly loaded node.

We also need to add some limits for numanodes. The number of vcpus on a vnode should not be larger than the number of pcpus on the pnode. Otherwise several vcpus belonging to one domain run on the same pcpu, which is not what we want.
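
The two points above can be sketched as follows. This is only an illustration under assumptions, not code from the patches: struct node_info and all three helpers are hypothetical names, the load metric (number of VCPUs already pinned) is just one possible choice, and the node index is wrapped so that it stays below the number of physical nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node summary; this is not a Xen structure. */
struct node_info {
    unsigned int nr_pcpus;   /* physical CPUs on this node */
    unsigned int nr_vcpus;   /* VCPUs already pinned to this node */
};

/* Index of the node with the fewest pinned VCPUs (lowest index on ties). */
static unsigned int least_loaded_node(const struct node_info *nodes,
                                      unsigned int nr_nodes)
{
    unsigned int best = 0;

    assert(nodes != NULL && nr_nodes > 0);
    for (unsigned int n = 1; n < nr_nodes; n++)
        if (nodes[n].nr_vcpus < nodes[best].nr_vcpus)
            best = n;
    return best;
}

/* Same split as in the patch, (vcpu * numanodes) / nr_vcpus, but offset by
 * first_node and wrapped so the result is always a valid node index. */
static unsigned int vcpu_to_node(unsigned int vcpu, unsigned int nr_vcpus,
                                 unsigned int numanodes,
                                 unsigned int first_node,
                                 unsigned int nr_nodes)
{
    assert(nr_vcpus > 0 && nr_nodes > 0 && vcpu < nr_vcpus);
    return (first_node + (vcpu * numanodes) / nr_vcpus) % nr_nodes;
}

/* Returns 1 if no physical node would receive more VCPUs of this guest
 * than it has physical CPUs, 0 otherwise. */
static int placement_fits(const struct node_info *nodes, unsigned int nr_nodes,
                          unsigned int nr_vcpus, unsigned int numanodes,
                          unsigned int first_node)
{
    for (unsigned int n = 0; n < nr_nodes; n++) {
        unsigned int count = 0;

        for (unsigned int v = 0; v < nr_vcpus; v++)
            if (vcpu_to_node(v, nr_vcpus, numanodes, first_node, nr_nodes) == n)
                count++;
        if (count > nodes[n].nr_pcpus)
            return 0;
    }
    return 1;
}
```

A caller would pick first_node with least_loaded_node(), refuse the configuration if placement_fits() returns 0, and only then set the affinity of each VCPU to the mask of vcpu_to_node().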


In setup_numa_mem, each node gets an equal memory size, and if the memory allocation fails, domain creation fails. This may be too "rude". I think we can support guest NUMA where each node has a different memory size, and maybe even some nodes without memory. What we need to guarantee is that the guest sees the physical topology.
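
One way to relax the equal split is sketched below. It is an assumption-laden illustration and not what setup_numa_mem does: split_guest_memory is a hypothetical helper, and the per-node free page counts are assumed to be known to the caller. Each node first gets an equal share capped by its free memory, and whatever is left over is spilled onto nodes that still have room, so some nodes may end up with less memory or none.

```c
#include <assert.h>
#include <stdint.h>

/* Distribute total_pages over nr_nodes nodes.
 * free_pages[n] is the free memory of node n, assigned[n] receives the
 * number of pages to take from node n.
 * Returns 0 on success, -1 if the nodes together cannot hold the guest. */
static int split_guest_memory(uint64_t total_pages,
                              const uint64_t *free_pages,
                              uint64_t *assigned,
                              unsigned int nr_nodes)
{
    uint64_t share, remaining = total_pages;

    assert(free_pages != 0 && assigned != 0 && nr_nodes > 0);
    share = total_pages / nr_nodes;

    /* First pass: an equal share per node, capped by the node's free memory. */
    for (unsigned int n = 0; n < nr_nodes; n++) {
        assigned[n] = share < free_pages[n] ? share : free_pages[n];
        remaining -= assigned[n];
    }

    /* Second pass: spill what is left onto nodes that still have room. */
    for (unsigned int n = 0; n < nr_nodes && remaining > 0; n++) {
        uint64_t room = free_pages[n] - assigned[n];
        uint64_t extra = room < remaining ? room : remaining;

        assigned[n] += extra;
        remaining -= extra;
    }

    return remaining == 0 ? 0 : -1;
}
```

With this split, domain creation only fails when the selected nodes together lack the memory, not when a single node cannot provide its equal share.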


In your patch, when a NUMA guest is created, each vnode is pinned to a pnode. After a number of domain creations and destructions, the workload on the platform may become very imbalanced, so we need a method to balance the workload dynamically.
There are two methods IMO.
1. Implement a NUMA-aware scheduler and page migration.
2. Run a daemon in dom0 that monitors the workload and uses live migration to balance it if necessary.


Regards
-Anthony



* Re: [PATCH 0/4] [HVM] NUMA support in HVM guests
  2007-09-07  8:42 ` Xu, Anthony
@ 2007-09-07 12:49   ` Andre Przywara
  2007-09-10  1:14     ` Xu, Anthony
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Przywara @ 2007-09-07 12:49 UTC (permalink / raw)
  To: Xu, Anthony; +Cc: xen-devel

Anthony,

thanks for looking into the patches, I appreciate your comments.

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }
> 
> This always starts from node 0, which may make node 0 very busy while
> other nodes have little work.
This is true. I encountered this before, but didn't want to wait longer 
before sending the patches. Actually the "numanodes=n" config file 
option shouldn't specify the number of nodes, but a list of specific 
nodes to use, like "numanodes=0,2" to pin the domain to the first and 
the third node.
> It would be nice to start pinning from the most lightly loaded node.
This sounds interesting. It shouldn't be that hard to do this in libxc, 
but we should think about a syntax for specifying this behavior in the 
config file (if we change the semantics from a node count to specific 
nodes as I described above).
> We also need to add some limits for numanodes. The number of vcpus on a
> vnode should not be larger than the number of pcpus on the pnode.
> Otherwise several vcpus belonging to one domain run on the same pcpu,
> which is not what we want.
Would be nice, but at the moment I would make this the sysadmin's 
responsibility.
> In setup_numa_mem, each node gets an equal memory size, and if the memory
> allocation fails, domain creation fails. This may be too "rude". I think
> we can support guest NUMA where each node has a different memory size,
> and maybe even some nodes without memory. What we need to guarantee is
> that the guest sees the physical topology.
Sounds reasonable. I will look into this.
> In your patch, when a NUMA guest is created, each vnode is pinned to a
> pnode. After a number of domain creations and destructions, the workload
> on the platform may become very imbalanced, so we need a method to
> balance the workload dynamically.
> There are two methods IMO.
> 1. Implement a NUMA-aware scheduler and page migration.
> 2. Run a daemon in dom0 that monitors the workload and uses live
>    migration to balance it if necessary.
You are right, this may become a problem. I think the second solution is 
easier to implement. A NUMA-aware scheduler would be nice, but my idea 
was that the guest OS can schedule things better (more fine-grained, on a 
per-process basis rather than a per-machine basis). Changing the 
processing node without moving the memory along should be an exception 
(as it changes the NUMA topology, and at the moment I don't see a way to 
propagate this nicely to the (HVM) guest), so I think a kind of 
"real-emergency balancer" which includes page migration (quite expensive 
with bigger memory sizes!) would be more appropriate.

After all, my patches were more a basis for discussion than a final 
solution, so I see there is more work to do. At the moment I am working 
on including PV guests.

Regards,
Andre.




* RE: [PATCH 0/4] [HVM] NUMA support in HVM guests
  2007-09-07 12:49   ` Andre Przywara
@ 2007-09-10  1:14     ` Xu, Anthony
  2007-11-23  8:42       ` [PATCH 0/4] [HVM][RFC] " Duan, Ronghui
  0 siblings, 1 reply; 7+ messages in thread
From: Xu, Anthony @ 2007-09-10  1:14 UTC (permalink / raw)
  To: Andre Przywara; +Cc: xen-devel

Andre

>> This always starts from node 0, which may make node 0 very busy while
>> other nodes have little work.
>This is true. I encountered this before, but didn't want to wait longer
>before sending the patches. Actually the "numanodes=n" config file
>option shouldn't specify the number of nodes, but a list of specific
>nodes to use, like "numanodes=0,2" to pin the domain to the first and
>the third node.

It's a good idea to specify the nodes to use.
We can use "numanodes=0,2" in the config file, and it will be converted
into a bitmap (long numanodes) in which every bit indicates one node.
When the guest doesn't specify "numanodes", Xen will need to choose
suitable nodes for the guest, so Xen also needs to implement some
algorithm to choose them.
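
The conversion from a node list to a bitmap could look roughly like this. It is a minimal sketch under assumptions: parse_node_list and MAX_NODES are hypothetical names, the list syntax is taken to be plain decimal numbers separated by commas, and a single unsigned long limits the number of nodes to its bit width.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical limit: one bit per node in a single unsigned long. */
#define MAX_NODES (sizeof(unsigned long) * CHAR_BIT)

/* Parse a list such as "0,2" into a bitmap with one bit per node.
 * Returns 0 on success and stores the bitmap in *mask.
 * Returns -1 on malformed input or a node number that does not fit. */
static int parse_node_list(const char *s, unsigned long *mask)
{
    unsigned long result = 0;

    assert(s != NULL && mask != NULL);
    for (;;) {
        char *end;
        unsigned long node;

        /* Reject empty items, signs and whitespace before calling strtoul. */
        if (*s < '0' || *s > '9')
            return -1;
        errno = 0;
        node = strtoul(s, &end, 10);
        if (errno != 0 || node >= MAX_NODES)
            return -1;
        result |= 1UL << node;

        if (*end == '\0')
            break;              /* end of list */
        if (*end != ',')
            return -1;          /* unexpected separator */
        s = end + 1;
    }
    *mask = result;
    return 0;
}
```

For "0,2" this yields the bitmap 0x5, i.e. bits 0 and 2 set for the first and third node.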


>> We also need to add some limits for numanodes. The number of vcpus on
>> a vnode should not be larger than the number of pcpus on the pnode.
>> Otherwise several vcpus belonging to one domain run on the same pcpu,
>> which is not what we want.
>Would be nice, but at the moment I would make this the sysadmin's
>responsibility.
That's reasonable.


>After all, my patches were more a basis for discussion than a final
>solution, so I see there is more work to do. At the moment I am working
>on including PV guests.
>
That's a very good start for supporting guest NUMA.



Regards
- Anthony


* RE: [PATCH 0/4] [HVM][RFC] NUMA support in HVM guests
  2007-09-10  1:14     ` Xu, Anthony
@ 2007-11-23  8:42       ` Duan, Ronghui
  2007-11-23 14:23         ` [RFC] NUMA support Andre Przywara
  0 siblings, 1 reply; 7+ messages in thread
From: Duan, Ronghui @ 2007-11-23  8:42 UTC (permalink / raw)
  To: Xu, Anthony, Andre Przywara; +Cc: xen-devel

Hi Andre,

I read your patches and Anthony's comments, and wrote a patch based on
them:

1:	If the guest sets numanodes=n (the default is 1, which means the
	guest is restricted to one node), the hypervisor chooses the
	starting node to pin this guest to using round robin. But the
	method I use needs a spin_lock to prevent domains from being
	created at the same time. Are there any better methods? I hope
	for your suggestions.
2:	Pass the node parameter in the higher bits of flags when creating
	the domain. The domain can then record the node information in
	the domain struct for further use, e.g. to show which node to pin
	to in setup_guest.
	With this method, your patch can balance nodes simply, as below:

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1)+ 		
> +		domaininfo.first_node;
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }
>
	BTW: I can't find your mail with Patch 2/4 (introduce CPU affinity
	for the allocate_physmap call), so I can't apply your patch to the
	source.
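
Regarding the question about the spin_lock in point 1: one common alternative for a round-robin counter is an atomic fetch-and-add, sketched here with C11 atomics in plain userspace C. This is an assumption for illustration only. It does not use the hypervisor's own atomic primitives, and next_start_node and pick_first_node are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical global counter; objects with static storage start at zero. */
static atomic_uint next_start_node;

/* Return the starting node for the next domain in round-robin order.
 * atomic_fetch_add returns the old value and increments in one step, so
 * two concurrent callers never observe the same counter value. */
static unsigned int pick_first_node(unsigned int nr_nodes)
{
    assert(nr_nodes > 0);
    return atomic_fetch_add(&next_start_node, 1u) % nr_nodes;
}
```

Two caveats: when the counter wraps around and nr_nodes is not a power of two, the sequence skips ahead once, and a node index formed by adding first_node to the per-VCPU offset, as in the loop above, can run past the last node unless it is wrapped (assuming nodemasks holds one entry per physical node).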

I have just begun my "NUMA trip" and appreciate your suggestions. Thanks.

Best Regards
Ronghui


* [RFC] NUMA support
  2007-11-23  8:42       ` [PATCH 0/4] [HVM][RFC] " Duan, Ronghui
@ 2007-11-23 14:23         ` Andre Przywara
  2007-11-24 15:57           ` Duan, Ronghui
  0 siblings, 1 reply; 7+ messages in thread
From: Andre Przywara @ 2007-11-23 14:23 UTC (permalink / raw)
  To: Duan, Ronghui, xen-devel; +Cc: Anthony.Xu

All,
thanks, Ronghui, for your patches and ideas. To take a more structured 
approach to better NUMA support, I suggest concentrating on 
one-node guests first:
* Introduce CPU affinity to the memory allocation routines called from 
Dom0. This is basically my patch 2/4 from August. We should think about 
using a NUMA node number instead of a physical CPU. Is there anything to 
be said against this?

* Find _some_ method of load balancing when creating guests. Method 
1 from Ronghui is a start, but a real decision based on each node's 
utilization (or free memory) would be more reasonable.

* Patch the guest memory allocation routines to allocate memory from 
that specific node only (based on my patch 3/4).

* Use live migration to localhost to allow node migration. Assuming 
that localhost live migration works reliably (is that really true?), it 
shouldn't be too hard to implement this (basically just using node 
affinity while allocating guest memory). Since this is a rather 
expensive operation (it temporarily takes twice the memory and quite some 
time), I'd suggest that the admin trigger it explicitly via an xm 
command, maybe as an addition to migrate:
# xm migrate --live --node 1 <domid> localhost
There could be some Dom0 daemon-based re-balancer to do this somewhat 
automatically later on.
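
A decision based on free memory, as suggested in the second point, could be as simple as the following sketch. It assumes the caller already has the free page count of every node. pick_node_by_free_memory is a hypothetical helper, not an existing libxc or hypervisor function, and it ignores CPU utilization entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Return the index of the node with the most free pages among those that
 * can hold the whole guest, or -1 if no single node is large enough.
 * On ties the lowest node index wins. */
static int pick_node_by_free_memory(const uint64_t *free_pages,
                                    unsigned int nr_nodes,
                                    uint64_t guest_pages)
{
    int best = -1;

    assert(free_pages != 0);
    for (unsigned int n = 0; n < nr_nodes; n++) {
        if (free_pages[n] < guest_pages)
            continue;           /* guest does not fit on this node */
        if (best < 0 || free_pages[n] > free_pages[best])
            best = (int)n;
    }
    return best;
}
```

For one-node guests a result of -1 would mean that the guest cannot be placed on a single node and the toolstack has to fall back to some other policy.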

I would take care of the memory allocation patch and would look into 
node migration. It would be great if Ronghui or Anthony could help to 
improve the "load balancing" algorithm.

Meanwhile I will continue to patch that d*** Linux kernel to accept both 
CONFIG_NUMA and CONFIG_XEN without crashing that early ;-). This should 
allow both HVM and PV guests to support multiple NUMA nodes within one 
guest.

We should also start a discussion on the config file options to add: 
shall we use "numanodes=<nr of nodes>", something like "numa=on" (for 
one-node guests only), or something like "numanode=0,1" to explicitly 
specify certain nodes?

Any comments are appreciated.

> I read your patches and Anthony's comments, and wrote a patch based on
> them:
> 
> 1:	If the guest sets numanodes=n (the default is 1, which means the
> 	guest is restricted to one node), the hypervisor chooses the
> 	starting node to pin this guest to using round robin. But the
> 	method I use needs a spin_lock to prevent domains from being
> 	created at the same time. Are there any better methods? I hope
> 	for your suggestions.
That's a good start, thank you. Maybe Keir has some comments on the 
spinlock issue.
> 2:	Pass the node parameter in the higher bits of flags when creating
> 	the domain. The domain can then record the node information in
> 	the domain struct for further use, e.g. to show which node to pin
> 	to in setup_guest.
> 	With this method, your patch can balance nodes simply, as below:
> 
>> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
>> +    {
>> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1)+ 		
>> +		domaininfo.first_node;
>> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
>> +    }
How many bits do you want to use? Maybe it's not a good idea to abuse 
some variable to hold a limited number of nodes only ("640K ought to be 
enough for anybody" ;-) But the general idea is good.
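
To make the concern about the number of bits concrete, here is a sketch of what carrying a node number in the upper bits of a 32-bit flags word could look like. The field position and width (8 bits starting at bit 24) and the helper names are assumptions chosen for illustration; they are not taken from Ronghui's patch or from the domain creation interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: the top 8 bits of a 32-bit flags word hold the node. */
#define NODE_SHIFT 24u
#define NODE_BITS  8u
#define NODE_MASK  (((UINT32_C(1) << NODE_BITS) - 1) << NODE_SHIFT)

/* Store node in the upper bits of *flags and keep all other bits.
 * Returns -1 and leaves *flags untouched if node does not fit. */
static int flags_set_node(uint32_t *flags, unsigned int node)
{
    assert(flags != 0);
    if (node >= (1u << NODE_BITS))
        return -1;
    *flags = (*flags & ~NODE_MASK) | ((uint32_t)node << NODE_SHIFT);
    return 0;
}

/* Extract the node number stored by flags_set_node(). */
static unsigned int flags_get_node(uint32_t flags)
{
    return (flags & NODE_MASK) >> NODE_SHIFT;
}
```

The hard limit is visible in flags_set_node(): with 8 bits only nodes 0 to 255 can be expressed, and every bit given to the node field is a bit no longer available for real flags. Such a field also holds one starting node, not a set of nodes.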

Regards,
Andre.


* RE: [RFC] NUMA support
  2007-11-23 14:23         ` [RFC] NUMA support Andre Przywara
@ 2007-11-24 15:57           ` Duan, Ronghui
  0 siblings, 0 replies; 7+ messages in thread
From: Duan, Ronghui @ 2007-11-24 15:57 UTC (permalink / raw)
  To: Andre Przywara, xen-devel; +Cc: Xu, Anthony

Hi all,
>thanks, Ronghui, for your patches and ideas. To take a more structured
>approach to better NUMA support, I suggest concentrating on
>one-node guests first:

That is exactly what we want to do at first, i.e. not support guest NUMA
yet.

>* Introduce CPU affinity to the memory allocation routines called from
>Dom0. This is basically my patch 2/4 from August. We should think about
>using a NUMA node number instead of a physical CPU. Is there anything to
>be said against this?

I think it is reasonable to bind a guest to a node rather than to a CPU.

>* Find _some_ method of load balancing when creating guests. Method
>1 from Ronghui is a start, but a real decision based on each node's
>utilization (or free memory) would be more reasonable.

Yes, it is only a start for balancing.

>* Patch the guest memory allocation routines to allocate memory from
>that specific node only (based on my patch 3/4).

Considering the performance, we should do it.

>* Use live migration to localhost to allow node migration. Assuming
>that localhost live migration works reliably (is that really true?), it
>shouldn't be too hard to implement this (basically just using node
>affinity while allocating guest memory). Since this is a rather
>expensive operation (it temporarily takes twice the memory and quite some
>time), I'd suggest that the admin trigger it explicitly via an xm
>command, maybe as an addition to migrate:
># xm migrate --live --node 1 <domid> localhost
>There could be some Dom0 daemon-based re-balancer to do this somewhat
>automatically later on.
>
>I would take care of the memory allocation patch and would look into
>node migration. It would be great if Ronghui or Anthony could help to
>improve the "load balancing" algorithm.

I have no ideas on this yet.

>Meanwhile I will continue to patch that d*** Linux kernel to accept both
>CONFIG_NUMA and CONFIG_XEN without crashing that early ;-). This should
>allow both HVM and PV guests to support multiple NUMA nodes within one
>guest.
>
>We should also start a discussion on the config file options to add:
>shall we use "numanodes=<nr of nodes>", something like "numa=on" (for
>one-node guests only), or something like "numanode=0,1" to explicitly
>specify certain nodes?

Because we don't support guest NUMA for now, we don't need these config
options yet. If we need to support guest NUMA, I think users may even
want to configure each node's layout, i.e. how many CPUs or how much
memory is in that node. I think that would be too complicated. ^_^

>Any comments are appreciated.
>
>> I read your patches and Anthony's comments, and wrote a patch based on
>> them:
>>
>> 1:	If the guest sets numanodes=n (the default is 1, which means the
>> 	guest is restricted to one node), the hypervisor chooses the
>> 	starting node to pin this guest to using round robin. But the
>> 	method I use needs a spin_lock to prevent domains from being
>> 	created at the same time. Are there any better methods? I hope
>> 	for your suggestions.
>That's a good start, thank you. Maybe Keir has some comments on the
>spinlock issue.
>> 2:	Pass the node parameter in the higher bits of flags when creating
>> 	the domain. The domain can then record the node information in
>> 	the domain struct for further use, e.g. to show which node to pin
>> 	to in setup_guest.
>> 	With this method, your patch can balance nodes simply, as below:
>>
>>> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
>>> +    {
>>> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1)+
>>> +		domaininfo.first_node;
>>> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
>>> +    }
>How many bits do you want to use? Maybe it's not a good idea to abuse
>some variable to hold a limited number of nodes only ("640K ought to be
>enough for anybody" ;-) But the general idea is good.
Actually, if there is no need to support guest NUMA, no parameter needs
to be passed down.
It seems that one node per guest is a good approach. ^_^

Best regards,
Ronghui
