* Xen 3.4.1 NUMA support
From: Papagiannis Anastasios @ 2009-11-04 12:02 UTC
To: xen-devel

Hello,

Does the latest version of Xen (3.4.1) support NUMA machines? Is there a PDF or a link that gives more details about it? I am working on a project on Xen performance on NUMA machines, and in Xen 3.3.0 this performance isn't good. Has anything changed in the latest version?

Thanks in advance,
Papagiannis Anastasios
* Re: Xen 3.4.1 NUMA support
From: Keir Fraser @ 2009-11-04 12:32 UTC
To: Papagiannis Anastasios, xen-devel@lists.xensource.com

Add the Xen boot parameter 'numa=on' to enable NUMA detection. It is then up to you to, for example, pin domains to specific nodes using the 'cpus=...' option in the domain config file. See /etc/xen/xmexample1 for an example of its usage.

 -- Keir
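The value given to 'cpus=...' is a string of CPU numbers and ranges, such as "6-11" or "0,2-4". The helper below is a hypothetical sketch (cpus_option is not part of xm) that turns the list of physical CPUs belonging to a node into that notation. It assumes you already know which CPUs make up the target node; the node layout in the example is invented.

```python
def cpus_option(cpu_ids):
    """Format physical CPU numbers as a compact 'cpus=' string, e.g. "0,2-4"."""
    cpus = sorted(set(cpu_ids))
    if not cpus:
        return ""  # nothing to pin to
    ranges = []
    start = prev = cpus[0]
    for cpu in cpus[1:]:
        if cpu == prev + 1:  # still inside a contiguous run of CPUs
            prev = cpu
            continue
        ranges.append((start, prev))
        start = prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else "%d-%d" % (a, b) for a, b in ranges)


# Hypothetical layout: node 0 owns CPUs 0-5, node 1 owns CPUs 6-11.
node_to_cpu = {0: [0, 1, 2, 3, 4, 5], 1: [6, 7, 8, 9, 10, 11]}

cpus = cpus_option(node_to_cpu[1])
print('cpus = "%s"' % cpus)  # cpus = "6-11"
```

Pasting the printed line into a domain config file restricts all of that domain's vCPUs to the CPUs of node 1.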
* RE: Xen 3.4.1 NUMA support
From: Dan Magenheimer @ 2009-11-06 18:07 UTC
To: Keir Fraser, Papagiannis Anastasios, xen-devel
Cc: George Dunlap

VMware has the notion of a "cell": VMs can be scheduled only within a cell, not across cells. Cell boundaries are determined by VMware by default, though certain settings can override them.

An interesting project might be to implement "numa=cell" for Xen. Or maybe something similar is already in George Dunlap's scheduler plans?
* Re: Xen 3.4.1 NUMA support
From: George Dunlap @ 2009-11-09 11:33 UTC
To: Dan Magenheimer
Cc: xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

I haven't had time to look at NUMA stuff at all. I probably will look at it eventually, if no one else does, but I'd be happy if someone else could pursue it.

 -George
* Re: Xen 3.4.1 NUMA support
From: Dulloor @ 2009-11-09 11:39 UTC
To: George Dunlap
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

George,

What's the current scope and status of your scheduler work? Is it going to look similar to the Linux scheduler (with scheduling domains et al.)? If so, topology is already accounted for, to a large extent. It would be good to know, so that I can work on something that doesn't overlap.

 -dulloor
* Re: Xen 3.4.1 NUMA support
From: George Dunlap @ 2009-11-09 12:29 UTC
To: Dulloor
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

On Mon, Nov 9, 2009 at 11:39 AM, Dulloor <dulloor@gmail.com> wrote:
> What's the current scope and status of your scheduler work ? Is it
> going to look similar to the Linux scheduler (with scheduling domains,
> et al). In that case, topology is already accounted for, to a large
> extent. It would be good to know so that I can work on something that
> doesn't overlap.

My plan was to do something similar to Linux, but with one difference: instead of having one runqueue per logical processor (as both Xen and Linux currently do), and having "domains" all the way up (as Linux currently does), I had planned on having one runqueue per L2 processor cache. The main reason to avoid migration is to preserve a warm cache; but since L1s are replaced so quickly, there should be little impact from a VM migrating between threads and cores that share the same L2.

Above the L2s I was planning on having an idea similar to the Linux "domains" (although obviously it would need a different name to avoid confusion), and doing explicit load balancing between them. But as I have not had a chance to test this kind of load balancing yet, the plan may change somewhat before then.

The problem to solve with respect to NUMA, as I understand it, is balancing the performance cost of sharing a busy local CPU against the performance cost of non-local memory accesses. This would involve adding NUMA logic to the load-balancing algorithm, which I guess would depend in part on having a load-balancing algorithm to begin with. :-)

Once I have the basic credit patches in working order, would you be interested in working on the load balancing between runqueues? I can then work on further testing of the credit algorithm. My ultimate goal would be to have a basic regression test that people could use to measure how their changes to the scheduler affect a wide variety of workloads.

 -George
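The reasoning behind one runqueue per L2 cache can be illustrated with a small grouping function. This is a sketch under stated assumptions, not George's code: the CPU-to-L2 mapping is a hypothetical input (a real implementation would derive it from the hardware topology), and the 16-CPU layout is invented.

```python
from collections import defaultdict


def runqueues_by_l2(l2_of_cpu):
    """Build one runqueue per shared L2 cache.

    l2_of_cpu maps logical CPU number -> id of the L2 cache that CPU uses.
    Returns {l2_id: [logical CPUs served by that runqueue]}.
    """
    queues = defaultdict(list)
    for cpu in sorted(l2_of_cpu):
        queues[l2_of_cpu[cpu]].append(cpu)
    return dict(queues)


def cheap_migration(l2_of_cpu, src, dst):
    """A move between CPUs behind the same L2 only loses L1 contents, which
    are replaced quickly anyway; that is why those CPUs can share a runqueue."""
    return l2_of_cpu[src] == l2_of_cpu[dst]


# Hypothetical box: 2 sockets, 2 L2 caches per socket, each L2 shared by
# 2 cores x 2 hyperthreads = 4 logical CPUs.
l2_of_cpu = {cpu: cpu // 4 for cpu in range(16)}

print(runqueues_by_l2(l2_of_cpu))
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11], 3: [12, 13, 14, 15]}
print(cheap_migration(l2_of_cpu, 1, 3), cheap_migration(l2_of_cpu, 3, 4))
# True False
```

Only moves that cross an L2 boundary (for example CPU 3 to CPU 4 here) would then be decisions for the higher-level load balancer.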
* Re: Xen 3.4.1 NUMA support
From: Dulloor @ 2009-11-09 12:51 UTC
To: George Dunlap
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

Sure! Let me know when you have the patches ready. That might also be a good time to see whether one runqueue per L2 works better.

 -dulloor
* Re: Xen 3.4.1 NUMA support
From: Juergen Gross @ 2009-11-09 11:44 UTC
To: George Dunlap
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

Cpupools? :-)

NUMA was a topic I wanted to look at as soon as cpupools are officially accepted. Keir wanted to propose a way to get rid of the function continue_hypercall_on_cpu(), which caused most of the problems that led to the objections to cpupools. I guess Keir had some higher-priority jobs. :-)

So I will try a new cpupools patch without continue_hypercall_on_cpu(), and perhaps with NUMA support.

George, would this be okay for you? I think your scheduler will still have problems with domain weights as long as domains are restricted to some processors, right?

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6              Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6              Internet: ts.fujitsu.com
D-81739 Muenchen              Company details: ts.fujitsu.com/imprint.html
* Re: Xen 3.4.1 NUMA support
From: George Dunlap @ 2009-11-09 12:07 UTC
To: Juergen Gross
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

On Mon, Nov 9, 2009 at 11:44 AM, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> George, would this be okay for you? I think your scheduler still will have
> problems with domain weights as long as domains are restricted to some
> processors, right?

Hmm, this may be a point of discussion at some point. My plan was actually to have one runqueue per L2 processor cache. Thus as many as 4 cores (and possibly 8 hyperthreads) would share the same runqueue; doing CPU pinning within the same runqueue would be problematic.

I was planning on having credits work mainly within one runqueue, and then doing load balancing between runqueues. In that case pinning to a specific runqueue shouldn't cause a problem, because the credits of one runqueue wouldn't affect the credits of another.

However, I haven't implemented or tested this idea yet; it's possible that keeping credits distinct and load balancing between runqueues will cause unacceptable levels of unfairness. I expect it to be fine (especially since Linux's scheduler does this kind of load balancing, but doesn't share runqueues between logical processors), but without implementation and testing I can't say for sure.

Thoughts are welcome at this point, but it will probably be better to have a real discussion once I've posted some patches.

 -George
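George's proposal (credits accounted within each runqueue, explicit load balancing between runqueues) can be sketched as two independent steps. Everything below is hypothetical: the function names, using queue length as the load metric, and the weights are illustrative assumptions, and the balancer is deliberately naive. The sketch does show why pinning a vCPU to one runqueue leaves the credit split of another runqueue untouched.

```python
def distribute_credits(runqueue, weights, budget):
    """Split one runqueue's credit budget among its vCPUs in proportion to
    their weights. Only vCPUs on this runqueue take part."""
    total = sum(weights[v] for v in runqueue)
    return {v: budget * weights[v] / total for v in runqueue}


def balance(queues, pinned):
    """Naive balancer: move unpinned vCPUs from the longest runqueue to the
    shortest until their lengths differ by at most one.

    queues: dict runqueue id -> list of vCPU names (modified in place)
    pinned: set of vCPU names that must stay on their current runqueue
    """
    while True:
        busiest = max(queues, key=lambda q: len(queues[q]))
        idlest = min(queues, key=lambda q: len(queues[q]))
        if len(queues[busiest]) - len(queues[idlest]) <= 1:
            return queues
        movable = [v for v in queues[busiest] if v not in pinned]
        if not movable:
            return queues  # give up rather than break pinning
        victim = movable[-1]
        queues[busiest].remove(victim)
        queues[idlest].append(victim)


weights = {"a": 256, "b": 512, "c": 256, "d": 256}
queues = balance({0: ["a", "b", "c", "d"], 1: []}, pinned={"a"})
print(queues)                                       # {0: ['a', 'b'], 1: ['d', 'c']}
print(distribute_credits(queues[0], weights, 300))  # {'a': 100.0, 'b': 200.0}
```

Because distribute_credits only looks at one runqueue, the pinned vCPU "a" competes only with "b"; the vCPUs moved to runqueue 1 have no effect on its share.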
* Re: Xen 3.4.1 NUMA support
From: Keir Fraser @ 2009-11-09 12:40 UTC
To: Juergen Gross, George Dunlap
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Papagiannis Anastasios

On 09/11/2009 11:44, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:

> NUMA was a topic I wanted to look at as soon as cpupools are officially
> accepted. Keir wanted to propose a way to get rid of the function
> continue_hypercall_on_cpu() which was causing most of the stuff leading
> to the objection of cpupools.
> I guess Keir had some higher priority jobs. :-)

Well, I forgot about it. I think the plan was to perhaps keep something like continue_hypercall_on_cpu(), but without actually running the vcpu itself 'over there': instead, schedule a tasklet or some such there, and sleep on its completion. That would get rid of the skanky affinity hacks you had to do to support continue_hypercall_on_cpu(). I'll have a look back at what we discussed.

 -- Keir
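Keir's suggestion replaces "move the calling vcpu to the target CPU" with "queue the work on the target CPU and sleep until it completes". The Python sketch below is only an analogy of that control flow, with threads standing in for CPUs; it is not Xen code, and CpuWorker, call_and_wait and where_am_i are made-up names.

```python
import queue
import threading


class CpuWorker:
    """Toy stand-in for a per-CPU tasklet queue: work handed to it always
    runs on the worker's own thread, never on the caller's."""

    def __init__(self, cpu):
        self.cpu = cpu
        self._work = queue.Queue()
        threading.Thread(target=self._run, name="cpu%d" % cpu, daemon=True).start()

    def _run(self):
        while True:
            fn, args, box, done = self._work.get()
            try:
                box["result"] = fn(*args)
            except Exception as exc:  # hand errors back to the waiting caller
                box["error"] = exc
            finally:
                done.set()

    def call_and_wait(self, fn, *args):
        """Queue fn on this 'CPU' and sleep until it completes; the caller's
        own affinity is never changed."""
        box, done = {}, threading.Event()
        self._work.put((fn, args, box, done))
        done.wait()
        if "error" in box:
            raise box["error"]
        return box["result"]


def where_am_i(x):
    return threading.current_thread().name, x * 2


workers = {cpu: CpuWorker(cpu) for cpu in range(4)}
print(workers[3].call_and_wait(where_am_i, 21))  # ('cpu3', 42)
```

The caller never migrates: it only blocks on an event, which is the property that would make affinity hacks unnecessary.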
* Re: Xen 3.4.1 NUMA support
From: Andre Przywara @ 2009-11-09 15:02 UTC
To: Dan Magenheimer
Cc: George Dunlap, xen-devel, Keir Fraser, Papagiannis Anastasios

Dan Magenheimer wrote:
>> Add Xen boot parameter 'numa=on' to enable NUMA detection.
>> Then it's up to you to, for example, pin domains to specific nodes,
>> using the 'cpus=...' option in the domain config file. See
>> /etc/xen/xmexample1 for an example of its usage.
> VMware has the notion of a "cell" where VMs can be
> scheduled only within a cell, not across cells.
> Cell boundaries are determined by VMware by
> default, though certain settings can override them.

Well, if I got this right, you are describing the current behaviour of Xen. It has had a similar feature for some time now (since 3.3, I guess). When you launch a domain on a machine booted with numa=on, it will pick the least busy node (that can hold the requested memory) and restrict the domain to that node (by only allowing the CPUs of that node). This is in XendDomainInfo.py (c/s 17131, 17247, 17709).

It looks like this:

    (kernel xen.gz numa=on dom0_mem=6144M dom0_max_vcpus=6 dom0_vcpus_pin)
    # xm create opensuse.hvm
    # xm create opensuse2.hvm
    # xm vcpu-list
    Name        ID  VCPU  CPU  State  Time(s)  CPU Affinity
    001-LTP      1     0    6  -b-       17.8  6-11
    001-LTP      1     1    7  -b-        6.3  6-11
    002-LTP      2     0   12  -b-       19.0  12-17
    002-LTP      2     1   16  -b-        1.6  12-17
    002-LTP      2     2   17  -b-        1.7  12-17
    002-LTP      2     3   14  -b-        1.6  12-17
    002-LTP      2     4   16  -b-        1.6  12-17
    002-LTP      2     5   15  -b-        1.5  12-17
    002-LTP      2     6   12  -b-        1.3  12-17
    002-LTP      2     7   13  -b-        1.8  12-17
    Domain-0     0     0    0  -b-       12.6  0
    Domain-0     0     1    1  -b-        7.6  1
    Domain-0     0     2    2  -b-        8.0  2
    Domain-0     0     3    3  -b-       14.6  3
    Domain-0     0     4    4  r--        1.4  4
    Domain-0     0     5    5  -b-        0.9  5
    # xm debug-keys U
    (XEN) Domain 0 (total: 2097152):
    (XEN)     Node 0: 2097152
    (XEN)     Node 1: 0
    (XEN)     Node 2: 0
    (XEN)     Node 3: 0
    (XEN)     Node 4: 0
    (XEN)     Node 5: 0
    (XEN)     Node 6: 0
    (XEN)     Node 7: 0
    (XEN) Domain 1 (total: 394219):
    (XEN)     Node 0: 0
    (XEN)     Node 1: 394219
    (XEN)     Node 2: 0
    (XEN)     Node 3: 0
    (XEN)     Node 4: 0
    (XEN)     Node 5: 0
    (XEN)     Node 6: 0
    (XEN)     Node 7: 0
    (XEN) Domain 2 (total: 394219):
    (XEN)     Node 0: 0
    (XEN)     Node 1: 0
    (XEN)     Node 2: 394219
    (XEN)     Node 3: 0
    (XEN)     Node 4: 0
    (XEN)     Node 5: 0
    (XEN)     Node 6: 0
    (XEN)     Node 7: 0

Note that there were no cpus= lines in the config files; Xen did that automatically.

Domains can be localhost-migrated to another node:

    # xm migrate --node=4 1 localhost

The only issue is with domains larger than a node. If someone has a useful use case, I can start rebasing my old patches for NUMA-aware HVM domains to Xen unstable.

Regards,
Andre.

BTW: Shouldn't we finally make numa=on the default?

-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
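The placement Andre describes (pick the least busy node that can still hold the requested memory) amounts to a filter followed by a minimum. In this sketch the load metric (vCPUs already placed per node), the MB units, and returning None for a guest larger than any node are assumptions; the xend lines quoted later in the thread select on nodeload alone, and pick_node is a hypothetical name.

```python
def pick_node(node_load, node_free_mb, mem_mb):
    """Return the least-loaded node that can hold mem_mb of guest memory,
    or None if no single node is big enough."""
    fitting = [n for n in range(len(node_load)) if node_free_mb[n] >= mem_mb]
    if not fitting:
        return None  # guest larger than any node: leave affinity unrestricted
    return min(fitting, key=lambda n: node_load[n])


# Hypothetical 4-node host: vCPUs already placed per node, free MB per node.
node_load = [6, 2, 8, 0]
node_free_mb = [0, 4096, 4096, 1024]

print(pick_node(node_load, node_free_mb, 2048))  # 1 (node 3 is idle but too small)
print(pick_node(node_load, node_free_mb, 8192))  # None
```

The None case corresponds to the "domains larger than a node" issue mentioned above; what to do there is left open.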
* Re: Xen 3.4.1 NUMA support
From: George Dunlap @ 2009-11-09 15:06 UTC
To: Andre Przywara
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

Andre Przywara wrote:
> BTW: Shouldn't we set finally numa=on as the default value?
>
Is there any data to support the idea that this helps significantly on common systems?

 -George
* Re: Xen 3.4.1 NUMA support
From: Andre Przywara @ 2009-11-09 22:51 UTC
To: George Dunlap
Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

George Dunlap wrote:
> Andre Przywara wrote:
>> BTW: Shouldn't we set finally numa=on as the default value?
>>
> Is there any data to support the idea that this helps significantly on
> common systems?

I don't have any numbers handy, but I will see if I can generate some.

From a high-level perspective it is a shame that it's not the default. With numa=off, the Xen domain loader will allocate physical memory from some node (maybe even from several nodes) and will schedule the guest on some other (even rapidly changing) nodes. According to Murphy's law, you will end up with _all_ of a guest's memory accesses being remote. Yet a NUMA architecture is actually very beneficial for virtualization: since there are close to zero cross-domain memory accesses (except for Dom0), each node is more or less self-contained, and each guest can use its node's memory controller almost exclusively.

But this is all spoiled because most people don't know about Xen's NUMA capabilities and don't set numa=on. Making it the default would solve this.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
* Re: Xen 3.4.1 NUMA support
From: Dulloor @ 2009-11-10 6:56 UTC
To: Andre Przywara
Cc: George Dunlap, Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

I am not finding this. Can you please point me to the code?

numa=on/off only controls setting up NUMA in Xen (similar to the Linux knob, but turned off by default). The allocation of memory from a single node that you observe could be due to the way alloc_heap_pages is implemented (it tries to allocate from all the heaps of one node before trying the next one); try looking at the dump_numa output. And affinities are not set anywhere based on the node from which the allocation happens.

 -dulloor
* Re: Xen 3.4.1 NUMA support
From: Andre Przywara @ 2009-11-10 7:49 UTC
To: Dulloor
Cc: George Dunlap, Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios

Dulloor wrote:
> I am not finding this. Can you please point to the code ?

tools/python/xen/xend/XendDomainInfo.py (around line 2600), with the core code being:

-------------
    index = nodeload.index( min(nodeload) )
    cpumask = info['node_to_cpu'][index]
    for v in range(0, self.info['VCPUs_max']):
        xc.vcpu_setaffinity(self.domid, v, cpumask)
--------------

The code was introduced with c/s 17131 and later refined with c/s 17247 and c/s 17709.

>
> numa=on/off is only for setting up numa in xen (similar to the linux
> knob, but turned off by default). The allocation of memory from a
> single node (that you observe) could be because of the way
> alloc_heap_pages is implemented (trying to allocate from all the heaps
> from a node, before trying the next one)

Yes, but if the domain is pinned before it allocates its memory, then the natural behaviour of Xen is to take memory from this local node.

> - try looking at dump_numa
> output. And, affinities are not set anywhere based on the node from
> which allocation happens.

It is the other way round: first the domain is pinned, then the memory is allocated (from the node to which the currently scheduled CPU belongs).

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
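The ordering Andre points out (pin first, allocate afterwards) can be shown with a toy model. This is not Xen or xend code: build_domain is a hypothetical function, and it assumes, following Andre's description, that memory comes from the node of the CPU the domain is running on, and that a pinned vCPU 0 runs on the first CPU of its mask if its current CPU is not allowed. The node-selection lines mirror the xend excerpt quoted above.

```python
def build_domain(node_to_cpu, nodeload, nr_vcpus, running_on, pin_first=True):
    """Toy model: optionally pin all vCPUs to the least-loaded node, then
    'allocate' memory from the node of the CPU vCPU 0 ends up running on.

    Returns (affinity per vCPU, node the memory came from). nr_vcpus >= 1.
    """
    all_cpus = [c for cpus in node_to_cpu for c in cpus]
    affinity = {v: all_cpus for v in range(nr_vcpus)}
    if pin_first:
        # Same selection as the quoted xend lines.
        index = nodeload.index(min(nodeload))
        cpumask = node_to_cpu[index]
        for v in range(nr_vcpus):
            affinity[v] = cpumask
    # vCPU 0 keeps its current CPU if allowed, else moves into its mask.
    cpu = running_on if running_on in affinity[0] else affinity[0][0]
    mem_node = next(n for n, cpus in enumerate(node_to_cpu) if cpu in cpus)
    return affinity, mem_node


node_to_cpu = [[0, 1], [2, 3], [4, 5]]  # hypothetical 3-node host
print(build_domain(node_to_cpu, [3, 0, 1], 2, running_on=0)[1])                   # 1
print(build_domain(node_to_cpu, [3, 0, 1], 2, running_on=0, pin_first=False)[1])  # 0
```

With pinning, memory lands on the chosen node 1; without it, memory comes from whichever node the domain happened to start on.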
* Re: Xen 3.4.1 NUMA support 2009-11-09 15:06 ` George Dunlap 2009-11-09 22:51 ` Andre Przywara @ 2009-11-13 14:14 ` Andre Przywara 2009-11-13 14:29 ` Ian Pratt 2009-11-13 14:31 ` Keir Fraser 1 sibling, 2 replies; 30+ messages in thread From: Andre Przywara @ 2009-11-13 14:14 UTC (permalink / raw) To: George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios George Dunlap wrote: > Andre Przywara wrote: >> BTW: Shouldn't we set finally numa=on as the default value? >> > Is there any data to support the idea that this helps significantly on > common systems? I did some tests on an 8 node machine. I will retry this later on 4-nodes and 2-nodes systems, but I assume similar numbers. I used multiple guests in parallel each running bw_mem of lmbench, which is admittedly quite NUMA sensitive. I cannot publish real numbers (yet?), but the results were dramatic: with numa=on I got the same results for each guest (the same as the native result) when the number of guests was smaller or equal the number of nodes (since each guest got it's own memory controller). If I disabled NUMA aware placement by explicitly specifying cpus="0-31" in the config file or booted with numa=off, the values dropped down by factor 3-5 (!) (even for a few guests) with some variance due to the random nature of core to memory mapping. Overcommitting the nodes (letting multiple guests use each node) lowered the values to about 80% for two guests and 60% for three guests per node, but it never got anywhere close to the numa=off values. So these results encourage me again to opt for numa=on as the default value. Keir, I will check if dropping the node containment in the CPU overcommitment case is an option, but what would be the right strategy in that case? Warn the user? Don't contain at all? Contain to more than onde node? Regards, Andre. 
-- Andre Przywara
* RE: Xen 3.4.1 NUMA support 2009-11-13 14:14 ` Andre Przywara @ 2009-11-13 14:29 ` Ian Pratt 2009-11-13 15:25 ` George Dunlap 2009-11-13 15:27 ` Keir Fraser 2009-11-13 14:31 ` Keir Fraser 1 sibling, 2 replies; 30+ messages in thread From: Ian Pratt @ 2009-11-13 14:29 UTC To: Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Ian Pratt, Keir Fraser, Papagiannis Anastasios > Overcommitting the nodes (letting multiple guests use each node) lowered > the values to about 80% for two guests and 60% for three guests per > node, but it never got anywhere close to the numa=off values. > So these results encourage me again to opt for numa=on as the default > value. > Keir, I will check if dropping the node containment in the CPU > overcommitment case is an option, but what would be the right strategy > in that case? > Warn the user? > Don't contain at all? > Contain to more than one node? In the case where a VM is asking for more vCPUs than there are pCPUs in a node, we should contain the guest to multiple nodes. (I presume we favour nodes according to the number of vCPUs they already have committed to them?) We should turn off automatic node containment of any kind if the total number of pCPUs in the system is <= 8 -- on such systems the statistical multiplexing gain of having access to more pCPUs likely outweighs the NUMA placement benefit, and memory striping will be a better strategy. I'm inclined to believe that may be true for 2-node systems with <= 16 pCPUs too, under many workloads. I'd really like to see us enumerate pCPUs in a sensible order so that it's easier to see the topology. It should be nodes.sockets.cores{.threads}, leaving gaps for missing execution units due to hot plug or non power of two packing. Right now we're inconsistent in the enumeration order depending on how the BIOS has set things up. It would be great if someone could volunteer to fix this...
Ian
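Ian's proposed enumeration order can be sketched as follows. This is a hypothetical Python helper, not anything in Xen or xend: it assumes each CPU's (node, socket, core, thread) position is already known, and that the per-level counts are machine-wide maxima, so an absent core or thread simply leaves its logical ID unused.

```python
from typing import Dict, NamedTuple


class Topo(NamedTuple):
    node: int
    socket: int  # socket index within its node
    core: int    # core index within its socket
    thread: int  # thread index within its core


def canonical_ids(topology: Dict[int, Topo], sockets_per_node: int,
                  cores_per_socket: int, threads_per_core: int) -> Dict[int, int]:
    """Map firmware CPU numbers to IDs ordered as node.socket.core.thread.

    IDs are computed positionally, so a missing execution unit (hot-plug
    slot, non-power-of-two packing) leaves a gap instead of shifting the
    numbering of every CPU after it.
    """
    limits = (sockets_per_node, cores_per_socket, threads_per_core)
    ids = {}
    for cpu, t in topology.items():
        if any(v >= lim for v, lim in zip((t.socket, t.core, t.thread), limits)):
            raise ValueError(f"CPU {cpu} position {t} exceeds the declared limits")
        ids[cpu] = ((t.node * sockets_per_node + t.socket) * cores_per_socket
                    + t.core) * threads_per_core + t.thread
    return ids


# A BIOS that interleaves nodes: CPUs 0 and 2 are on node 0, CPUs 1 and 3 on node 1.
interleaved = {0: Topo(0, 0, 0, 0), 1: Topo(1, 0, 0, 0),
               2: Topo(0, 0, 1, 0), 3: Topo(1, 0, 1, 0)}
print(canonical_ids(interleaved, 1, 2, 1))  # {0: 0, 1: 2, 2: 1, 3: 3}
```

Sorting CPUs by these IDs yields the node-major order Ian describes; a toolstack could equally keep the tuples themselves as the user-visible names.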
* Re: Xen 3.4.1 NUMA support 2009-11-13 14:29 ` Ian Pratt @ 2009-11-13 15:25 ` George Dunlap 2009-11-13 15:35 ` Ian Pratt 2009-11-13 15:27 ` Keir Fraser 1 sibling, 1 reply; 30+ messages in thread From: George Dunlap @ 2009-11-13 15:25 UTC To: Ian Pratt Cc: Andre Przywara, Dan Magenheimer, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios Ian Pratt wrote: > In the case where a VM is asking for more vCPUs than there are pCPUs in a node, we should contain the guest to multiple nodes. (I presume we favour nodes according to the number of vCPUs they already have committed to them?) Seems like CPU load might be a better measure. Xen doesn't calculate load currently, but it's on my list of things to do. -George
* RE: Xen 3.4.1 NUMA support 2009-11-13 15:25 ` George Dunlap @ 2009-11-13 15:35 ` Ian Pratt 0 siblings, 0 replies; 30+ messages in thread From: Ian Pratt @ 2009-11-13 15:35 UTC To: George Dunlap Cc: Andre Przywara, Dan Magenheimer, xen-devel@lists.xensource.com, Ian Pratt, Keir Fraser, Papagiannis Anastasios > Ian Pratt wrote: > > In the case where a VM is asking for more vCPUs than there are pCPUs in a > node, we should contain the guest to multiple nodes. (I presume we favour > nodes according to the number of vCPUs they already have committed to > them?) > > Seems like CPU load might be a better measure. Xen doesn't calculate > load currently, but it's on my list of things to do. I'd rather get this stuff fixed now than wait for the new scheduler. It's not clear that instantaneous CPU load is any better than just counting the number of vCPUs. The XCP xapi stack also records good historical data, and would be in a better position to do the placement. That is further work. Ian
* Re: Xen 3.4.1 NUMA support 2009-11-13 14:29 ` Ian Pratt 2009-11-13 15:25 ` George Dunlap @ 2009-11-13 15:27 ` Keir Fraser 2009-11-13 15:40 ` Ian Pratt 1 sibling, 1 reply; 30+ messages in thread From: Keir Fraser @ 2009-11-13 15:27 UTC To: Ian Pratt, Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Papagiannis Anastasios On 13/11/2009 14:29, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote: > I'd really like to see us enumerate pCPUs in a sensible order so that it's > easier to see the topology. It should be nodes.sockets.cores{.threads}, > leaving gaps for missing execution units due to hot plug or non power of two > packing. > Right now we're inconsistent in the enumeration order depending on how the > BIOS has set things up. It would be great if someone could volunteer to fix > this... Even better would be to have pCPUs addressable and listable explicitly as dotted tuples. That can be implemented entirely within the toolstack, and could even allow wildcarding of tuple components to efficiently express cpumasks. -- Keir
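Keir's wildcard idea might look like this in a toolstack. The pattern syntax (an integer or `*` per component, trailing components optional, commas for unions) and the function names are assumptions made for illustration; no such option exists in xm. The input is a CPU-to-(node, socket, core, thread) table, however it was obtained.

```python
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int, int]  # (node, socket, core, thread)


def match_pattern(pattern: str, cpu_to_coord: Dict[int, Coord]) -> List[int]:
    """CPUs whose coordinates match one dotted pattern such as '1.*.0'.

    Each component is an integer or '*'; omitted trailing components act
    as '*', so '1' selects every CPU on node 1.
    """
    parts = pattern.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1-4 dotted components: {pattern!r}")
    wanted: List[Optional[int]] = [None if p == "*" else int(p) for p in parts]
    wanted += [None] * (4 - len(wanted))
    return sorted(cpu for cpu, coord in cpu_to_coord.items()
                  if all(w is None or w == c for w, c in zip(wanted, coord)))


def expand_cpumask(spec: str, cpu_to_coord: Dict[int, Coord]) -> List[int]:
    """Union of comma-separated patterns, e.g. '0.0.1,1.*.*.1'."""
    cpus = set()
    for pattern in spec.split(","):
        cpus.update(match_pattern(pattern.strip(), cpu_to_coord))
    return sorted(cpus)


# Example machine: 2 nodes x 1 socket x 2 cores x 2 threads.
EXAMPLE = {n * 4 + c * 2 + t: (n, 0, c, t)
           for n in range(2) for c in range(2) for t in range(2)}
print(expand_cpumask("1", EXAMPLE))              # [4, 5, 6, 7]
print(expand_cpumask("*.*.0.0", EXAMPLE))        # [0, 4]
print(expand_cpumask("0.0.1,1.*.*.1", EXAMPLE))  # [2, 3, 5, 7]
```

The expanded list is an ordinary cpumask, so it could be handed to whatever already consumes a `cpus=` list; the wildcard layer lives purely in the toolstack, as Keir suggests.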
* RE: Xen 3.4.1 NUMA support 2009-11-13 15:27 ` Keir Fraser @ 2009-11-13 15:40 ` Ian Pratt 2009-11-13 16:02 ` Keir Fraser 0 siblings, 1 reply; 30+ messages in thread From: Ian Pratt @ 2009-11-13 15:40 UTC To: Keir Fraser, Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Ian Pratt, Papagiannis, Anastasios > > I'd really like to see us enumerate pCPUs in a sensible order so that it's > > easier to see the topology. It should be nodes.sockets.cores{.threads}, > > leaving gaps for missing execution units due to hot plug or non power of two > > packing. > > Right now we're inconsistent in the enumeration order depending on how the > > BIOS has set things up. It would be great if someone could volunteer to fix > > this... > > Even better would be to have pCPUs addressable and listable explicitly as > dotted tuples. That can be implemented entirely within the toolstack, and > could even allow wildcarding of tuple components to efficiently express > cpumasks. Yes, I'd certainly like to see the toolstack support dotted tuple notation. However, I just don't trust the toolstack to get this right unless xen has already set it up nicely for it with a sensible enumeration and defined sockets-per-node, cores-per-socket and threads-per-core parameters. Xen should provide a clean interface to the toolstack in this respect. Ian
* Re: Xen 3.4.1 NUMA support 2009-11-13 15:40 ` Ian Pratt @ 2009-11-13 16:02 ` Keir Fraser 0 siblings, 0 replies; 30+ messages in thread From: Keir Fraser @ 2009-11-13 16:02 UTC To: Ian Pratt, Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Papagiannis Anastasios On 13/11/2009 15:40, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote: >> Even better would be to have pCPUs addressable and listable explicitly as >> dotted tuples. That can be implemented entirely within the toolstack, and >> could even allow wildcarding of tuple components to efficiently express >> cpumasks. > > Yes, I'd certainly like to see the toolstack support dotted tuple notation. > > However, I just don't trust the toolstack to get this right unless xen has > already set it up nicely for it with a sensible enumeration and defined > sockets-per-node, cores-per-socket and threads-per-core parameters. Xen should > provide a clean interface to the toolstack in this respect. Xen provides a topology-interrogation hypercall which should suffice for tools to build up a {node,socket,core,thread}<->cpuid mapping table. -- Keir
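A sketch of the mapping table Keir describes. The per-CPU arrays mirror the kind of data a topology query returns (node, socket and core ID for each CPU, with None for an absent CPU); the exact field names of Xen's hypercall and its Python binding are not assumed here, and `build_topology_table` is a hypothetical name. Raw socket and core IDs are renumbered densely within their parent, and the thread index is each CPU's rank among the CPUs that share a core.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int, int]  # (node, socket, core, thread)


def build_topology_table(cpu_to_node: List[Optional[int]],
                         cpu_to_socket: List[Optional[int]],
                         cpu_to_core: List[Optional[int]]
                         ) -> Tuple[Dict[int, Coord], Dict[Coord, int]]:
    """Return (cpu -> coord, coord -> cpu) lookup tables."""
    present = [cpu for cpu, node in enumerate(cpu_to_node) if node is not None]
    sockets = defaultdict(set)    # node -> raw socket IDs on that node
    cores = defaultdict(set)      # (node, socket) -> raw core IDs
    siblings = defaultdict(list)  # (node, socket, core) -> CPUs, ascending
    for cpu in present:
        node, sock, core = cpu_to_node[cpu], cpu_to_socket[cpu], cpu_to_core[cpu]
        sockets[node].add(sock)
        cores[(node, sock)].add(core)
        siblings[(node, sock, core)].append(cpu)

    cpu_to_coord: Dict[int, Coord] = {}
    for cpu in present:
        node, sock, core = cpu_to_node[cpu], cpu_to_socket[cpu], cpu_to_core[cpu]
        cpu_to_coord[cpu] = (node,
                             sorted(sockets[node]).index(sock),
                             sorted(cores[(node, sock)]).index(core),
                             siblings[(node, sock, core)].index(cpu))
    coord_to_cpu = {coord: cpu for cpu, coord in cpu_to_coord.items()}
    return cpu_to_coord, coord_to_cpu


# Interleaved enumeration: 2 nodes, one socket each, 2 cores x 2 threads.
to_coord, to_cpu = build_topology_table(
    cpu_to_node=[0, 1, 0, 1, 0, 1, 0, 1],
    cpu_to_socket=[0, 1, 0, 1, 0, 1, 0, 1],
    cpu_to_core=[0, 0, 0, 0, 1, 1, 1, 1])
print(to_coord[2], to_cpu[(1, 0, 1, 1)])  # (0, 0, 0, 1) 7
```

Keying cores by (node, socket) rather than by socket alone keeps the table correct even if one physical package were ever reported as spanning two nodes.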
* Re: Xen 3.4.1 NUMA support 2009-11-13 14:14 ` Andre Przywara 2009-11-13 14:29 ` Ian Pratt @ 2009-11-13 14:31 ` Keir Fraser 2009-11-13 15:38 ` Ian Pratt 1 sibling, 1 reply; 30+ messages in thread From: Keir Fraser @ 2009-11-13 14:31 UTC To: Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Papagiannis Anastasios On 13/11/2009 14:14, "Andre Przywara" <andre.przywara@amd.com> wrote: > Keir, I will check if dropping the node containment in the CPU > overcommitment case is an option, but what would be the right strategy > in that case? > Warn the user? > Don't contain at all? > Contain to more than one node? I would suggest that simply not containing at all (i.e., keeping behaviour equivalent to numa=off) would be safest. -- Keir
* RE: Xen 3.4.1 NUMA support 2009-11-13 14:31 ` Keir Fraser @ 2009-11-13 15:38 ` Ian Pratt 0 siblings, 0 replies; 30+ messages in thread From: Ian Pratt @ 2009-11-13 15:38 UTC To: Keir Fraser, Andre Przywara, George Dunlap Cc: Dan Magenheimer, xen-devel@lists.xensource.com, Ian Pratt, Papagiannis, Anastasios > > Keir, I will check if dropping the node containment in the CPU > > overcommitment case is an option, but what would be the right strategy > > in that case? > > Warn the user? > > Don't contain at all? > > Contain to more than one node? > > I would suggest that simply not containing at all (i.e., keeping behaviour equivalent to numa=off) would be safest. I disagree. In systems with 2 nodes it will use all nodes, which is the same as your proposal[*]. In systems with more nodes it will do placement to some subset. Note that systems with >2 nodes generally have stronger NUMA effects, and these are exactly the systems where node placement is a good thing. [*] Note that numa=off is quite different from just disabling node placement. If node placement is disabled, we still get the benefit of memory striping across nodes, which at least avoids some performance cliffs. Ian
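The competing policies in this sub-thread can be compared with a small sketch. Everything here is hypothetical: `choose_affinity` is not a xend function, node load is simply a count of vCPUs already placed on each node (the measure Ian presumes earlier in the thread), memory fit is ignored, and the 8-pCPU cut-off is Ian's suggested heuristic rather than a measured threshold.

```python
from typing import Dict, List


def choose_affinity(vcpus: int, node_to_cpus: Dict[int, List[int]],
                    node_load: Dict[int, int], policy: str = "multi",
                    small_system_cpus: int = 8) -> List[int]:
    """Return the pCPUs a new guest's vCPUs should be allowed to run on.

    policy="none":  a guest that does not fit in one node gets no
                    containment at all (Keir's suggestion).
    policy="multi": contain the guest to just enough of the least-loaded
                    nodes (Ian's suggestion).
    """
    if policy not in ("none", "multi"):
        raise ValueError(f"unknown policy {policy!r}")
    all_cpus = sorted(c for cpus in node_to_cpus.values() for c in cpus)
    if len(all_cpus) <= small_system_cpus:
        return all_cpus  # small box: multiplexing gain assumed to beat placement
    ranked = sorted((n for n, cpus in node_to_cpus.items() if cpus),
                    key=lambda n: (node_load.get(n, 0), n))
    if vcpus <= len(node_to_cpus[ranked[0]]):
        return sorted(node_to_cpus[ranked[0]])  # fits in the least-loaded node
    if policy == "none":
        return all_cpus
    chosen: List[int] = []
    for node in ranked:  # add least-loaded nodes until every vCPU has a pCPU
        chosen.extend(node_to_cpus[node])
        if len(chosen) >= vcpus:
            break
    return sorted(chosen)


nodes = {n: list(range(4 * n, 4 * n + 4)) for n in range(4)}  # 4 nodes x 4 pCPUs
load = {0: 3, 1: 0, 2: 1, 3: 2}
print(choose_affinity(2, nodes, load))               # [4, 5, 6, 7]
print(choose_affinity(6, nodes, load))               # [4, 5, 6, 7, 8, 9, 10, 11]
print(len(choose_affinity(6, nodes, load, "none")))  # 16
```

On a 2-node machine the two policies return the same mask for any guest that overflows one node, which is Ian's point above; they only diverge once there are more than two nodes to choose from.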
* Re: Xen 3.4.1 NUMA support 2009-11-09 15:02 ` Andre Przywara 2009-11-09 15:06 ` George Dunlap @ 2009-11-09 15:19 ` Jan Beulich 2009-11-10 1:46 ` Ian Pratt ` (2 more replies) 1 sibling, 3 replies; 30+ messages in thread From: Jan Beulich @ 2009-11-09 15:19 UTC To: Andre Przywara, Dan Magenheimer Cc: George Dunlap, xen-devel, Keir Fraser, Papagiannis Anastasios >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>> >BTW: Shouldn't we finally set numa=on as the default value? I'd say no, at least until the default confinement of a guest to a single node gets fixed to properly deal with guests having more vCPU-s than a node's worth of pCPU-s (i.e. I take it for granted that the benefits of not overcommitting CPUs outweigh the drawbacks of cross-node memory accesses at the very least for CPU-bound workloads). Jan
* RE: Xen 3.4.1 NUMA support 2009-11-09 15:19 ` Jan Beulich @ 2009-11-10 1:46 ` Ian Pratt 2009-11-10 8:51 ` Jan Beulich 2009-11-12 16:09 ` Keir Fraser 2009-11-30 15:40 ` [PATCH] tools: avoid over-commitment if numa=on Andre Przywara 2 siblings, 1 reply; 30+ messages in thread From: Ian Pratt @ 2009-11-10 1:46 UTC To: Jan Beulich, Andre Przywara, Dan Magenheimer Cc: George Dunlap, Ian Pratt, xen-devel@lists.xensource.com, Keir Fraser, Papagiannis Anastasios > >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>> > >BTW: Shouldn't we finally set numa=on as the default value? > > I'd say no, at least until the default confinement of a guest to a single > node gets fixed to properly deal with guests having more vCPU-s than > a node's worth of pCPU-s (i.e. I take it for granted that the benefits of > not overcommitting CPUs outweigh the drawbacks of cross-node memory > accesses at the very least for CPU-bound workloads). What default confinement? I thought guests had an all-pCPUs affinity mask by default? I suspect we will get benefits from enabling NUMA even if all the guests have all-pCPUs affinity masks: all guests will have memory striped across all nodes, which is likely better than allocating from one node and then the other. Obviously assigning VMs to node(s) and allocating memory accordingly is the best plan. Ian
* RE: Xen 3.4.1 NUMA support 2009-11-10 1:46 ` Ian Pratt @ 2009-11-10 8:51 ` Jan Beulich 2009-11-10 8:57 ` Keir Fraser 0 siblings, 1 reply; 30+ messages in thread From: Jan Beulich @ 2009-11-10 8:51 UTC To: Ian Pratt Cc: Andre Przywara, Dan Magenheimer, xen-devel@lists.xensource.com, George Dunlap, Keir Fraser, Papagiannis Anastasios >>> Ian Pratt <Ian.Pratt@eu.citrix.com> 10.11.09 02:46 >>> >> >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>> >> >BTW: Shouldn't we finally set numa=on as the default value? >> >> I'd say no, at least until the default confinement of a guest to a single >> node gets fixed to properly deal with guests having more vCPU-s than >> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of >> not overcommitting CPUs outweigh the drawbacks of cross-node memory >> accesses at the very least for CPU-bound workloads). > >What default confinement? I thought guests had an all-pCPUs affinity mask by default? Not with numa=on (see also Andre's post to this effect): the guest will get assigned to a node, and its affinity set to that node's CPUs. Jan
* Re: Xen 3.4.1 NUMA support 2009-11-10 8:51 ` Jan Beulich @ 2009-11-10 8:57 ` Keir Fraser 0 siblings, 0 replies; 30+ messages in thread From: Keir Fraser @ 2009-11-10 8:57 UTC To: Jan Beulich, Ian Pratt Cc: George Dunlap, Andre Przywara, Dan Magenheimer, xen-devel@lists.xensource.com, Papagiannis Anastasios On 10/11/2009 08:51, "Jan Beulich" <JBeulich@novell.com> wrote: >> What default confinement? I thought guests had an all-pCPUs affinity mask by >> default? > > Not with numa=on (see also Andre's post to this effect): the guest will > get assigned to a node, and its affinity set to that node's CPUs. ...And if it didn't, striping would not happen. In fact iirc the default NUMA allocation policy for an all-pcpus domain is in some respects pessimal: vcpu0's initial node gets drained of memory first. I.e., you get *less* 'striping' than you could with numa=off where you might at least get lucky. -- Keir
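The contrast drawn in the last few messages, between striping a guest's memory across nodes and draining vcpu0's node first, can be illustrated with a toy page allocator. This is not Xen's allocator, which manages per-node heaps rather than handing out single pages in a loop; the two functions are hypothetical and only show where a guest's pages end up under each strategy.

```python
from typing import Dict


def fill_first(pages: int, free: Dict[int, int], start_node: int) -> Dict[int, int]:
    """Drain start_node first, then the other nodes in ID order."""
    taken = {node: 0 for node in free}
    for node in [start_node] + sorted(n for n in free if n != start_node):
        grab = min(pages, free[node])
        taken[node] += grab
        pages -= grab
    if pages:
        raise MemoryError("not enough free memory")
    return taken


def striped(pages: int, free: Dict[int, int]) -> Dict[int, int]:
    """Hand out one page per node in turn, skipping nodes that are full."""
    remaining = dict(free)
    taken = {node: 0 for node in free}
    while pages:
        progressed = False
        for node in sorted(remaining):
            if pages and remaining[node]:
                remaining[node] -= 1
                taken[node] += 1
                pages -= 1
                progressed = True
        if not progressed:
            raise MemoryError("not enough free memory")
    return taken


free = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
print(fill_first(1000, free, start_node=2))  # {0: 0, 1: 0, 2: 1000, 3: 0}
print(striped(1000, free))                   # {0: 250, 1: 250, 2: 250, 3: 250}
```

With fill-first, a vCPU that happens to run on node 2 sees all of the guest's memory as local while vCPUs elsewhere see none of it; with striping every node holds a quarter, so no vCPU hits the worst case. That is the "performance cliff" Ian refers to, and why an all-pCPUs domain gains little from numa=on unless allocation is striped.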
* Re: Xen 3.4.1 NUMA support 2009-11-09 15:19 ` Jan Beulich 2009-11-10 1:46 ` Ian Pratt @ 2009-11-12 16:09 ` Keir Fraser 2009-11-30 15:40 ` [PATCH] tools: avoid over-commitment if numa=on Andre Przywara 2 siblings, 0 replies; 30+ messages in thread From: Keir Fraser @ 2009-11-12 16:09 UTC To: Jan Beulich, Andre Przywara, Dan Magenheimer Cc: George Dunlap, xen-devel@lists.xensource.com, Papagiannis Anastasios On 09/11/2009 15:19, "Jan Beulich" <JBeulich@novell.com> wrote: >>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>> >> BTW: Shouldn't we finally set numa=on as the default value? > > I'd say no, at least until the default confinement of a guest to a single > node gets fixed to properly deal with guests having more vCPU-s than > a node's worth of pCPU-s (i.e. I take it for granted that the benefits of > not overcommitting CPUs outweigh the drawbacks of cross-node memory > accesses at the very least for CPU-bound workloads). If this were fixed (e.g., by turning off node locality entirely by default for domains that will not fit into a single node), then I think we could consider numa=on by default. -- Keir
* [PATCH] tools: avoid over-commitment if numa=on 2009-11-09 15:19 ` Jan Beulich 2009-11-10 1:46 ` Ian Pratt 2009-11-12 16:09 ` Keir Fraser @ 2009-11-30 15:40 ` Andre Przywara 2 siblings, 0 replies; 30+ messages in thread From: Andre Przywara @ 2009-11-30 15:40 UTC To: Keir Fraser Cc: George Dunlap, Dan Magenheimer, xen-devel, Papagiannis Anastasios, Jan Beulich [-- Attachment #1: Type: text/plain, Size: 1271 bytes --] Jan Beulich wrote: >>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>> >> BTW: Shouldn't we finally set numa=on as the default value? > > I'd say no, at least until the default confinement of a guest to a single > node gets fixed to properly deal with guests having more vCPU-s than > a node's worth of pCPU-s (i.e. I take it for granted that the benefits of > not overcommitting CPUs outweigh the drawbacks of cross-node memory > accesses at the very least for CPU-bound workloads). That sounds reasonable. Attached is a patch that lifts the restriction of one node per guest if the number of VCPUs is greater than the number of cores per node. This isn't optimal (the best way would be to inform the guest about it, but that is another patchset ;-), but it should address the above concerns. Please apply, Andre. Signed-off-by: Andre Przywara <andre.przywara@amd.com> -- Andre Przywara
[-- Attachment #2: more_NUMA_nodes.patch --]
[-- Type: text/x-patch, Size: 2389 bytes --]

# HG changeset patch
# User Andre Przywara <andre.przywara@amd.com>
# Date 1259594006 -3600
# Node ID bdf4109edffbcc0cbac605a19d2fd7a7459f1117
# Parent abc6183f486e66b5721dbf0313ee0d3460613a99
allocate enough NUMA nodes for all VCPUs

If numa=on, we constrain a guest to one node to keep its memory accesses
local. This will hurt performance if the number of VCPUs is greater than
the number of cores per node. We detect this case now and allocate further
NUMA nodes to allow all VCPUs to run simultaneously.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

diff -r abc6183f486e -r bdf4109edffb tools/python/xen/xend/XendDomainInfo.py
--- a/tools/python/xen/xend/XendDomainInfo.py Mon Nov 30 10:58:23 2009 +0000
+++ b/tools/python/xen/xend/XendDomainInfo.py Mon Nov 30 16:13:26 2009 +0100
@@ -2637,8 +2637,7 @@
                         nodeload[i] = int(nodeload[i] * 16 / len(info['node_to_cpu'][i]))
                     else:
                         nodeload[i] = sys.maxint
-                index = nodeload.index( min(nodeload) )
-                return index
+                return map(lambda x: x[0], sorted(enumerate(nodeload), key=lambda x:x[1]))
 
             info = xc.physinfo()
             if info['nr_nodes'] > 1:
@@ -2648,8 +2647,15 @@
                 for i in range(0, info['nr_nodes']):
                     if node_memory_list[i] >= needmem and len(info['node_to_cpu'][i]) > 0:
                         candidate_node_list.append(i)
-                index = find_relaxed_node(candidate_node_list)
-                cpumask = info['node_to_cpu'][index]
+                best_node = find_relaxed_node(candidate_node_list)[0]
+                cpumask = info['node_to_cpu'][best_node]
+                cores_per_node = info['nr_cpus'] / info['nr_nodes']
+                nodes_required = (self.info['VCPUs_max'] + cores_per_node - 1) / cores_per_node
+                if nodes_required > 1:
+                    log.debug("allocating %d NUMA nodes", nodes_required)
+                    best_nodes = find_relaxed_node(filter(lambda x: x != best_node, range(0,info['nr_nodes'])))
+                    for i in best_nodes[:nodes_required - 1]:
+                        cpumask = cpumask + info['node_to_cpu'][i]
                 for v in range(0, self.info['VCPUs_max']):
                     xc.vcpu_setaffinity(self.domid, v, cpumask)
         return index

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
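For readers who want to experiment with the patch's node-selection logic outside xend, here is a Python 3 re-statement of it as standalone functions. The names and argument shapes are assumptions (xend reads these values from `xc.physinfo()` and the running domains), and `vcpu_count` stands for a precomputed per-node count of vCPUs already placed there, which is presumably what `nodeload` holds before the scaling step shown in the patch. The sketch keeps the patch's 16/pCPUs-per-node scaling, the sort-by-load ranking and the ceiling division, with `//` standing in for Python 2's integer `/`.

```python
import math
from typing import List, Sequence


def rank_nodes(node_to_cpu: Sequence[Sequence[int]], vcpu_count: Sequence[int],
               candidates: Sequence[int]) -> List[int]:
    """Node IDs from least to most loaded, like the patched find_relaxed_node.

    Nodes without CPUs or outside `candidates` get infinite load (the patch
    uses sys.maxint), so they sort last. sorted() is stable, so ties keep
    the lower node ID first.
    """
    loads = [vcpu_count[n] * 16 // len(cpus) if cpus and n in candidates
             else math.inf
             for n, cpus in enumerate(node_to_cpu)]
    return [n for n, _ in sorted(enumerate(loads), key=lambda x: x[1])]


def numa_cpumask(node_to_cpu: Sequence[Sequence[int]], vcpu_count: Sequence[int],
                 candidates: Sequence[int], nr_cpus: int, vcpus_max: int) -> List[int]:
    """Affinity mask: the best node plus enough further nodes for all vCPUs."""
    nr_nodes = len(node_to_cpu)
    best = rank_nodes(node_to_cpu, vcpu_count, candidates)[0]
    cpumask = list(node_to_cpu[best])
    # As in the patch, assume equally sized nodes; max() is an added guard
    # against a zero divisor when there are fewer CPUs than nodes.
    cores_per_node = max(1, nr_cpus // nr_nodes)
    nodes_required = (vcpus_max + cores_per_node - 1) // cores_per_node
    if nodes_required > 1:
        others = [n for n in range(nr_nodes) if n != best]
        ranked = [n for n in rank_nodes(node_to_cpu, vcpu_count, others) if n != best]
        for n in ranked[:nodes_required - 1]:
            cpumask += node_to_cpu[n]
    return cpumask


node_to_cpu = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
vcpu_count = [4, 0, 2, 1]
print(numa_cpumask(node_to_cpu, vcpu_count, [0, 1, 2, 3], 16, 2))  # [4, 5, 6, 7]
print(numa_cpumask(node_to_cpu, vcpu_count, [0, 1, 2, 3], 16, 6))  # [4, 5, 6, 7, 12, 13, 14, 15]
```

The explicit `n != best` filter is also an addition of this sketch. In the patch the already-chosen node is pushed to the end of the second ranking by its `sys.maxint` load, so it appears it could only be picked a second time when a guest asks for more nodes than the machine has.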
End of thread (latest message 2009-11-30 15:40 UTC). Thread overview: 30+ messages: 2009-11-04 12:02 Xen 3.4.1 NUMA support Papagiannis Anastasios 2009-11-04 12:32 ` Keir Fraser 2009-11-06 18:07 ` Dan Magenheimer 2009-11-09 11:33 ` George Dunlap 2009-11-09 11:39 ` Dulloor 2009-11-09 12:29 ` George Dunlap 2009-11-09 12:51 ` Dulloor 2009-11-09 11:44 ` Juergen Gross 2009-11-09 12:07 ` George Dunlap 2009-11-09 12:40 ` Keir Fraser 2009-11-09 15:02 ` Andre Przywara 2009-11-09 15:06 ` George Dunlap 2009-11-09 22:51 ` Andre Przywara 2009-11-10 6:56 ` Dulloor 2009-11-10 7:49 ` Andre Przywara 2009-11-13 14:14 ` Andre Przywara 2009-11-13 14:29 ` Ian Pratt 2009-11-13 15:25 ` George Dunlap 2009-11-13 15:35 ` Ian Pratt 2009-11-13 15:27 ` Keir Fraser 2009-11-13 15:40 ` Ian Pratt 2009-11-13 16:02 ` Keir Fraser 2009-11-13 14:31 ` Keir Fraser 2009-11-13 15:38 ` Ian Pratt 2009-11-09 15:19 ` Jan Beulich 2009-11-10 1:46 ` Ian Pratt 2009-11-10 8:51 ` Jan Beulich 2009-11-10 8:57 ` Keir Fraser 2009-11-12 16:09 ` Keir Fraser 2009-11-30 15:40 ` [PATCH] tools: avoid over-commitment if numa=on Andre Przywara