From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Andre Przywara"
Subject: Re: [PATCH 0/4] [HVM] NUMA support in HVM guests
Date: Fri, 07 Sep 2007 14:49:07 +0200
Message-ID: <46E148C3.90908@amd.com>
References: <46C02BE0.2070400@amd.com> <51CFAB8CB6883745AE7B93B3E084EBE2010B74F5@pdsmsx412.ccr.corp.intel.com>
In-Reply-To: <51CFAB8CB6883745AE7B93B3E084EBE2010B74F5@pdsmsx412.ccr.corp.intel.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Xu, Anthony"
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Anthony,

thanks for looking into the patches, I appreciate your comments.

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }
>
> This always starts from node0, this may make node0 very busy, while
> other nodes may not have many work.

This is true. I ran into this before, but didn't want to delay sending
out the patches any longer. Actually, the "numanodes=n" config file
option shouldn't specify the number of nodes, but a list of specific
nodes to use, like "numanodes=0,2" to pin the domain to the first and
the third node.

> It may be nice to pin node from the lightest overhead node.

This sounds interesting. It shouldn't be that hard to do this in libxc,
but we should think about how to specify this behavior in the config
file (if we change the meaning of the option from a node count to a list
of specific nodes, as I described above).

> We also need to add some limitations for numanodes. The number of vcpus
> on vnode should not be larger than the number of pcpus on pnode.
> Otherwise vcpus belonging to a domain run on the same pcpu, which is
> not what we want.

Would be nice, but at the moment I would leave this to the sysadmin.

> In setup_numa_mem, each node has even memory size, if the memory
> allocation fails, the domain creation fails. This may be too "rude",
> I think we can support guest NUMA with each node has different memory
> size, even more, and maybe some node doesn't have memory. What we need
> guarantee is guest see physical topology.

Sounds reasonable. I will look into this.

> In your patch, when create NUMA guest, vnode is pinned to pnode. While
> after some creations and destroys domain operation, the workload on the
> platform may be very imbalanced, we need a method to dynamically
> balance workload.
> There are two methods IMO.
> 1. Implement NUMA-aware scheduler and page migration
> 2. Run a daemon in dom0, this daemon monitors workload, and use
>    live-migration to balance workload if necessary.

You are right, this may become a problem. I think the second solution is
easier to implement. A NUMA-aware scheduler would be nice, but my idea
was that the guest OS can schedule things better (more fine-grained, on
a per-process basis rather than a per-machine basis). Changing the
processing node without moving the memory along should be an exception
(as it changes the NUMA topology, and at the moment I don't see a way to
propagate this nicely to the (HVM) guest), so I think a kind of
"real-emergency balancer" which includes page migration (quite expensive
with bigger memory sizes!) would be more appropriate.

After all, my patches were more a basis for discussion than a final
solution, so I see there is more work to do. At the moment I am working
on including PV guests.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG
Registered office (business address): Wilschdorfer Landstr. 101, 01109 Dresden, Germany
Register court Dresden: HRA 4896
General partner authorized to represent: AMD Saxony LLC (registered office Wilmington, Delaware, USA)
Managing directors of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
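
For illustration, the node-list idea from the reply ("numanodes=0,2") could be combined with the proportional formula from the quoted patch roughly as follows. This is only a sketch under assumptions: `vcpu_to_node` and its parameters are hypothetical names, not part of libxc or the posted patches, and the resulting node would still have to be turned into a CPU mask and passed to `xc_vcpu_setaffinity` as the patch does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a libxc function): pick the physical node for
 * a vCPU from an explicit list of nodes, e.g. {0, 2} for "numanodes=0,2".
 * The slot is computed with the same proportional formula as the quoted
 * patch, so consecutive vCPUs are grouped onto the same node.
 */
int vcpu_to_node(int vcpu, int nr_vcpus, const int *node_list, size_t nr_nodes)
{
    assert(nr_vcpus > 0 && nr_nodes > 0);
    assert(vcpu >= 0 && vcpu < nr_vcpus);

    /* slot is always in the range 0 .. nr_nodes-1 because vcpu < nr_vcpus */
    size_t slot = ((size_t)vcpu * nr_nodes) / (size_t)nr_vcpus;

    return node_list[slot];
}
```

With four vCPUs and the node list {0, 2}, vCPUs 0 and 1 map to node 0 and vCPUs 2 and 3 map to node 2. The sketch does not check that a node has enough physical CPUs for its share of vCPUs, which the reply leaves to the sysadmin.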