From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Andre Przywara" <andre.przywara@amd.com>
Subject: Re: [PATCH 0/4] [HVM] NUMA support in HVM guests
Date: Fri, 07 Sep 2007 14:49:07 +0200
Message-ID: <46E148C3.90908@amd.com>
References: <46C02BE0.2070400@amd.com>
	<51CFAB8CB6883745AE7B93B3E084EBE2010B74F5@pdsmsx412.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain;
 charset=iso-8859-1;
 format=flowed
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <51CFAB8CB6883745AE7B93B3E084EBE2010B74F5@pdsmsx412.ccr.corp.intel.com>
List-Unsubscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Xu, Anthony" <anthony.xu@intel.com>
Cc: xen-devel@lists.xensource.com
List-Id: xen-devel@lists.xenproject.org

Anthony,

thanks for looking into the patches, I appreciate your comments.

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }
>
> This always starts from node0, which may make node0 very busy while other
> nodes may not have much work.
This is true. I ran into this before, but didn't want to delay sending out
the patches any longer. Actually, the "numanodes=n" config file option
shouldn't specify the number of nodes, but a list of specific nodes to use,
like "numanodes=0,2" to pin the domain to the first and the third node.
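
A minimal sketch of how such a node list could be handled, assuming a
comma-separated string as in "numanodes=0,2". The helper names
parse_node_list() and vcpu_to_node() are hypothetical and not part of libxc
or the patches; the mapping reuses the block-wise formula from the quoted
loop, but indexes into the list instead of counting up from node 0.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Parse a comma-separated node list such as "0,2" into nodes[].
 * Returns the number of nodes parsed, or -1 on malformed input
 * or if more than max_nodes entries are given.
 * (Hypothetical helper, not an existing libxc function.)
 */
static int parse_node_list(const char *s, int *nodes, int max_nodes)
{
    int count = 0;

    while (*s != '\0') {
        char *end;
        long val = strtol(s, &end, 10);

        if (end == s || val < 0 || count >= max_nodes)
            return -1;
        nodes[count++] = (int)val;
        if (*end == ',')
            end++;              /* skip the separator */
        else if (*end != '\0')
            return -1;          /* unexpected character */
        s = end;
    }
    return count;
}

/*
 * Map a vcpu to one of the listed nodes with the same block-wise
 * formula as the quoted patch: index = (vcpu * nr_nodes) / nr_vcpus.
 */
static int vcpu_to_node(int vcpu, int nr_vcpus, const int *nodes, int nr_nodes)
{
    assert(nr_vcpus > 0 && nr_nodes > 0);
    assert(vcpu >= 0 && vcpu < nr_vcpus);
    return nodes[(vcpu * nr_nodes) / nr_vcpus];
}
```

With "0,2" and four vcpus, vcpus 0 and 1 would land on node 0 and vcpus 2
and 3 on node 2; the result would then select the mask passed to
xc_vcpu_setaffinity() as in the quoted loop.
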
> It may be nice to pin node from the lightest overhead node.
This sounds interesting. It shouldn't be that hard to do this in libxc,
but we should think about semantics for specifying this behavior in the
config file (if we change the option from a node count to a list of specific
nodes, as I described above).
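
One way the "least loaded node first" selection could look, as a sketch only.
The load[] metric is an assumption (for example, the number of vcpus already
pinned to each node); how libxc would obtain it is left open, and
pick_lightest_nodes() is a hypothetical name.

```c
#include <assert.h>

/*
 * Select the `want` least-loaded nodes out of nr_nodes.
 * load[n] is a hypothetical per-node metric (lower means less busy).
 * Chosen node ids are written to out[] in order of increasing load;
 * ties go to the lower node id.
 */
static void pick_lightest_nodes(const unsigned *load, int nr_nodes,
                                int *out, int want)
{
    assert(want >= 0 && want <= nr_nodes && nr_nodes <= 64);
    unsigned long long taken = 0;   /* bitmask of already chosen nodes */

    for (int k = 0; k < want; k++) {
        int best = -1;

        for (int n = 0; n < nr_nodes; n++) {
            if (taken & (1ULL << n))
                continue;           /* node was picked in an earlier round */
            if (best < 0 || load[n] < load[best])
                best = n;
        }
        taken |= 1ULL << best;
        out[k] = best;
    }
}
```

The resulting out[] array could take the place of an explicit node list when
the config file asks for a number of nodes rather than specific ones.
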
> We also need to add some limitations for numanodes. The number of vcpus
> on vnode should not be larger than the number of pcpus on pnode. Otherwise
> vcpus belonging to a domain run on the same pcpu, which is not what we want.
That would be nice, but at the moment I would leave this to the sysadmin.
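
If such a limit were enforced in the tools, the check could be as small as
the following sketch. It assumes the block-wise vcpu distribution from the
quoted loop and a caller-supplied pcpus_per_node[] array; placement_fits() is
a hypothetical name.

```c
#include <assert.h>

/*
 * Return 1 if no node index receives more vcpus than pcpus_per_node[idx],
 * using the block-wise distribution index = (vcpu * nr_nodes) / nr_vcpus
 * from the quoted patch; return 0 otherwise.
 */
static int placement_fits(int nr_vcpus, int nr_nodes, const int *pcpus_per_node)
{
    assert(nr_vcpus > 0 && nr_nodes > 0);

    for (int idx = 0; idx < nr_nodes; idx++) {
        int assigned = 0;

        /* Count the vcpus that the formula maps to this node index. */
        for (int v = 0; v < nr_vcpus; v++)
            if ((v * nr_nodes) / nr_vcpus == idx)
                assigned++;
        if (assigned > pcpus_per_node[idx])
            return 0;
    }
    return 1;
}
```

For two nodes with two pcpus each, four vcpus fit, while five vcpus would put
three on the first node and fail the check.
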
> In setup_numa_mem, each node has even memory size, if the memory
> allocation fails, the domain creation fails. This may be too "rude", I
> think we can support guest NUMA with each node has different memory size,
> even more, and maybe some node doesn't have memory. What we need guarantee
> is guest see physical topology.
Sounds reasonable. I will look into this.
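
A sketch of what an uneven split could look like, under the assumption that
the free memory per node is known up front. split_guest_memory() and its
parameters are hypothetical and not taken from setup_numa_mem: each node
first gets up to an even share, and whatever does not fit spills over to
nodes that still have room.

```c
#include <assert.h>

/*
 * Split total_mb across nr_nodes. Each node first gets up to an even share,
 * capped by free_mb[n]; any shortfall is then placed on nodes that still
 * have room. Returns 0 on success, -1 if the nodes cannot hold total_mb
 * (alloc_mb[] then holds a partial, unusable split).
 */
static int split_guest_memory(unsigned long total_mb,
                              const unsigned long *free_mb,
                              unsigned long *alloc_mb, int nr_nodes)
{
    assert(nr_nodes > 0);
    unsigned long share = total_mb / nr_nodes;
    unsigned long left = total_mb;

    /* First pass: even share, limited by what each node has free. */
    for (int n = 0; n < nr_nodes; n++) {
        alloc_mb[n] = share < free_mb[n] ? share : free_mb[n];
        left -= alloc_mb[n];
    }
    /* Second pass: place the remainder wherever room is left. */
    for (int n = 0; n < nr_nodes && left > 0; n++) {
        unsigned long room = free_mb[n] - alloc_mb[n];
        unsigned long extra = left < room ? left : room;

        alloc_mb[n] += extra;
        left -= extra;
    }
    return left == 0 ? 0 : -1;
}
```

A 4096 MB guest on two nodes with 1024 MB and 4096 MB free would end up with
1024 MB and 3072 MB instead of failing domain creation.
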
> In your patch, when create NUMA guest, vnode is pinned to pnode. While
> after some creations and destroys domain operation, the workload on the
> platform may be very imbalanced, we need a method to dynamically balance
> workload.
> There are two methods IMO.
> 1. Implement NUMA-aware scheduler and page migration
> 2. Run a daemon in dom0, this daemon monitors workload, and use
>    live-migration to balance workload if necessary.
You are right, this may become a problem. I think the second solution is
easier to implement. A NUMA-aware scheduler would be nice, but my idea was
that the guest OS can schedule things better (more fine-grained, on a
per-process basis rather than on a per-machine basis). Changing the
processing node without moving the memory along should be an exception (as it
changes the NUMA topology, and at the moment I don't see a way to propagate
this nicely to the (HVM) guest), so I think a kind of "real-emergency
balancer" which includes page migration (quite expensive with bigger memory
sizes!) would be more appropriate.

After all, my patches were meant more as a basis for discussion than as a
final solution, so I see there is more work to do. At the moment I am working
on including PV guests.

Regards,
Andre.


-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
----to satisfy European Law for business letters:
AMD Saxony Limited Liability Company & Co. KG
Sitz (Geschäftsanschrift): Wilschdorfer Landstr. 101, 01109 Dresden,
Deutschland
Registergericht Dresden: HRA 4896
vertretungsberechtigter Komplementär: AMD Saxony LLC (Sitz Wilmington,
Delaware, USA)
Geschäftsführer der AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy