From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andre Przywara
Subject: Re: Host Numa informtion in dom0
Date: Mon, 1 Feb 2010 22:39:36 +0100
Message-ID: <4B674A18.8010106@amd.com>
References: <8EA2C2C4116BF44AB370468FBF85A7770123904A29@orsmsx504.amr.corp.intel.com> <4B66AB88.6090208@amd.com> <940bcfd21002010953y74e43db7h838f5021207bfa8f@mail.gmail.com>
In-Reply-To: <940bcfd21002010953y74e43db7h838f5021207bfa8f@mail.gmail.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Dulloor
Cc: "xen-devel@lists.xensource.com", Keir Fraser
List-Id: xen-devel@lists.xenproject.org

Dulloor wrote:
>> Beside that I have to oppose the introduction of sockets_per_node again.
>> Future AMD processors will feature _two_ nodes on _one_ socket, so this
>> variable should hold 1/2, but this will be rounded to zero. I think this
>> information is pretty useless anyway, as the number of sockets is mostly
>> interesting for licensing purposes, where a single number is sufficient.
>
> I sent a similar patch (was using to enlist pcpu-tuples and in
> vcpu-pin/unpin) and I didn't pursue it because of this same argument.
> When we talk of cpu topology, that's how it is currently :
> nodes-socket-cpu-core. Don't sockets also figure in the cache and
> interconnect hierarchy ?

Not necessarily. Think of Intel's Core2Quad: it has two separate L2
caches, each associated with two of the four cores in one socket. If you
move from core0 to core2 then AFAIK the cost would be very similar to
moving to another processor socket. So in fact the term socket does not
help here.
The situation is similar with the new AMD CPUs, just that it replaces "L2
cache" with "node" (aka shared memory controller, which also matches
shared L3 cache).
In fact the cost of moving from one node to the
neighbor in the same socket is exactly the same as moving to another
socket.

> What would be the hierarchy in those future AMD processors ? Even Keir
> and Ian Pratt initially wanted the pcpu-tuples
> to be listed that way. So, it would be helpful to make a call and move ahead.

You could create variables like cores_per_socket and cores_per_node;
this would solve this issue for now. Better still would be an array
mapping cores (or threads) to {nodes,sockets,L[123]_caches}, as this
would allow asymmetrical configurations (useful for guests).
In the past there was a sockets_per_node value in physinfo, but it
has been removed. It was not used anywhere, and multiplying the whole
chain of x_per_y values sometimes produced wrong results anyway.

Anyway, if you insist on this value, it will hold bogus values for the
upcoming processors. If it is zero, you end up in trouble when
multiplying or dividing by it, and letting it be one is also wrong.

I am sorry to spoil this whole game, but that's how it is.
If you or Nitin show me how the sockets_per_node variable should be used,
we can maybe find a pleasing solution.

Regards,
Andre.

>
> On Mon, Feb 1, 2010 at 5:23 AM, Andre Przywara wrote:
>> Kamble, Nitin A wrote:
>>> Hi Keir,
>>>
>>> Attached is the patch which exposes the host numa information to dom0.
>>> With the patch “xm info” command now also gives the cpu topology & host numa
>>> information. This will be later used to build guest numa support.
>> What information are you missing from the current physinfo? As far as I can
>> see, only the total amount of memory per node is not provided. But one could
>> get this info from parsing the SRAT table in Dom0, which is at least mapped
>> into Dom0's memory.
>> Or do you want to provide NUMA information to all PV guests (but then it
>> cannot be a sysctl)?
>> This would be helpful, as this would avoid to enable
>> ACPI parsing in PV Linux for NUMA guest support.
>>
>> Beside that I have to oppose the introduction of sockets_per_node again.
>> Future AMD processors will feature _two_ nodes on _one_ socket, so this
>> variable should hold 1/2, but this will be rounded to zero. I think this
>> information is pretty useless anyway, as the number of sockets is mostly
>> interesting for licensing purposes, where a single number is sufficient.
>> For scheduling purposes cache topology is more important.
>>
>> My NUMA guest patches (currently for HVM only) are doing fine, I will try to
>> send out a RFC patches this week. I think they don't interfere with this
>> patch, but if you have other patches in development, we should sync on this.
>> The scope of my patches is to let the user (or xend) describe a guest's
>> topology (either by specifying only the number of guest nodes in the config
>> file or by explicitly describing the whole NUMA topology). Some code will
>> assign host nodes to the guest nodes (I am not sure yet whether this really
>> belongs into xend as it currently does, or is better done in libxc, where
>> libxenlight would also benefit).
>> Then libxc's hvm_build_* will pass that info into the hvm_info_table, where
>> code in the hvmloader will generate an appropriate SRAT table.
>> An extension of this would be to let Xen automatically decide whether a
>> split of the resources is necessary (because there is not enough memory
>> available (anymore) on one node).
>>
>> Looking forward to comments...
>>
>> Regards,
>> Andre.
>>
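
To illustrate the array mapping cores to {nodes,sockets,L[123]_caches}
suggested in the reply above, here is a rough sketch in Python. It is not
taken from any patch in this thread: the CoreInfo record, its field names
and the helper functions are hypothetical, and the layout (one socket
holding two nodes of two cores each) is made up for the example. It also
shows why an integer sockets_per_node ends up as zero on such a machine.

```python
from collections import namedtuple

# Hypothetical per-core record; the field names are illustrative only
# and do not correspond to any existing Xen structure.
CoreInfo = namedtuple("CoreInfo", ["core", "node", "socket", "l2_cache", "l3_cache"])


def build_example_topology():
    """Made-up layout: one socket holding two nodes of two cores each."""
    return [
        CoreInfo(core=0, node=0, socket=0, l2_cache=0, l3_cache=0),
        CoreInfo(core=1, node=0, socket=0, l2_cache=1, l3_cache=0),
        CoreInfo(core=2, node=1, socket=0, l2_cache=2, l3_cache=1),
        CoreInfo(core=3, node=1, socket=0, l2_cache=3, l3_cache=1),
    ]


def count_distinct(topology, field):
    """Number of distinct nodes, sockets or caches present in the map."""
    return len({getattr(entry, field) for entry in topology})


def cores_sharing(topology, field, core):
    """All cores that share the given resource with 'core'."""
    by_core = {entry.core: entry for entry in topology}
    value = getattr(by_core[core], field)
    return sorted(e.core for e in topology if getattr(e, field) == value)


topology = build_example_topology()
nr_sockets = count_distinct(topology, "socket")
nr_nodes = count_distinct(topology, "node")

# The ratio argued against in this thread: with two nodes on one socket,
# integer division truncates 1/2 to 0.
sockets_per_node = nr_sockets // nr_nodes
```

With a map like this, a question such as "which cores share a node with
core 2" is answered by cores_sharing(topology, "node", 2) directly from the
per-core entries, without multiplying a chain of x_per_y values, and
asymmetrical layouts need no special handling.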