From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andre Przywara <andre.przywara@amd.com>
Subject: Re: Host Numa information in dom0
Date: Mon, 1 Feb 2010 22:39:36 +0100
Message-ID: <4B674A18.8010106@amd.com>
References: <8EA2C2C4116BF44AB370468FBF85A7770123904A29@orsmsx504.amr.corp.intel.com>	
	<4B66AB88.6090208@amd.com>
	<940bcfd21002010953y74e43db7h838f5021207bfa8f@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <940bcfd21002010953y74e43db7h838f5021207bfa8f@mail.gmail.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Dulloor <dulloor@gmail.com>
Cc: "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Keir Fraser <keir.fraser@eu.citrix.com>
List-Id: xen-devel@lists.xenproject.org

Dulloor wrote:
>> Beside that I have to oppose the introduction of sockets_per_node again.
>> Future AMD processors will feature _two_ nodes on _one_ socket, so this
>> variable should hold 1/2, but this will be rounded to zero. I think this
>> information is pretty useless anyway, as the number of sockets is mostly
>> interesting for licensing purposes, where a single number is sufficient.
>
> I sent a similar patch (was using to enlist pcpu-tuples and in
> vcpu-pin/unpin) and I didn't pursue it because of this same argument.
> When we talk of cpu topology, that's how it is currently :
> nodes-socket-cpu-core. Don't sockets also figure in the cache and
> interconnect hierarchy ?
Not necessarily. Think of Intel's Core2Quad: it has two separate L2
caches, each shared by two of the four cores in one socket. If you
move from core0 to core2, then AFAIK the cost is very similar to
moving to another processor socket. So the term "socket" does not
help here.
The situation is similar on the new AMD CPUs, just replace "L2
cache" with "node" (i.e. a shared memory controller, which also matches
the shared L3 cache). In fact, the cost of moving from one node to its
neighbor in the same socket is exactly the same as moving to another
socket.
> What would be the hierarchy in those future AMD processors ? Even Keir
> and Ian Pratt initially wanted the pcpu-tuples
> to be listed that way. So, it would be helpful to make a call and move ahead.
You could create variables like cores_per_socket and cores_per_node;
this would solve the issue for now. Better still would be an array
mapping cores (or threads) to {nodes,sockets,L[123]_caches}, as this
would allow asymmetrical configurations (useful for guests).
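To illustrate the array idea, here is a minimal sketch in Python. It is
not Xen code: the CpuTopo record, its field names and the example layout
are made up for illustration. Each core carries its own node, socket and
cache IDs, so nothing has to be derived by multiplying x_per_y values:

```python
from collections import defaultdict
from typing import NamedTuple


class CpuTopo(NamedTuple):
    """Hypothetical per-core record; the field names are illustrative only."""
    node: int
    socket: int
    l2: int


# Assumed example layout: one socket holding two nodes, with four cores on
# node 0 and two cores on node 1 (asymmetric on purpose). Pairs of cores
# share an L2 cache. The list index is the core number.
topo = [
    CpuTopo(node=0, socket=0, l2=0),
    CpuTopo(node=0, socket=0, l2=0),
    CpuTopo(node=0, socket=0, l2=1),
    CpuTopo(node=0, socket=0, l2=1),
    CpuTopo(node=1, socket=0, l2=2),
    CpuTopo(node=1, socket=0, l2=2),
]


def group_by(topo, field):
    """Return {id: [core numbers]} for one level, e.g. 'node' or 'socket'."""
    groups = defaultdict(list)
    for cpu, entry in enumerate(topo):
        groups[getattr(entry, field)].append(cpu)
    return dict(groups)


def same_level(topo, a, b, field):
    """True if cores a and b share the given level (node, socket or l2)."""
    return getattr(topo[a], field) == getattr(topo[b], field)
```

With such a table a caller can ask directly whether two cores share a
node or a cache, and the per-node core counts are free to differ.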
There once was a sockets_per_node value in physinfo, but it
has been removed. It was not used anywhere, and multiplying the whole
chain of x_per_y values sometimes produced wrong results anyway.
If you insist on this value, it will hold bogus values on the
upcoming processors: if it is zero, you run into trouble when
multiplying or dividing by it, and setting it to one is also wrong.
I am sorry to spoil this whole game, but that's how it is.
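
The arithmetic behind this can be shown with a small Python sketch. The
machine below (2 sockets, 2 nodes per socket, 6 cores per node) is an
assumed example, and the variable names only mirror the x_per_y style
discussed here; they are not taken from the current physinfo:

```python
# Assumed example machine: 2 sockets, each holding 2 nodes of 6 cores.
nr_sockets = 2
nr_nodes = 4
cores_per_node = 6
nr_cores = nr_nodes * cores_per_node        # 24 cores actually present

# Integer division, as with a C integer field: 1/2 truncates to zero.
sockets_per_node = nr_sockets // nr_nodes
cores_per_socket = nr_cores // nr_sockets   # 12

# Multiplying the whole chain with the truncated value loses every core.
chain_total = nr_nodes * sockets_per_node * cores_per_socket

# Forcing the value to one instead overcounts by a factor of two.
chain_total_forced_one = nr_nodes * 1 * cores_per_socket

# A chain that avoids sockets_per_node gives the right answer.
direct_total = nr_nodes * cores_per_node

print(sockets_per_node, chain_total, chain_total_forced_one, direct_total)
```

Neither zero nor one reproduces the real core count, while a chain
built from cores_per_node does.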

If you or Nitin can show me how the sockets_per_node variable is meant to
be used, maybe we can find a solution that works for everyone.

Regards,
Andre.
>
> On Mon, Feb 1, 2010 at 5:23 AM, Andre Przywara <andre.przywara@amd.com> wrote:
>> Kamble, Nitin A wrote:
>>> Hi Keir,
>>>
>>>   Attached is the patch which exposes the host numa information to dom0.
>>> With the patch “xm info” command now also gives the cpu topology & host numa
>>> information. This will be later used to build guest numa support.
>> What information are you missing from the current physinfo? As far as I can
>> see, only the total amount of memory per node is not provided. But one could
>> get this info from parsing the SRAT table in Dom0, which is at least mapped
>> into Dom0's memory.
>> Or do you want to provide NUMA information to all PV guests (but then it
>> cannot be a sysctl)? This would be helpful, as this would avoid to enable
>> ACPI parsing in PV Linux for NUMA guest support.
>>
>> Beside that I have to oppose the introduction of sockets_per_node again.
>> Future AMD processors will feature _two_ nodes on _one_ socket, so this
>> variable should hold 1/2, but this will be rounded to zero. I think this
>> information is pretty useless anyway, as the number of sockets is mostly
>> interesting for licensing purposes, where a single number is sufficient.
>>  For scheduling purposes cache topology is more important.
>>
>> My NUMA guest patches (currently for HVM only) are doing fine, I will try to
>> send out a RFC patches this week. I think they don't interfere with this
>> patch, but if you have other patches in development, we should sync on this.
>> The scope of my patches is to let the user (or xend) describe a guest's
>> topology (either by specifying only the number of guest nodes in the config
>> file or by explicitly describing the whole NUMA topology). Some code will
>> assign host nodes to the guest nodes (I am not sure yet whether this really
>> belongs into xend as it currently does, or is better done in libxc, where
>> libxenlight would also benefit).
>> Then libxc's hvm_build_* will pass that info into the hvm_info_table, where
>> code in the hvmloader will generate an appropriate SRAT table.
>> An extension of this would be to let Xen automatically decide whether a
>> split of the resources is necessary (because there is not enough memory
>> available (anymore) on one node).
>>
>> Looking forward to comments...
>>
>> Regards,
>> Andre.
>>