From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1394697892.23859.21.camel@nilsson.home.kraxel.org>
From: Gerd Hoffmann
Date: Thu, 13 Mar 2014 09:04:52 +0100
In-Reply-To: <20140312215528.GF17184@ERROL.INI.CMU.EDU>
References: <20140312215528.GF17184@ERROL.INI.CMU.EDU>
Subject: Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
To: "Gabriel L. Somlo"
Cc: agraf@suse.de, qemu-devel@nongnu.org, armbru@redhat.com,
 alex.williamson@redhat.com, kevin@koconnor.net, lersek@redhat.com

> ----------------------------------------------------------------------------
> |                              Type16  0x1000                              |
> ----------------------------------------------------------------------------
>  ^            ^               ^           ^                    ^          ^
>  |            |               |           |                    |          |
>  |        ----+---        ----+----   ----+----       ---------+--------  |
>  |       | Type17 |      | Type17  | | Type17  |     |      Type17      | |
>  |       | 0..16G |      | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G | |
>  |       | 0x1100 |      | 0x1101  | | 0x1102  |     |      0x110       | |
>  |        --------        ---------   ---------       ------------------  |
>  |         ^   ^              ^           ^                    ^          |
>  |         |   |              |           |                    |          |
>  |      +--+   +--+           |           |                    |          |
>  |      |         |           |           |                    |          |
>  |  ----+---   ---+----   ----+----   ----+----       ---------+--------  |
>  | | Type20 | | Type20 | | Type20  | | Type20  |     |      Type20      | |
>  | | 0..4G  | | 4..16G | | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G | |
>  | | 0x1400 | | 0x1401 | | 0x1402  | | 0x1403  |     |      0x140       | |
>  |  ----+---   ---+----   ----+----   ----+----       ---------+--------  |
>  |      |         |           |           |                    |          |
>  |      |         |           +-------+   |   +----------------+          |
>  |      |         +----------------+  |   |   |                           |
>  |      |                          |  |   |   |                           |
>  |      v                          v  v   v   v                           |
>  |  --------                      --------------                          |
>  | | Type19 |                    |    Type19    |                         |
>  | | 0..4G  |                    | 4G..ram_size |                         |
>  | | 0x1300 |                    |    0x1301    |                         |
>  |  ----+---                      ------+-------                          |
>  |      |                               |                                 |
>  +------+                               +---------------------------------+

Very nice.

> Here are some of the limit values, and some questions and thoughts:
>
> - Type16 max == 2T - 1K;
>
>   Should we just assert((ram_size >> 10) < 0x80000000), and officially
>   limit guests to < 2T ?

No.  Not fully sure what reasonable behavior would be in case more than
2T are present.  I guess either not generating type16 entries at all or
simply filling in the maximum value we can represent.

> - Type17 max == 32G - 1M;
>
>   This explains why we create Type17 device tables in increments of 16G,
>   since that's the largest possible value that's a nice, round power of
>   two :)

Yes.

> - Type19 & Type20 max == 4T - 1K;
>
>   If we limit ourselves to what Type16 can currently represent (2T),
>   this should be plenty enough to work with...

And there is the option to simply create multiple type19+20 entries to
cover more, I think.

> So, currently, we split available ram into blobs of up to 16G each,
> and assign each blob a Type17 node.
>
> We then split available ram into <4G and 4G+, and create up to two
> Type19 nodes for these two areas.

Yes.

> Now, re. e820: currently, the expectation is that the (up to) two
> Type19 nodes in the above figure correspond to (up to) two entries of
> type E820_RAM in the e820 table.

Yes.  If more e820 ram entries show up one day, additional type19 nodes
should be generated (i.e. basically simply loop over the e820 table).
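
A minimal sketch of that loop over the e820 table, emitting one type19
range per E820_RAM entry.  The struct layouts and build_type19_ranges()
are hypothetical and simplified for illustration; they are not QEMU's
actual SMBIOS code.  The 0x1300 handle base is taken from the figure, and
addresses are stored in KB in 32-bit fields, which is where the 4T - 1K
limit quoted above comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1              /* e820 type code for usable RAM */

/* Hypothetical, simplified e820 entry (not QEMU's own struct). */
struct e820_entry {
    uint64_t address;           /* start of the region, in bytes */
    uint64_t length;            /* size of the region, in bytes */
    uint32_t type;
};

/* Hypothetical type19 range: 32-bit start/end addresses in KB. */
struct type19_range {
    uint16_t handle;
    uint32_t start_kb;
    uint32_t end_kb;            /* inclusive */
};

/*
 * Walk the e820 map and fill one type19 range per E820_RAM entry.
 * Assumes every RAM region ends below 4T, so the KB values fit in
 * 32 bits.  Returns the number of ranges written to out[].
 */
static size_t build_type19_ranges(const struct e820_entry *map, size_t n,
                                  struct type19_range *out, size_t max_out)
{
    size_t count = 0;

    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < n && count < max_out; i++) {
        if (map[i].type != E820_RAM || map[i].length == 0) {
            continue;           /* skip reserved, ACPI, empty entries */
        }
        uint64_t last = map[i].address + map[i].length - 1;

        out[count].handle = (uint16_t)(0x1300 + count);
        out[count].start_kb = (uint32_t)(map[i].address >> 10);
        out[count].end_kb = (uint32_t)(last >> 10);
        count++;
    }
    return count;
}
```

With a map holding RAM at 0..3G, a reserved page, and RAM at 4G..9G, this
yields two ranges (handles 0x1300 and 0x1301); a third E820_RAM entry
would simply produce a third range.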

> Then, a type20 node is assigned to the sub-4G portion of the first
> Type17 "device", and another type20 node is assigned to the over-4G
> portion of the same.
>
> From then on, type20 nodes correspond to the rest of the 16G-or-less
> type17 devices pretty much on a 1:1 basis.

Hmm, not sure why type20 entries are handled the way they are.  I think
it would make more sense to have one type20 entry per e820 ram entry,
similar to type19.

> If the e820 table will contain more than just two E820_RAM entries,
> and therefore we'll have more than the two Type19 nodes on the bottom
> row, what are the rules for extending the rest of the figure
> accordingly (i.e. how do we hook together more Type17 and Type20 nodes
> to go along with the extra Type19 nodes) ?

See above for type19+20.  type17 represents the dimms, so where the
memory is actually mapped doesn't matter there.  Let's simply sum up all
memory, then split it into 16G pieces and create a type17 entry for each
piece.  At least initially.

As a further improvement we could make the dimm size configurable, so if
you have a 4 node numa machine with 4G ram on each node you can present
4 virtual 4G ram dimms to the guest instead of a single 16G dimm.  But
that is clearly beyond the scope of the initial revision ...

cheers,
  Gerd
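
The type17 splitting described above (sum up all memory, cut it into
fixed-size pieces, one type17 entry per piece) could be sketched roughly
as follows.  type17_count() and type17_size() are hypothetical helper
names, not existing QEMU functions, and the dimm size is passed in as a
parameter only to illustrate the "configurable" variant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/*
 * Hypothetical helper: number of type17 entries needed to cover
 * ram_size bytes with virtual dimms of dimm_size bytes each
 * (the last dimm may be smaller).
 */
static size_t type17_count(uint64_t ram_size, uint64_t dimm_size)
{
    assert(dimm_size > 0);
    return (size_t)((ram_size + dimm_size - 1) / dimm_size);
}

/* Hypothetical helper: size in bytes of the i-th virtual dimm. */
static uint64_t type17_size(uint64_t ram_size, uint64_t dimm_size, size_t i)
{
    uint64_t start = (uint64_t)i * dimm_size;

    assert(dimm_size > 0 && start < ram_size);
    uint64_t rest = ram_size - start;

    return rest < dimm_size ? rest : dimm_size;
}
```

For example, 40G of ram with 16G dimms gives three entries of 16G, 16G
and 8G, while the 4 node numa case (16G total, 4G dimm size) gives four
4G entries.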