[Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)

* [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
@ 2014-03-12 21:55 Gabriel L. Somlo
  2014-03-13  8:04 ` Gerd Hoffmann
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel L. Somlo @ 2014-03-12 21:55 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: agraf, qemu-devel, armbru, alex.williamson, kevin, lersek

On Wed, Mar 12, 2014 at 02:24:54PM +0100, Gerd Hoffmann wrote:
> On Mi, 2014-03-12 at 09:05 -0400, Gabriel L. Somlo wrote:
> > On Wed, Mar 12, 2014 at 09:27:18AM +0100, Gerd Hoffmann wrote:
> > > I think we should just use e820_table (see pc.c) here.  Loop over it and
> > > add a type 19 table for each ram region in there.
> > 
> > I'm assuming this should be another post-Seabios-compatibility patch,
> > at the end of the series, and I should still do the (start,size)
> > arithmetic cut'n'pasted from SeaBIOS first, right ?
> 
> You should get identical results with both methods.  It's just that the
> e820 method is more future proof, i.e. if the numa people add support
> for non-contiguous memory some day we don't have to adapt the smbios
> code to handle it.

So I spent some time reverse-engineering the way Type 16..20 (memory)
smbios tables are built in SeaBIOS, and therefore in the QEMU smbios
patch set currently under revision... And I came up with the following
picture (caution: ascii art, fixed-width font strongly recommended):

 ----------------------------------------------------------------------------
|                               Type16  0x1000                               |
 ----------------------------------------------------------------------------
 ^             ^               ^           ^                    ^           ^
 |             |               |           |                    |           |
 |         ----+---        ----+----   ----+----       ---------+--------   |
 |        | Type17 |      | Type17  | | Type17  |     | Type17           |  |
 |        | 0..16G |      | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G |  |
 |        | 0x1100 |      | 0x1101  | | 0x1102  |     | 0x110<N>         |  |
 |         --------        ---------   ---------       ------------------   |
 |          ^   ^              ^           ^                    ^           |
 |          |   |              |           |                    |           |
 |       +--+   +--+           |           |                    |           |
 |       |         |           |           |                    |           |
 |   ----+---   ---+----   ----+----   ----+----       ---------+--------   |
 |  | Type20 | | Type20 | | Type20  | | Type20  |     | Type20           |  |
 |  | 0..4G  | | 4..16G | | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G |  |
 |  | 0x1400 | | 0x1401 | | 0x1402  | | 0x1403  |     | 0x140<N+1>       |  |
 |   ----+---   ---+----   ----+----   ----+----       ---------+--------   |
 |       |         |           |           |                    |           |
 |       |         |           +-------+   |   +----------------+           |
 |       |         +----------------+  |   |   |                            |
 |       |                          |  |   |   |                            |
 |       v                          v  v   v   v                            |
 |   --------                      --------------                           |
 |  | Type19 |                    | Type19       |                          |
 |  | 0..4G  |                    | 4G..ram_size |                          |
 |  | 0x1300 |                    | 0x1301       |                          |
 |   ----+---                      ------+-------                           |
 |       |                               |                                  |
 +-------+                               +----------------------------------+

Here are some of the limit values, and some questions and thoughts:

- Type16 max == 2T - 1K;

Should we just assert((ram_size >> 10) < 0x80000000), and officially
limit guests to < 2T ?

- Type17 max == 32G - 1M;

This explains why we create Type17 device tables in increments of 16G,
since that's the largest possible value that's a nice, round power of
two :)

- Type19 & Type20 max == 4T - 1K;

If we limit ourselves to what Type16 can currently represent (2T),
this should be plenty enough to work with...

So, currently, we split available ram into blobs of up to 16G each,
and assign each blob a Type17 node.
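
As a rough illustration of the 16G split described above, here is a minimal
sketch. The names t17_count and t17_size_mb are made up for this example and
are not SeaBIOS or QEMU functions; it assumes the Type17 size is reported in
MB and that only the last device may be smaller than 16G.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1ULL << 20)
#define GiB (1ULL << 30)
#define T17_DEV_SIZE (16 * GiB)   /* per-device chunk size used in this thread */

/* Number of Type17 devices needed to cover ram_size bytes (hypothetical helper). */
unsigned t17_count(uint64_t ram_size)
{
    return (unsigned)((ram_size + T17_DEV_SIZE - 1) / T17_DEV_SIZE);
}

/* Size in MB of device idx; only the last device may be smaller than 16G. */
uint16_t t17_size_mb(uint64_t ram_size, unsigned idx)
{
    unsigned n = t17_count(ram_size);
    uint64_t bytes;

    assert(idx < n);
    bytes = (idx == n - 1) ? ram_size - (uint64_t)idx * T17_DEV_SIZE
                           : T17_DEV_SIZE;
    /* 16G is 0x4000 MB, comfortably below the 32G - 1M ceiling noted above */
    return (uint16_t)(bytes / MiB);
}
```

For a 40G guest this would give three devices of 16G, 16G, and 8G.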

We then split available ram into <4G and 4G+, and create up to two
Type19 nodes for these two areas.

Now, re. e820: currently, the expectation is that the (up to) two
Type19 nodes in the above figure correspond to (up to) two entries of
type E820_RAM in the e820 table.


Then, a type20 node is assigned to the sub-4G portion of the first
Type17 "device", and another type20 node is assigned to the over-4G
portion of the same.

From then on, type20 nodes correspond to the rest of the 16G-or-less
type17 devices pretty much on a 1:1 basis.


If the e820 table will contain more than just two E820_RAM entries,
and therefore we'll have more than the two Type19 nodes on the bottom
row, what are the rules for extending the rest of the figure
accordingly (i.e. how do we hook together more Type17 and Type20 nodes
to go along with the extra Type19 nodes) ?

Thanks much,
--Gabriel

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-12 21:55 [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables) Gabriel L. Somlo
@ 2014-03-13  8:04 ` Gerd Hoffmann
  2014-03-13 14:37   ` Gabriel L. Somlo
  0 siblings, 1 reply; 9+ messages in thread
From: Gerd Hoffmann @ 2014-03-13  8:04 UTC (permalink / raw)
  To: Gabriel L. Somlo
  Cc: agraf, qemu-devel, armbru, alex.williamson, kevin, lersek


>  ----------------------------------------------------------------------------
> |                               Type16  0x1000                               |
>  ----------------------------------------------------------------------------
>  ^             ^               ^           ^                    ^           ^
>  |             |               |           |                    |           |
>  |         ----+---        ----+----   ----+----       ---------+--------   |
>  |        | Type17 |      | Type17  | | Type17  |     | Type17           |  |
>  |        | 0..16G |      | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G |  |
>  |        | 0x1100 |      | 0x1101  | | 0x1102  |     | 0x110<N>         |  |
>  |         --------        ---------   ---------       ------------------   |
>  |          ^   ^              ^           ^                    ^           |
>  |          |   |              |           |                    |           |
>  |       +--+   +--+           |           |                    |           |
>  |       |         |           |           |                    |           |
>  |   ----+---   ---+----   ----+----   ----+----       ---------+--------   |
>  |  | Type20 | | Type20 | | Type20  | | Type20  |     | Type20           |  |
>  |  | 0..4G  | | 4..16G | | 16..32G | | 32..48G | ... | N*16G..(N+1)*16G |  |
>  |  | 0x1400 | | 0x1401 | | 0x1402  | | 0x1403  |     | 0x140<N+1>       |  |
>  |   ----+---   ---+----   ----+----   ----+----       ---------+--------   |
>  |       |         |           |           |                    |           |
>  |       |         |           +-------+   |   +----------------+           |
>  |       |         +----------------+  |   |   |                            |
>  |       |                          |  |   |   |                            |
>  |       v                          v  v   v   v                            |
>  |   --------                      --------------                           |
>  |  | Type19 |                    | Type19       |                          |
>  |  | 0..4G  |                    | 4G..ram_size |                          |
>  |  | 0x1300 |                    | 0x1301       |                          |
>  |   ----+---                      ------+-------                           |
>  |       |                               |                                  |
>  +-------+                               +----------------------------------+

Very nice.

> Here are some of the limit values, and some questions and thoughts:
> 
> - Type16 max == 2T - 1K;
> 
> Should we just assert((ram_size >> 10) < 0x80000000), and officially
> limit guests to < 2T ?

No.  Not fully sure what reasonable behavior would be in case more than
2T are present.  I guess either not generating type16 entries at all or
simply filling in the maximum value we can represent.

> - Type17 max == 32G - 1M;
> 
> This explains why we create Type17 device tables in increments of 16G,
> since that's the largest possible value that's a nice, round power of
> two :)

Yes.

> - Type19 & Type20 max == 4T - 1K;
> 
> If we limit ourselves to what Type16 can currently represent (2T),
> this should be plenty enough to work with...

And there is the option to simply create multiple type19+20 entries to
cover more I think.

> So, currently, we split available ram into blobs of up to 16G each,
> and assign each blob a Type17 node.
> 
> We then split available ram into <4G and 4G+, and create up to two
> Type19 nodes for these two areas.

Yes.

> Now, re. e820: currently, the expectation is that the (up to) two
> Type19 nodes in the above figure correspond to (up to) two entries of
> type E820_RAM in the e820 table.

Yes.  If more e820 ram entries show up one day, additional type19 nodes
should be generated (i.e. basically simply loop over the e820 table).
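
A minimal sketch of such a loop follows, under stated assumptions: the
struct e820_entry layout and the E820_RAM value are modeled on the usual
(address, length, type) triple, struct t19_range and t19_from_e820 are
hypothetical names, and entries are assumed to be at least 1K in size,
KB-aligned, and to end below the 4T limit of the type19 address fields.

```c
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1

struct e820_entry {       /* assumed layout for this sketch */
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

struct t19_range {        /* hypothetical: only the two address fields */
    uint32_t start_kb;    /* starting address, in KB */
    uint32_t end_kb;      /* ending address, in KB, inclusive */
};

/* Emit one t19 range per E820_RAM entry; returns the number written. */
size_t t19_from_e820(const struct e820_entry *tbl, size_t n,
                     struct t19_range *out, size_t out_max)
{
    size_t i, count = 0;

    for (i = 0; i < n && count < out_max; i++) {
        if (tbl[i].type != E820_RAM || tbl[i].length == 0) {
            continue;     /* skip reserved, ACPI, and empty entries */
        }
        out[count].start_kb = (uint32_t)(tbl[i].address >> 10);
        out[count].end_kb =
            (uint32_t)(((tbl[i].address + tbl[i].length) >> 10) - 1);
        count++;
    }
    return count;
}
```

With today's two ram entries this yields the same two type19 ranges as the
below_4g/above_4g arithmetic.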

> Then, a type20 node is assigned to the sub-4G portion of the first
> Type17 "device", and another type20 node is assigned to the over-4G
> portion of the same.
> 
> From then on, type20 nodes correspond to the rest of the 16G-or-less
> type17 devices pretty much on a 1:1 basis.

Hmm, not sure why type20 entries are handled the way they are.  I think
it would make more sense to have one type20 entry per e820 ram entry,
similar to type19.

> If the e820 table will contain more than just two E820_RAM entries,
> and therefore we'll have more than the two Type19 nodes on the bottom
> row, what are the rules for extending the rest of the figure
> accordingly (i.e. how do we hook together more Type17 and Type20 nodes
> to go along with the extra Type19 nodes) ?

See above for type19+20.  type17 represents the dimms, so where the
memory is actually mapped doesn't matter there.  Let's simply sum up all
memory, then split into 16g pieces and create a type17 entry for each
piece.  At least initially.

As further improvement we could make the dimm size configurable, so if
you have a 4 node numa machine with 4g ram on each node you can present
4 virtual 4g ram dimms to the guest instead of a single 16g dimm.  But
that is clearly beyond the scope of the initial revision ...

cheers,
  Gerd

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-13  8:04 ` Gerd Hoffmann
@ 2014-03-13 14:37   ` Gabriel L. Somlo
  2014-03-13 15:36     ` Igor Mammedov
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel L. Somlo @ 2014-03-13 14:37 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: agraf, qemu-devel, armbru, alex.williamson, kevin, lersek

On Thu, Mar 13, 2014 at 09:04:52AM +0100, Gerd Hoffmann wrote:
>> Should we just assert((ram_size >> 10) < 0x80000000), and officially
>> limit guests to < 2T ?
> No.  Not fully sure what reasonable behavior would be in case more than
> 2T are present.  I guess either not generating type16 entries at all or
> simply fill in the maximum value we can represent.

Well, there's an "extended maximum capacity" field available starting
with smbios v2.7, which is an uint64_t counting bytes. Bumping the few
other types up to 2.7 shouldn't be too onerous, but I have no idea how
well the various currently supported OSs would react to smbios suddenly
going v2.7...
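
To make the options concrete, here is a sketch of how the type16 capacity
fields could be filled in. struct t16_capacity and t16_capacity_for are
hypothetical names for this example, not existing QEMU code. It assumes the
v2.7 convention that a maximum capacity of 0x80000000 means "look at the
extended field", and it saturates at 2T - 1K when only the pre-2.7 field is
available.

```c
#include <stdint.h>

#define T16_USE_EXTENDED 0x80000000u  /* marker: see extended field (v2.7+) */

struct t16_capacity {                    /* hypothetical helper struct */
    uint32_t maximum_capacity;           /* in KB */
    uint64_t extended_maximum_capacity;  /* in bytes, v2.7+ only */
};

struct t16_capacity t16_capacity_for(uint64_t ram_size, int have_v27)
{
    struct t16_capacity c = { 0, 0 };
    uint64_t kb = ram_size >> 10;

    if (kb < T16_USE_EXTENDED) {
        /* fits in the 32-bit KB field */
        c.maximum_capacity = (uint32_t)kb;
    } else if (have_v27) {
        /* too large: defer to the 64-bit byte count */
        c.maximum_capacity = T16_USE_EXTENDED;
        c.extended_maximum_capacity = ram_size;
    } else {
        /* pre-2.7: fill in the largest value we can represent, 2T - 1K */
        c.maximum_capacity = T16_USE_EXTENDED - 1;
    }
    return c;
}
```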

> > Then, a type20 node is assigned to the sub-4G portion of the first
> > Type17 "device", and another type20 node is assigned to the over-4G
> > portion of the same.
> > 
> > From then on, type20 nodes correspond to the rest of the 16G-or-less
> > type17 devices pretty much on a 1:1 basis.
> 
> Hmm, not sure why type20 entries are handled the way they are.  I think
> it would make more sense to have one type20 entry per e820 ram entry,
> similar to type19.

Type20 entries have pointers to type17 (memory_device_handle) and
type19 (memory_array_mapped_address_handle). Which, if you turn it
upside down could be interpreted as "every type 17 dimm needs (at
least) a type20 device mapped address to point at it".

> > If the e820 table will contain more than just two E820_RAM entries,
> > and therefore we'll have more than the two Type19 nodes on the bottom
> > row, what are the rules for extending the rest of the figure
> > accordingly (i.e. how do we hook together more Type17 and Type20 nodes
> > to go along with the extra Type19 nodes) ?
> 
> See above for type19+20.  type17 represents the dimms, so where the
> memory is actually mapped doesn't matter there.  Lets simply sum up all
> memory, then split into 16g pieces and create a type17 entry for each
> piece.  At least initially.

That's pretty much what happens now. If we decide to use e820 instead
of simply (below_4g, above_4g), I'd like to add some sort of assertion
that would alert anyone who might start adding extra entries into e820
beyond the current two (below_4g and above_4g) :)

> As further improvement we could make the dimm size configurable, so if
> you have a 4 node numa machine with 4g ram on each node you can present
> 4 virtual 4g ram dimms to the guest instead of a single 16g dimm.  But
> that is clearly beyond the scope of the initial revision ...

Minimum number of largest-possible power-of-two dimms per node, given
the size of RAM assigned to each node. Then we'd basically just
replicate the figure laterally, one instance per node (perhaps keeping
a common T16 on top, but having one T19 at the bottom per node, and 
one T17,T20 pair per DIMM):

                       t16
<------------------------------------------------------->
t17 t17 ... t17   t17 t17 ... t17   ...   t17 t17 ... t17
t20 t20 ... t20   t20 t20 ... t20   ...   t20 t20 ... t20
<------------->   <------------->         <------------->
      t19              t19                      t19
    (node 0)         (node 1)                 (node N)
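
One possible reading of "minimum number of largest-possible power-of-two
dimms" is a greedy split capped at 16G per dimm. The sketch below is only an
illustration of that reading; node_dimms is a hypothetical name, and the 16G
cap is the one discussed earlier in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define DIMM_MAX (16 * GiB)   /* largest power-of-two dimm size considered */

/* Fill out[] with dimm sizes (bytes) for one node; returns how many were
 * written. The result is truncated if out_max is too small. */
size_t node_dimms(uint64_t node_size, uint64_t *out, size_t out_max)
{
    size_t count = 0;
    uint64_t sz = DIMM_MAX;

    while (node_size > 0 && sz > 0 && count < out_max) {
        if (node_size >= sz) {
            out[count++] = sz;    /* take the largest dimm that still fits */
            node_size -= sz;
        } else {
            sz >>= 1;             /* otherwise try the next smaller power of two */
        }
    }
    return count;
}
```

A 4G node gets a single 4G dimm, a 6G node gets 4G + 2G, and a 40G node gets
16G + 16G + 8G.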

Would the 4G boundary issue still occur on a NUMA system (i.e., would
node 0 have two t19s, and two t20s for the first t17, just like my
current picture)? Do NUMA systems even have (or need) a smbios table ? :)

But I agree, this shouldn't have to be sorted out right away :)

--Gabriel

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-13 14:37   ` Gabriel L. Somlo
@ 2014-03-13 15:36     ` Igor Mammedov
  2014-03-13 19:01       ` Gabriel L. Somlo
  0 siblings, 1 reply; 9+ messages in thread
From: Igor Mammedov @ 2014-03-13 15:36 UTC (permalink / raw)
  To: Gabriel L. Somlo
  Cc: agraf, qemu-devel, armbru, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Thu, 13 Mar 2014 10:37:52 -0400
"Gabriel L. Somlo" <gsomlo@gmail.com> wrote:

> On Thu, Mar 13, 2014 at 09:04:52AM +0100, Gerd Hoffmann wrote:
> >> Should we just assert((ram_size >> 10) < 0x80000000), and officially
> >> limit guests to < 2T ?
> > No.  Not fully sure what reasonable behavior would be in case more than
> > 2T are present.  I guess either not generating type16 entries at all or
> > simply fill in the maximum value we can represent.
> 
> Well, there's an "extended maximum capacity" field available starting
> with smbios v2.7, which is an uint64_t counting bytes. Bumping the few
> other types up to 2.7 shouldn't be too onerous, but I have no idea how
> well the various currently supported OSs would react to smbios suddenly
> going v2.7...
> 
> > > Then, a type20 node is assigned to the sub-4G portion of the first
> > > Type17 "device", and another type20 node is assigned to the over-4G
> > > portion of the same.
> > > 
> > > From then on, type20 nodes correspond to the rest of the 16G-or-less
> > > type17 devices pretty much on a 1:1 basis.
> > 
> > Hmm, not sure why type20 entries are handled the way they are.  I think
> > it would make more sense to have one type20 entry per e820 ram entry,
> > similar to type19.
> 
> Type20 entries have pointers to type17 (memory_device_handle) and
> type19 (memory_array_mapped_address_handle). Which, if you turn it
> upside down could be interpreted as "every type 17 dimm needs (at
> least) a type20 device mapped address to point at it".
> 
> > > If the e820 table will contain more than just two E820_RAM entries,
> > > and therefore we'll have more than the two Type19 nodes on the bottom
> > > row, what are the rules for extending the rest of the figure
> > > accordingly (i.e. how do we hook together more Type17 and Type20 nodes
> > > to go along with the extra Type19 nodes) ?
> > 
> > See above for type19+20.  type17 represents the dimms, so where the
> > memory is actually mapped doesn't matter there.  Lets simply sum up all
> > memory, then split into 16g pieces and create a type17 entry for each
> > piece.  At least initially.
> 
> That's pretty much what happens now. If we decide to use e820 instead
> of simply (below_4g, above_4g), I'd like to add some sort of assertion
> that would alert anyone who might start adding extra entries into e820
> beyond the current two (below_4g and above_4g) :)

After memory hotplug is in, I might add e820 entries after above_4g
for hotpluggable DIMMDevices that are present at boot. They would have a 1:1
mapping, i.e. t19<-t20<-t17, and belong to only 1 node.

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-13 15:36     ` Igor Mammedov
@ 2014-03-13 19:01       ` Gabriel L. Somlo
  2014-03-14  9:28         ` Igor Mammedov
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel L. Somlo @ 2014-03-13 19:01 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: agraf, qemu-devel, armbru, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Thu, Mar 13, 2014 at 04:36:12PM +0100, Igor Mammedov wrote:
> 
> After memory hotplug is in I might add e820 entries after above_4g
> for present at boot hotpluggable DIMMDevices. They would have 1:1 mapping
> i.e. t19<-t20<-t17 and belong only to 1 node.

Any idea what the max size could be for each one of those ?

Thanks,
--Gabriel

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-13 19:01       ` Gabriel L. Somlo
@ 2014-03-14  9:28         ` Igor Mammedov
  2014-03-14 15:14           ` Gabriel Somlo
  0 siblings, 1 reply; 9+ messages in thread
From: Igor Mammedov @ 2014-03-14  9:28 UTC (permalink / raw)
  To: Gabriel L. Somlo
  Cc: qemu-devel, armbru, agraf, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Thu, 13 Mar 2014 15:01:16 -0400
"Gabriel L. Somlo" <gsomlo@gmail.com> wrote:

> On Thu, Mar 13, 2014 at 04:36:12PM +0100, Igor Mammedov wrote:
> > 
> > After memory hotplug is in I might add e820 entries after above_4g
> > for present at boot hotpluggable DIMMDevices. They would have 1:1 mapping
> > i.e. t19<-t20<-t17 and belong only to 1 node.
> 
> Any idea what the max size could be for each one of those ?
So far there isn't anything to limit the max size of a DIMMDevice except the
max RAM size allowed on the QEMU CLI at startup time.

> 
> Thanks,
> --Gabriel
> 
> 
> 


-- 
Regards,
  Igor

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-14  9:28         ` Igor Mammedov
@ 2014-03-14 15:14           ` Gabriel Somlo
  2014-03-14 17:51             ` Igor Mammedov
  0 siblings, 1 reply; 9+ messages in thread
From: Gabriel Somlo @ 2014-03-14 15:14 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, armbru, agraf, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Fri, Mar 14, 2014 at 10:28:30AM +0100, Igor Mammedov wrote:
> On Thu, 13 Mar 2014 15:01:16 -0400
> "Gabriel L. Somlo" <gsomlo@gmail.com> wrote:
> 
> > On Thu, Mar 13, 2014 at 04:36:12PM +0100, Igor Mammedov wrote:
> > > 
> > > After memory hotplug is in I might add e820 entries after above_4g
> > > for present at boot hotpluggable DIMMDevices. They would have 1:1 mapping
> > > i.e. t19<-t20<-t17 and belong only to 1 node.
> > 
> > Any idea what the max size could be for each one of those ?
> So far there isn't anything to limit the max size of a DIMMDevice except the
> max RAM size allowed on the QEMU CLI at startup time.

OK, so then it's more like:

<-----t19----->
t20 t20 ... t20
t17 t17 ... t17

since t17 is currently limited to 16G. Unless we went to smbios v2.7,
but that would require lots more external coordination (the bios still
generates the smbios entry point, where version is recorded). Right
now that's 2.4.

More questions about e820: 

1. is it safe to assume that E820_RAM (start_addr, size) entries are
non-overlapping and sorted by increasing start_addr ?

2. will there always be a below-4g entry ? if so, will it *and* the
next entry automatically be assumed to belong to the first node, and
only subsequently will there be one E820_RAM entry per node (for nodes
2, 3, etc) ? 

Thanks,
--Gabriel

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-14 15:14           ` Gabriel Somlo
@ 2014-03-14 17:51             ` Igor Mammedov
  2014-03-14 19:36               ` Gabriel Somlo
  0 siblings, 1 reply; 9+ messages in thread
From: Igor Mammedov @ 2014-03-14 17:51 UTC (permalink / raw)
  To: Gabriel Somlo
  Cc: armbru, agraf, qemu-devel, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Fri, 14 Mar 2014 11:14:35 -0400
Gabriel Somlo <gsomlo@gmail.com> wrote:

> On Fri, Mar 14, 2014 at 10:28:30AM +0100, Igor Mammedov wrote:
> > On Thu, 13 Mar 2014 15:01:16 -0400
> > "Gabriel L. Somlo" <gsomlo@gmail.com> wrote:
> > 
> > > On Thu, Mar 13, 2014 at 04:36:12PM +0100, Igor Mammedov wrote:
> > > > 
> > > > After memory hotplug is in I might add e820 entries after above_4g
> > > > for present at boot hotpluggable DIMMDevices. They would have 1:1 mapping
> > > > i.e. t19<-t20<-t17 and belong only to 1 node.
> > > 
> > > Any idea what the max size could be for each one of those ?
> > So far there isn't anything to limit the max size of a DIMMDevice except the
> > max RAM size allowed on the QEMU CLI at startup time.
> 
> OK, so then it's more like:
> 
> <-----t19----->
> t20 t20 ... t20
> t17 t17 ... t17
> 
> since t17 is currently limited to 16G. Unless we went to smbios v2.7,
> but that would require lots more external coordination (the bios still
> generates the smbios entry point, where version is recorded). Right
> now that's 2.4.
> 
> More questions about e820: 
> 
> 1. is it safe to assume that E820_RAM (start_addr, size) entries are
> non-overlapping and sorted by increasing start_addr ?
They might overlap, grep for e820_add_entry(). If you are interested in
what the kernel does with such a table, look for sanitize_e820_map() there.

Does SMBIOS/t17 actually care about shadowing parts of it by something
else in unrelated e820?
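
For what it's worth, a check along the following lines could tell whether
the assumption in question 1 holds for a given table. This is only a sketch:
the struct e820_entry layout and E820_RAM value are assumptions for the
example, e820_ram_sorted_disjoint is a hypothetical name, and it only
compares E820_RAM entries with each other, not with entries of other types.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define E820_RAM 1

struct e820_entry {       /* assumed layout for this sketch */
    uint64_t address;
    uint64_t length;
    uint32_t type;
};

/* True if the E820_RAM entries appear in increasing address order and
 * none of them overlaps the previous E820_RAM entry. */
bool e820_ram_sorted_disjoint(const struct e820_entry *tbl, size_t n)
{
    uint64_t prev_end = 0;    /* exclusive end of the last RAM entry seen */
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].type != E820_RAM) {
            continue;
        }
        if (tbl[i].address < prev_end) {
            return false;     /* out of order, or overlapping the previous one */
        }
        prev_end = tbl[i].address + tbl[i].length;
    }
    return true;
}
```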

> 
> 2. will there always be a below-4g entry ? if so, will it *and* the
> next entry automatically be assumed to belong to the first node, and
> only subsequently will there be one E820_RAM entry per node (for nodes
> 2, 3, etc) ? 
Once we have DIMMDevices, I'm planning to convert below-4g and above-4g to
a set of DIMMDevices; there will be at least 1 device per node, but there could
be more of them to satisfy different backend requirements like hugepage
size, alignment, etc.

BTW why do we care how smbios tables are built in relation to NUMA mapping?
They seem to be totally independent.

> Thanks,
> --Gabriel
> 


-- 
Regards,
  Igor

* Re: [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables)
  2014-03-14 17:51             ` Igor Mammedov
@ 2014-03-14 19:36               ` Gabriel Somlo
  0 siblings, 0 replies; 9+ messages in thread
From: Gabriel Somlo @ 2014-03-14 19:36 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: armbru, agraf, qemu-devel, alex.williamson, kevin, Gerd Hoffmann,
	lersek

On Fri, Mar 14, 2014 at 06:51:05PM +0100, Igor Mammedov wrote:
> > 1. is it safe to assume that E820_RAM (start_addr, size) entries are
> > non-overlapping and sorted by increasing start_addr ?
> They might overlap, grep for e820_add_entry(). If you are interested in
> what the kernel does with such a table, look for sanitize_e820_map() there.
> 
> Does SMBIOS/t17 actually care about shadowing parts of it by something
> else in unrelated e820?
> 
> > 
> > 2. will there always be a below-4g entry ? if so, will it *and* the
> > next entry automatically be assumed to belong to the first node, and
> > only subsequently will there be one E820_RAM entry per node (for nodes
> > 2, 3, etc) ? 
> Once we have DIMMDevices, I'm planning to convert below-4g and above-4g to
> a set of DIMMDevices; there will be at least 1 device per node, but there could
> be more of them to satisfy different backend requirements like hugepage
> size, alignment, etc.
> 
> BTW why do we care how smbios tables are built in relation to NUMA mapping?
> They seem to be totally independent.

Here's the context: I'm working on migrating smbios table generation
into QEMU, see:

http://lists.nongnu.org/archive/html/qemu-devel/2014-03/msg02473.html

Currently, to generate memory-related tables (t16, 17, 19, and 20),
SeaBIOS starts with ram_size, below_4g_mem_size, and
above_4g_mem_size, and does the following:

1. create a t16 to reflect all of ram_size (this currently only works up
   to 2T, btw).

2. create t17s in increments of 16G (max. t17 size allowed by currently
   supported smbios version, 2.4).

3. Create a t19 for [0..below_4g_mem_size], and, if applicable,
   another one for [4G..(4G+above_4g_mem_size)].

4. Create t20s, each of which needs a pointer to a t17 and a t19, so:

	- first t20 goes from [0..below_4g_mem_size] and points at
	  the first t17 and at the below_4g t19

	- second t20 goes from [4G..(16G - below_4g_mem_size)] and
	  still points at the first t17, but at the above_4g t19;

	- further t20s all point at the above_4g t19 and at one
	  t17 each, starting with the *second* t17.

Kinda like this:

  16G     16G   16G       16G
------------------------------

<-t17->   t17   t17  ...  t17    dimm devices

t20 t20   t20   t20  ...  t20    device address maps

t19 <----------t19---------->    memory address maps (E820_RAM entries ?)

------------------------------
<4G <------- over-4g ------->

t17s are "devices" or dimms, t20s are "device address maps", and
t19s are "memory address maps" or somesuch.
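
To make step 4 a bit more concrete, here is a sketch of how the t20 entries
could be derived from below_4g_mem_size and above_4g_mem_size. It is written
from the description above, not copied from SeaBIOS; struct t20_map and
build_t20 are hypothetical names, and it assumes RAM beyond below_4g is
mapped contiguously starting at 4G.

```c
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)
#define DEV_SIZE (16 * GiB)   /* size of each t17 dimm */

struct t20_map {          /* hypothetical summary of one t20 entry */
    uint64_t start;       /* guest-physical start, bytes */
    uint64_t size;        /* bytes */
    unsigned t17_idx;     /* dimm this range belongs to */
    unsigned t19_idx;     /* 0 = below-4G t19, 1 = above-4G t19 */
};

/* Walk RAM in 16G dimms; split any dimm that straddles the below_4g limit. */
size_t build_t20(uint64_t below_4g, uint64_t above_4g,
                 struct t20_map *out, size_t out_max)
{
    uint64_t ram_size = below_4g + above_4g, off = 0;
    size_t count = 0;
    unsigned dev = 0;

    while (off < ram_size) {
        uint64_t len = ram_size - off < DEV_SIZE ? ram_size - off : DEV_SIZE;
        uint64_t end = off + len;

        if (off < below_4g && count < out_max) {
            /* part of this dimm that is mapped below 4G */
            uint64_t lo_end = end < below_4g ? end : below_4g;
            out[count++] = (struct t20_map){ off, lo_end - off, dev, 0 };
        }
        if (end > below_4g && count < out_max) {
            /* part of this dimm that is mapped at 4G and up */
            uint64_t hi_off = off > below_4g ? off : below_4g;
            out[count++] = (struct t20_map){
                4 * GiB + (hi_off - below_4g), end - hi_off, dev, 1
            };
        }
        off = end;
        dev++;
    }
    return count;
}
```

With 3G below and 37G above, this produces four t20s: two for the first
t17 (one per t19), then one each for the second and third t17.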

Currently, t19s match exactly the two E820_RAM type entries in
the e820 table, and the subject under discussion is whether or not to
create one smbios t19 per E820_RAM entry from the e820 table, instead
of going with just two for below- vs above- 4Gig. And from there, how
to apportion more t20s and t17s to the additional t19s.

Sounds like if E820_RAM segments are allowed to overlap, creating t19s
for each one of them might not work out. But then again, it may be that
I'm missing something, which is probably a more likely scenario :) :)

Thanks,
--Gabriel

Thread overview: 9+ messages
2014-03-12 21:55 [Qemu-devel] SMBIOS vs. NUMA (was: Build full type 19 tables) Gabriel L. Somlo
2014-03-13  8:04 ` Gerd Hoffmann
2014-03-13 14:37   ` Gabriel L. Somlo
2014-03-13 15:36     ` Igor Mammedov
2014-03-13 19:01       ` Gabriel L. Somlo
2014-03-14  9:28         ` Igor Mammedov
2014-03-14 15:14           ` Gabriel Somlo
2014-03-14 17:51             ` Igor Mammedov
2014-03-14 19:36               ` Gabriel Somlo