NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests)

* NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests)
@ 2010-04-23 12:46 Andre Przywara
  2010-04-23 14:09 ` Dan Magenheimer
  0 siblings, 1 reply; 3+ messages in thread
From: Andre Przywara @ 2010-04-23 12:46 UTC (permalink / raw)
  To: Cui, Dexuan, Dulloor, xen-devel; +Cc: Nakajima, Jun

Hi,

Yesterday Dulloor, Jun and I discussed the NUMA guest configuration
scheme and came to the following conclusions:
1. The configuration would be the same for HVM and PV guests; only the
internal method of propagation would differ.
2. We want to make it as easy as possible, with best performance out of 
the box as the design goal. Another goal is predictable performance.
3. We (at least for now) omit more sophisticated tuning options (exact 
user-driven description of the guest's topology), so the guest's 
resources are split equally across the guest nodes.
4. We have three basic strategies:
  - CONFINE: let the guest use only one node. If that does not work, fail.
  - SPLIT: allocate resources from multiple nodes, inject a NUMA 
topology into the guest (includes PV querying via hypercall). If the 
guest is paravirtualized and does not know about NUMA (missing ELF 
hint): fail.
  - STRIPE: allocate the memory in an interleaved way from multiple 
nodes, don't tell the guest about NUMA at all.

If any one of the above strategies is explicitly specified in the config
file and it cannot be met, then the guest creation will fail.
A fourth option would be the default: AUTOMATIC. This will try the three
strategies one after another (order: CONFINE, SPLIT, STRIPE). If one
fails, the next will be tried (this will never use striping for HVM guests).
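
The fallback order described above could be sketched roughly as follows.
This is only an illustration, not code from the patch series: the names
Strategy, feasible() and choose_strategy() are hypothetical, and the
feasibility checks are a toy model that looks at free memory per node
only (total free memory is used as a crude stand-in for "resources can
be allocated from multiple nodes").

```python
from enum import Enum


class Strategy(Enum):
    CONFINE = "confine"
    SPLIT = "split"
    STRIPE = "stripe"
    AUTOMATIC = "automatic"


def feasible(strategy, free_per_node, guest_mem, is_hvm, pv_numa_aware):
    """Toy feasibility check (assumption: only free memory matters)."""
    if strategy is Strategy.CONFINE:
        # The whole guest has to fit on one node.
        return any(free >= guest_mem for free in free_per_node)
    if strategy is Strategy.SPLIT:
        # A PV guest without the NUMA ELF hint cannot use an injected topology.
        if not is_hvm and not pv_numa_aware:
            return False
        return len(free_per_node) > 1 and sum(free_per_node) >= guest_mem
    if strategy is Strategy.STRIPE:
        return len(free_per_node) > 1 and sum(free_per_node) >= guest_mem
    raise ValueError(f"not a basic strategy: {strategy}")


def choose_strategy(requested, free_per_node, guest_mem, is_hvm,
                    pv_numa_aware=False):
    """Return the strategy to use, or raise if guest creation must fail."""
    if requested is not Strategy.AUTOMATIC:
        # An explicitly requested strategy is never silently replaced.
        if feasible(requested, free_per_node, guest_mem, is_hvm, pv_numa_aware):
            return requested
        raise RuntimeError(f"{requested.name} cannot be met; guest creation fails")

    # AUTOMATIC: CONFINE, then SPLIT, then STRIPE (never STRIPE for HVM).
    candidates = [Strategy.CONFINE, Strategy.SPLIT]
    if not is_hvm:
        candidates.append(Strategy.STRIPE)
    for strategy in candidates:
        if feasible(strategy, free_per_node, guest_mem, is_hvm, pv_numa_aware):
            return strategy
    raise RuntimeError("no strategy can be met; guest creation fails")
```

With two nodes of 4096 MB free each, an AUTOMATIC HVM guest of 3000 MB
would end up CONFINEd, one of 6000 MB would be SPLIT, and a PV guest of
6000 MB without the ELF hint would fall through to STRIPE.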

5. The number of guest nodes is internally specified via a min/max pair. 
By default min is 1, max is the number of system nodes. The algorithm 
will try to use the smallest possible number of nodes.
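
As a rough sketch of "use the smallest possible number of nodes" under
the equal-split rule from point 3: the function below is hypothetical
(pick_nodes() is not an existing interface) and assumes that each guest
node maps to exactly one host node and that free memory is the only
constraint.

```python
def pick_nodes(free_per_node, guest_mem, min_nodes=1, max_nodes=None):
    """Return the host nodes to use, trying the smallest node count first.

    free_per_node: free memory per host node (same unit as guest_mem).
    Returns a list of host node indices, or None if no count in
    [min_nodes, max_nodes] works.
    """
    if max_nodes is None:
        max_nodes = len(free_per_node)  # default: number of system nodes
    max_nodes = min(max_nodes, len(free_per_node))

    # Prefer the nodes with the most free memory.
    by_free = sorted(range(len(free_per_node)),
                     key=lambda i: free_per_node[i], reverse=True)

    for n in range(min_nodes, max_nodes + 1):
        share = -(-guest_mem // n)  # equal split, rounded up
        chosen = by_free[:n]
        if all(free_per_node[i] >= share for i in chosen):
            return chosen
    return None
```

Setting min_nodes equal to max_nodes gives the "exact number of guest
nodes" behaviour, while the defaults give the "as few as possible"
behaviour.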

The question remaining is whether we want to expose this pair to the user:
  - For predictable performance we want to specify an exact number of 
guest nodes, so set min=max=<number of nodes>
  - For best performance, the number of nodes should be as small as
possible, so min is always 1. For the explicit CONFINE strategy, max
would also be 1; for AUTOMATIC it should be as few as possible, which
is already built into the algorithm.
So it is not clear if "max nodes" is a useful option. If it were to serve
as an upper bound, then it is questionable whether
"failing-if-not-possible" is a useful result.

So maybe we can get by with just one (optional) value: guestnodes.
This will be useful in the SPLIT case, where it specifies the number of
nodes the guest sees (for predictable performance). CONFINE internally
overrides this value with "1". To impose a limit on the number of nodes,
one would choose "AUTOMATIC" and set guestnodes to this number. If
single-node allocations fail, it will use as few nodes as possible, not
exceeding the specified number.
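
One way to read this proposal is as a mapping from the single
user-visible guestnodes value onto the internal min/max pair. The sketch
below is an assumption about how that could look; node_bounds() is a
hypothetical helper, and the behaviour for SPLIT or STRIPE without
guestnodes (falling back to the default pair) is a guess, since the
proposal does not spell it out.

```python
def node_bounds(strategy, host_nodes, guestnodes=None):
    """Translate (strategy, guestnodes) into an internal (min, max) pair."""
    if guestnodes is not None and guestnodes < 1:
        raise ValueError("guestnodes must be at least 1")
    if strategy == "CONFINE":
        # CONFINE overrides any guestnodes value with 1.
        return (1, 1)
    if strategy == "SPLIT" and guestnodes is not None:
        # Exact node count for predictable performance.
        return (guestnodes, guestnodes)
    if strategy == "AUTOMATIC" and guestnodes is not None:
        # guestnodes acts as an upper limit; fewer nodes are preferred.
        return (1, guestnodes)
    # Default pair: min is 1, max is the number of system nodes.
    return (1, host_nodes)
```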

Please comment on this.

Thanks and regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448-3567-12

* RE: NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests)
  2010-04-23 12:46 NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests) Andre Przywara
@ 2010-04-23 14:09 ` Dan Magenheimer
  2010-04-23 17:08   ` Nakajima, Jun
  0 siblings, 1 reply; 3+ messages in thread
From: Dan Magenheimer @ 2010-04-23 14:09 UTC (permalink / raw)
  To: Andre Przywara, Cui, Dexuan, Dulloor, xen-devel; +Cc: Nakajima, Jun

While I like the direction this is going, please try to extend
your model to cover the cases of ballooning and live-migration.
For example, for "CONFINE", ballooning should probably be
disallowed as pages surrendered on "this node" via ballooning
may be only recoverable later on a different node.  Similarly,
creating a CONFINE guest is defined to fail if there is
insufficient memory on any node... will live migration to a
different physical machine similarly fail, even if an administrator
explicitly requests it?

In general, communicating NUMA topology to a guest is a "performance
thing" and ballooning and live-migration are "flexibility things";
and performance and flexibility mix like oil and water.

* RE: NUMA guest config options (was: Re: [PATCH 00/11] PV NUMA Guests)
  2010-04-23 14:09 ` Dan Magenheimer
@ 2010-04-23 17:08   ` Nakajima, Jun
  0 siblings, 0 replies; 3+ messages in thread
From: Nakajima, Jun @ 2010-04-23 17:08 UTC (permalink / raw)
  To: Dan Magenheimer, Andre Przywara, Cui, Dexuan, Dulloor, xen-devel

Dan Magenheimer wrote on Fri, 23 Apr 2010 at 07:09:57:

> While I like the direction this is going, please try to extend
> your model to cover the cases of ballooning and live-migration.
> For example, for "CONFINE", ballooning should probably be
> disallowed as pages surrendered on "this node" via ballooning
> may be only recoverable later on a different node.  Similarly,
> creating a CONFINE guest is defined to fail if there is
> insufficient memory on any node... will live migration to a
> different physical machine similarly fail, even if an administrator
> explicitly requests it?

For the memory resources, I think we can use the same allocation
strategies for the live-migration cases. In general, the memory
allocation strategy should be _inherited_ upon live-migration. For
example, if a NUMA guest is created with SPLIT, it is guaranteed that
the memory allocation strategy will be SPLIT at live-migration time. If
AUTOMATIC is used, the memory topology may change upon live-migration.
The guest will continue to run, but the NUMA memory topology in the
guest may not reflect the underlying hardware conditions. We may be able
to use some PV technique to re-initialize the memory subsystem in the
guest when updating the memory topology.
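
A minimal sketch of this inheritance rule, under stated assumptions:
strategy_after_migration() is a hypothetical helper, the set of
strategies the target host can satisfy is supplied by the caller (who
would leave out STRIPE for an HVM guest), and failing the migration when
an explicit strategy cannot be met is only one possible policy that this
thread does not settle.

```python
BASIC = ("CONFINE", "SPLIT", "STRIPE")


def strategy_after_migration(configured, feasible_on_target):
    """Return (strategy, topology_may_change) for the target host.

    configured: strategy the guest was created with.
    feasible_on_target: set of basic strategies the target can satisfy.
    """
    if configured in BASIC:
        # Explicit strategies are inherited unchanged.
        if configured not in feasible_on_target:
            # Assumption: refuse rather than switch strategy silently.
            raise RuntimeError(f"{configured} cannot be met on the target host")
        return configured, False
    if configured != "AUTOMATIC":
        raise ValueError(f"unknown strategy: {configured}")
    # AUTOMATIC re-runs the selection, so the guest's view of the NUMA
    # topology may no longer match the hardware after migration.
    for strategy in BASIC:
        if strategy in feasible_on_target:
            return strategy, True
    raise RuntimeError("no strategy can be met on the target host")
```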


Jun
___
Intel Open Source Technology Center
