xen-devel.lists.xenproject.org archive mirror
* [RFC] Extend the number of event channels available to guests
@ 2012-09-19 23:49 Attilio Rao
  2012-09-20  7:47 ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Attilio Rao @ 2012-09-19 23:49 UTC (permalink / raw)
  To: xen-devel@lists.xensource.com, Ian Campbell, Stefano Stabellini

Hello,
reported below is a request for comments on the plan to extend the 
number of event channels the hypervisor can handle. I've informally 
discussed some parts of this with Ian Campbell, but I would like to 
formalize it somehow, hear more opinions on it and possibly give the 
project more exposure and guidance, as this is supposed to be one of the 
major features for 4.3.

SYNOPSIS
Currently the number of event channels every guest can set up is 1k or 
4k, depending on whether the guest is 32- or 64-bit. This limits the 
number of guests a host can actively run, because the host and guests 
need to set up event channels between them, so at some point Dom0 will 
exhaust the event channels available for all guests. The scope of this 
work is to raise the number of event channels available to every guest 
(and thus also to Dom0).

The 4k figure comes directly from the event channel organization. In 
order to address a single channel, every guest keeps a map of 
corresponding bits in its page shared with the hypervisor. However, in 
order to avoid searching through 4k bits every time, a further per-CPU, 
upper-level mask is present to address individual smaller words of the 
pending event channel mask (making the code, for all intents and 
purposes, a two-level lookup table).

In order to expand the number of available event channels, one must take 
into account 2 important aspects, both related to compatibility: the 
ABI, and the ability to run the old and new methods side by side.
The former concerns the fact that all the controlling structures 
related to event channels live in the public ABI of the hypervisor. A 
valid solution, then, must not require any ABI changes at all.
The latter concerns the ability to let the hypervisor work with 
both the old model and the new one. This is to keep supporting guests 
running an older kernel than the patched one.

Proposal
The proposal is pretty simple: the event channel search will become a 
three-level lookup table, with the leaf level composed of shared 
pages registered at boot time by the guests.
The bitmap that now acts as the leaf (then called the "second level") 
will work either as the leaf level still (for older kernels) or as an 
intermediate level addressing into a new array of shared pages (for 
newer kernels). This leaves the possibility of reusing the existing 
mechanism without modifying its internals.

More specifically, what needs to happen:
- Add new members to struct domain to handle an array of pages (to 
contain the actual evtchn bitmaps), a further array of pages (to contain 
the evtchn masks) and a control bit saying whether the domain is subject 
to the new mode or not. Initially the arrays will be empty and the 
control bit will be OFF.
- At init_platform() time, the guest must allocate the pages composing 
the 2 arrays and invoke a new hypercall which, in broad terms, does the 
following:
   * Creates some pages to populate the new arrays in struct domain via 
alloc_xenheap_pages()
   * Recreates the mapping with the gpfn passed in by the guest, using 
basically guest_physmap_add_page()
   * Sets the control bit to ON
- Places that need to access the actual leaf bit (like, for example, 
xen_evtchn_do_upcall()) will need to check the control bit. If it 
is OFF they consider the second level to be the leaf; otherwise they 
will do a further lookup to get the bit from the new array of pages.

Of course there are some details still to be decided, for example:
* How many pages should the new level have? We can start by populating 
just one, for example.
* Who should really have the knowledge of how many pages to allocate? 
Likely the hypervisor should have a threshold, but in general we may 
want a posting mechanism that lets the guest ask the hypervisor 
beforehand and have its actual request satisfied.
* How many bits should each single bit in the second level indirect to 
in the third level? (a really minor factor, but still).

Please let me know what you think about this.

Thanks,
Attilio


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-19 23:49 [RFC] Extend the number of event channels available to guests Attilio Rao
@ 2012-09-20  7:47 ` Jan Beulich
  2012-09-20  7:55   ` Ian Campbell
  2012-09-20 14:05   ` Attilio Rao
  0 siblings, 2 replies; 7+ messages in thread
From: Jan Beulich @ 2012-09-20  7:47 UTC (permalink / raw)
  To: Attilio Rao; +Cc: xen-devel, Ian Campbell, Stefano Stabellini

>>> On 20.09.12 at 01:49, Attilio Rao <attilio.rao@citrix.com> wrote:
> Proposal
> The proposal is pretty simple: the eventchannel search will become a 
> three-level lookup table, with the leaf level being composed by shared 
> pages registered at boot time by the guests.
> The bitmap working now as leaf (then called "second level") will work 
> alternatively as leaf level still (for older kernel) or for intermediate 
> level to address into a new array of shared pages (for newer kernels). 
> This leaves the possibility to reuse the existing mechanisms without 
> modifying its internals.

While adding one level would seem to leave ample room, so did
the original 4096. Therefore, even if unimplemented right now,
I'd like the interface to allow the guest to specify more levels.

> More specifically, what needs to happen:
> - Add new members to struct domain to handle an array of pages (to 
> contain the actual evtchn bitmaps), a further array of pages (to contain 
> the evtchn masks) and a control bit to say if it is subjective to the 
> new mode or not. Initially the arrays will be empty and the control bit 
> will be OFF.
> - At init_platform() time, the guest must allocate the pages to compose 
> the 2 arrays and invoke a novel hypercall which, at big lines, does the 
> following:
>    * Creates some pages to populate the new arrays in struct domain via 
> alloc_xenheap_pages()

Why? The guest allocated the pages already. Just have the
hypervisor map them (similar, but without the per-vCPU needs,
to registering an alternative per-vCPU shared page). Whether
it turns out more practical to require the guest to enforce
certain restrictions (like the pages being contiguous and/or
address restricted) is a secondary aspect.

>    * Recreates the mapping with the gpfn passed from the userland, using 
> basically guest_physmap_add_page()

This would then be superfluous.

>    * Sets the control bit to ON
> - Places that need to access to the actual leaf bit (like, for example, 
> xen_evtchn_do_upcall()) will need to double check the control bit. If it 
> is OFF they consider the second level as the leaf one, otherwise they 
> will do a further lookup to get the bit from the new array of pages.

Just like for variable depth page tables - if at all possible, just
make the accesses variable depth, so that all you need to track
on a per-domain basis is the depth of the tree.

> Of course there are some nits to be decided yet, like, for example:
> * How many pages should the new level have? We can start by populating 
> just one, for example

Just let the guest specify this (and error if the number is too large).

> * Who should have really the knowledge of how many pages to allocate? 
> Likely the hypervisor should have a threshhold, but in general we may 
> want to have a posting mechanism to have the guest ask the hypervisor 
> before-hand and satisfy its actual request

Same here (this is really the same with the previous item, if you
follow the earlier suggestions).

> * How many bits should be indirected in the third-level by every single 
> bit in the second-level? (that is a really minor factor, but still).

The tree should clearly be uniform (i.e. having a factor of
BITS_PER_LONG per level), just like it is now. For 64-bit guests,
this would mean 256k channels with 3 levels (32k for 32-bit
guests).

One aspect to also consider is migration - will the guest have to
re-issue the extending hypercall, or will this be taken care of for
it? If the former approach is chosen, would the guest be
expected to deal with not being able to set up the extension
again on the new host?

And another important (but implementation only) aspect not to
forget is making domain_dump_evtchn_info() scale with the
then much higher amount of dumping potentially to be done (i.e.
not just extend it to cope with the count, but also make sure it
properly allows softirqs to be handled, which in turn requires to
not hold the event lock across the whole loop).

Jan


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-20  7:47 ` Jan Beulich
@ 2012-09-20  7:55   ` Ian Campbell
  2012-09-20  8:06     ` Jan Beulich
  2012-09-20 14:05   ` Attilio Rao
  1 sibling, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2012-09-20  7:55 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Attilio Rao, xen-devel, Stefano Stabellini

On Thu, 2012-09-20 at 08:47 +0100, Jan Beulich wrote:
> One aspect to also consider is migration - will the guest have to
> re-issue the extending hypercall, or will this be taken care of for
> it? If the former approach is chosen, would the guest be
> expected to deal with not being able to set up the extension
> again on the new host? 

We only properly care about N->N and N->N+1 migrations; if we think
that we would never reduce the limit over a hypervisor version upgrade,
then this would be fine, I think.

We do a similar thing with the grant table v2 stuff, I think, i.e. panic
after migration if we can't set up the feature again.

Ian.


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-20  7:55   ` Ian Campbell
@ 2012-09-20  8:06     ` Jan Beulich
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Beulich @ 2012-09-20  8:06 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Attilio Rao, xen-devel, Stefano Stabellini

>>> On 20.09.12 at 09:55, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Thu, 2012-09-20 at 08:47 +0100, Jan Beulich wrote:
>> One aspect to also consider is migration - will the guest have to
>> re-issue the extending hypercall, or will this be taken care of for
>> it? If the former approach is chosen, would the guest be
>> expected to deal with not being able to set up the extension
>> again on the new host? 
> 
> We only properly care about N->N and N->N+1 migrations, if we think that
> we would never reduce the limit over a hypervisor version upgrade then
> this would be fine, I think.
> 
> We do a similar thing with the grant table v2 stuff I think, i.e. panic
> after migration if we can't setup the feature again.

Which doesn't sound right - if hypervisor/tools did the re-setup,
then migration could fail in a recoverable way (i.e. continuing to
run on the old host) instead. After all, this may not be just about
feature availability but - especially if indeed there were multiple
pages to be allocated by the hypervisor - resource constraints.

Jan


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-20  7:47 ` Jan Beulich
  2012-09-20  7:55   ` Ian Campbell
@ 2012-09-20 14:05   ` Attilio Rao
  2012-09-20 15:42     ` Jan Beulich
  1 sibling, 1 reply; 7+ messages in thread
From: Attilio Rao @ 2012-09-20 14:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Ian Campbell, Stefano Stabellini

On 20/09/12 08:47, Jan Beulich wrote:
>>>> On 20.09.12 at 01:49, Attilio Rao<attilio.rao@citrix.com>  wrote:
>>>>          
>> Proposal
>> The proposal is pretty simple: the eventchannel search will become a
>> three-level lookup table, with the leaf level being composed by shared
>> pages registered at boot time by the guests.
>> The bitmap working now as leaf (then called "second level") will work
>> alternatively as leaf level still (for older kernel) or for intermediate
>> level to address into a new array of shared pages (for newer kernels).
>> This leaves the possibility to reuse the existing mechanisms without
>> modifying its internals.
>>      
> While adding one level would seem to leave ample room, so did
> the originally 4096 originally. Therefore, even if unimplemented
> right now, I'd like the interface to allow for the guest to specify
> more levels.
>    

There is a big difference here. The third/new level will be composed of 
pages registered at guest initialization, so it can be expanded on 
demand. The second level we have now cannot grow because it is fixed in 
the immutable ABI.
The only useful reason to have yet another level would be if we thought 
the second level were not enough to address all the necessary bits in 
the third level efficiently.

To give an example, the first level is 64 bits, while the second level 
can address 64 times the first level. The third level, to stay on par 
with the second level's ratio in terms of performance, would be 
something like 4 pages large. I think we are very far from reaching 
critical levels.

>    
>> More specifically, what needs to happen:
>> - Add new members to struct domain to handle an array of pages (to
>> contain the actual evtchn bitmaps), a further array of pages (to contain
>> the evtchn masks) and a control bit to say if it is subjective to the
>> new mode or not. Initially the arrays will be empty and the control bit
>> will be OFF.
>> - At init_platform() time, the guest must allocate the pages to compose
>> the 2 arrays and invoke a novel hypercall which, at big lines, does the
>> following:
>>     * Creates some pages to populate the new arrays in struct domain via
>> alloc_xenheap_pages()
>>      
> Why? The guest allocated the pages already. Just have the
> hypervisor map them (similar, but without the per-vCPU needs,
> to registering an alternative per-vCPU shared page). Whether
> it turns out more practical to require the guest to enforce
> certain restrictions (like the pages being contiguous and/or
> address restricted) is a secondary aspect.
>    

Actually, what I propose seems to be what in fact happens in the shared 
page case. Look at what arch_domain_create() and the XENMEM_add_to_physmap 
hypercall do (in the XENMAPSPACE_shared_info case). I think this is the 
quickest way to get what we want.

>    
>>     * Recreates the mapping with the gpfn passed from the userland, using
>> basically guest_physmap_add_page()
>>      
> This would then be superfluous.
>
>    
>>     * Sets the control bit to ON
>> - Places that need to access to the actual leaf bit (like, for example,
>> xen_evtchn_do_upcall()) will need to double check the control bit. If it
>> is OFF they consider the second level as the leaf one, otherwise they
>> will do a further lookup to get the bit from the new array of pages.
>>      
> Just like for variable depth page tables - if at all possible, just
> make the accesses variable depth, so that all you need to track
> on a per-domain basis is the depth of the tree.
>    

I agree.

>    
>> Of course there are some nits to be decided yet, like, for example:
>> * How many pages should the new level have? We can start by populating
>> just one, for example
>>      
> Just let the guest specify this (and error if the number is too large).
>    

I agree.

>    
>> * Who should have really the knowledge of how many pages to allocate?
>> Likely the hypervisor should have a threshhold, but in general we may
>> want to have a posting mechanism to have the guest ask the hypervisor
>> before-hand and satisfy its actual request
>>      
> Same here (this is really the same with the previous item, if you
> follow the earlier suggestions).
>
>    
>> * How many bits should be indirected in the third-level by every single
>> bit in the second-level? (that is a really minor factor, but still).
>>      
> The tree should clearly be uniform (i.e. having a factor of
> BITS_PER_LONG per level), just like it is now. For 64-bit guests,
> this would mean 256k channels with 3 levels (32k for 32-bit
> guests).
>
> One aspect to also consider is migration - will the guest have to
> re-issue the extending hypercall, or will this be taken care of for
> it? If the former approach is chosen, would the guest be
> expected to deal with not being able to set up the extension
> again on the new host?
>    

I think this could also be handled with some trickery by switching the 
control bit off. I need to assess the races involved, because we are no 
longer in the "domain startup" case.

> And another important (but implementation only) aspect not to
> forget is making domain_dump_evtchn_info() scale with the
> then much higher amount of dumping potentially to be done (i.e.
> not just extend it to cope with the count, but also make sure it
> properly allows softirqs to be handled, which in turn requires to
> not hold the event lock across the whole loop).
>
>    

I haven't looked into it yet, but thanks for pointing it out.

Attilio


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-20 14:05   ` Attilio Rao
@ 2012-09-20 15:42     ` Jan Beulich
  2012-09-20 22:05       ` Attilio Rao
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2012-09-20 15:42 UTC (permalink / raw)
  To: Attilio Rao; +Cc: xen-devel, Ian Campbell, Stefano Stabellini

>>> On 20.09.12 at 16:05, Attilio Rao <attilio.rao@citrix.com> wrote:
> On 20/09/12 08:47, Jan Beulich wrote:
>>>>> On 20.09.12 at 01:49, Attilio Rao<attilio.rao@citrix.com>  wrote:
>>>>>          
>>> Proposal
>>> The proposal is pretty simple: the eventchannel search will become a
>>> three-level lookup table, with the leaf level being composed by shared
>>> pages registered at boot time by the guests.
>>> The bitmap working now as leaf (then called "second level") will work
>>> alternatively as leaf level still (for older kernel) or for intermediate
>>> level to address into a new array of shared pages (for newer kernels).
>>> This leaves the possibility to reuse the existing mechanisms without
>>> modifying its internals.
>>>      
>> While adding one level would seem to leave ample room, so did
>> the originally 4096 originally. Therefore, even if unimplemented
>> right now, I'd like the interface to allow for the guest to specify
>> more levels.
>>    
> 
> There is a big difference here. The third/new level will be composed of 
> pages registered at guest installing so it can be expanded on demanded 
> necessity. The second-level we have now doesn't work because it is stuck 
> in the immutable ABI.
> The only useful way to have another level would be in the case we think 
> the second-level is not enough to address all the necessary bits in the 
> third level in efficient way.
> 
> To make you an example, the first level is 64 bits while the second 
> level can address 64 times the first level. The third level, to be 
> on-par with the same ratio of the second level in terms of performance, 
> would be large something like 4 pages. I think we are very far from 
> reaching critical levels.

What I'm saying is that further levels should continue at the same
rate, i.e. times BITS_PER_LONG per level. Allowing for an only
partially populated leaf level is certainly an option. But similarly
it should be an option to have a fourth level once needed, without
having to start over from scratch again.

>>> More specifically, what needs to happen:
>>> - Add new members to struct domain to handle an array of pages (to
>>> contain the actual evtchn bitmaps), a further array of pages (to contain
>>> the evtchn masks) and a control bit to say if it is subjective to the
>>> new mode or not. Initially the arrays will be empty and the control bit
>>> will be OFF.
>>> - At init_platform() time, the guest must allocate the pages to compose
>>> the 2 arrays and invoke a novel hypercall which, at big lines, does the
>>> following:
>>>     * Creates some pages to populate the new arrays in struct domain via
>>> alloc_xenheap_pages()
>>>      
>> Why? The guest allocated the pages already. Just have the
>> hypervisor map them (similar, but without the per-vCPU needs,
>> to registering an alternative per-vCPU shared page). Whether
>> it turns out more practical to require the guest to enforce
>> certain restrictions (like the pages being contiguous and/or
>> address restricted) is a secondary aspect.
>>    
> 
> Actually what I propose seems to be what happens infact in the shared 
> page case. Look at what arch_domain_create() and XENMEM_add_to_physmap 
> hypercall do (in the XENMAPSPACE_shared_info case). I think this is the 
> quicker way to get what we want.

This is HVM-only thinking. PV doesn't use this, and I don't think
artificially inserting something somewhere in the physmap of a
PV guest is a good idea either. To have things done uniformly,
going the PV route and using guest allocated pages seems the
better choice to me. Alternatively, you'd have to implement a
HVM mechanism (via add-to-physmap) and a PV one.

Plus the add-to-physmap one has the drawback of limiting the
space available for adding pages (as these would generally
have to go into the MMIO space of the platform PCI device).

Jan


* Re: [RFC] Extend the number of event channels available to guests
  2012-09-20 15:42     ` Jan Beulich
@ 2012-09-20 22:05       ` Attilio Rao
  0 siblings, 0 replies; 7+ messages in thread
From: Attilio Rao @ 2012-09-20 22:05 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel, Ian Campbell, Stefano Stabellini

On 20/09/12 16:42, Jan Beulich wrote:
>>>> On 20.09.12 at 16:05, Attilio Rao<attilio.rao@citrix.com>  wrote:
>>>>          
>> On 20/09/12 08:47, Jan Beulich wrote:
>>      
>>>>>> On 20.09.12 at 01:49, Attilio Rao<attilio.rao@citrix.com>   wrote:
>>>>>>
>>>>>>              
>>>> Proposal
>>>> The proposal is pretty simple: the eventchannel search will become a
>>>> three-level lookup table, with the leaf level being composed by shared
>>>> pages registered at boot time by the guests.
>>>> The bitmap working now as leaf (then called "second level") will work
>>>> alternatively as leaf level still (for older kernel) or for intermediate
>>>> level to address into a new array of shared pages (for newer kernels).
>>>> This leaves the possibility to reuse the existing mechanisms without
>>>> modifying its internals.
>>>>
>>>>          
>>> While adding one level would seem to leave ample room, so did
>>> the originally 4096 originally. Therefore, even if unimplemented
>>> right now, I'd like the interface to allow for the guest to specify
>>> more levels.
>>>
>>>        
>> There is a big difference here. The third/new level will be composed of
>> pages registered at guest installing so it can be expanded on demanded
>> necessity. The second-level we have now doesn't work because it is stuck
>> in the immutable ABI.
>> The only useful way to have another level would be in the case we think
>> the second-level is not enough to address all the necessary bits in the
>> third level in efficient way.
>>
>> To make you an example, the first level is 64 bits while the second
>> level can address 64 times the first level. The third level, to be
>> on-par with the same ratio of the second level in terms of performance,
>> would be large something like 4 pages. I think we are very far from
>> reaching critical levels.
>>      
> What I'm saying is that further levels should be continuing at the
> rate, i.e. times BITS_PER_LONG per level. Allowing for an only
> partially populated leaf level is certainly an option. But similarly
> it should be an option to have a fourth level once needed, without
> having to start over from scratch again.
>    

Yes, I agree, but I don't see a big problem here, besides having a way 
to specify which level the pages should compose and dealing with them 
accordingly.
The only difference is that maybe we would end up building a sort of 
container for such a topology, to deal with a multi-level table. I think 
it will not be too difficult to do, but I would leave this as the very 
last item, once the "third level" already works ok.

>    
>>>> More specifically, what needs to happen:
>>>> - Add new members to struct domain to handle an array of pages (to
>>>> contain the actual evtchn bitmaps), a further array of pages (to contain
>>>> the evtchn masks) and a control bit to say if it is subjective to the
>>>> new mode or not. Initially the arrays will be empty and the control bit
>>>> will be OFF.
>>>> - At init_platform() time, the guest must allocate the pages to compose
>>>> the 2 arrays and invoke a novel hypercall which, at big lines, does the
>>>> following:
>>>>      * Creates some pages to populate the new arrays in struct domain via
>>>> alloc_xenheap_pages()
>>>>
>>>>          
>>> Why? The guest allocated the pages already. Just have the
>>> hypervisor map them (similar, but without the per-vCPU needs,
>>> to registering an alternative per-vCPU shared page). Whether
>>> it turns out more practical to require the guest to enforce
>>> certain restrictions (like the pages being contiguous and/or
>>> address restricted) is a secondary aspect.
>>>
>>>        
>> Actually what I propose seems to be what happens infact in the shared
>> page case. Look at what arch_domain_create() and XENMEM_add_to_physmap
>> hypercall do (in the XENMAPSPACE_shared_info case). I think this is the
>> quicker way to get what we want.
>>      
> This is HVM-only thinking. PV doesn't use this, and I don't think
> artificially inserting something somewhere in the physmap of a
> PV guest is a good idea either. To have things done uniformly,
> going the PV route and using guest allocated pages seems the
> better choice to me. Alternatively, you'd have to implement a
> HVM mechanism (via add-to-physmap) and a PV one.
>
> Plus the add-to-physmap one has the drawback of limiting the
> space available for adding pages (as these would generally
> have to go into the MMIO space of the platform PCI device).
>
>    

On second thought, I think I can use something very similar to the 
sharing mechanism of the grant tables, basically modeled on 
grant_table_create() and the subsequent gnttab_setup_table() mapping 
creation. This should also work in the PV case.

Attilio

