* x86's context switch ordering of operations
@ 2008-04-29 12:39 Jan Beulich
2008-04-29 12:50 ` Keir Fraser
2008-04-29 17:03 ` Jeremy Fitzhardinge
0 siblings, 2 replies; 7+ messages in thread
From: Jan Beulich @ 2008-04-29 12:39 UTC
To: xen-devel
In the process of devising a reasonable mechanism to support some
advanced debugging features for pv guests (last exception record
MSRs, last branch stack MSRs after #DE, DS area), I was considering
adding another shared state area (similar to the relocated vCPU info,
but read-only to the guest and not permanently mapped). The
hypervisor could store there relevant information that could otherwise
get destroyed before the guest is able to pick it up, as well as
state the CPU is to use but which the guest must not be able to modify
directly. The area should be extensible to a reasonable degree to
support future hardware enhancements.
To do so, I was considering using {un,}map_domain_page() from
the context switch path, but there are two major problems with the
ordering of operations:
- for the outgoing task, 'current' is being changed before the
ctxt_switch_from() hook is being called
- for the incoming task, write_ptbase() happens only after the
ctxt_switch_to() hook was already called
I'm wondering whether there are hidden dependencies that require
this particular (somewhat non-natural) ordering.
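A minimal sketch of the ordering described above, written as a toy C model. Only the relative order of the three steps is taken from the description; toy_start(), toy_context_switch(), the event trace and the stand-in variables are illustrative assumptions, not Xen code.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model: names mirror the thread, but none of this is Xen code. */
struct vcpu { const char *name; };

enum step { CTXT_SWITCH_FROM, CTXT_SWITCH_TO, WRITE_PTBASE };

/* What each step could observe at the moment it ran. */
struct event {
    enum step step;
    const struct vcpu *cur;     /* value of 'current' at that point */
    const struct vcpu *ptbase;  /* owner of the live page tables */
};

#define MAX_EVENTS 8
static struct event trace[MAX_EVENTS];
static int n_events;

static const struct vcpu *cur;     /* stands in for 'current' */
static const struct vcpu *ptbase;  /* stands in for the loaded page-table base */

static void record(enum step s)
{
    assert(n_events < MAX_EVENTS);
    trace[n_events].step = s;
    trace[n_events].cur = cur;
    trace[n_events].ptbase = ptbase;
    n_events++;
}

/* Begin with 'running' on the CPU and an empty trace. */
static void toy_start(const struct vcpu *running)
{
    cur = ptbase = running;
    n_events = 0;
}

static void toy_context_switch(const struct vcpu *next)
{
    cur = next;                 /* 'current' is changed first */
    record(CTXT_SWITCH_FROM);   /* outgoing hook already sees current == next */
    record(CTXT_SWITCH_TO);     /* incoming hook still runs on the old page tables */
    ptbase = next;              /* write_ptbase() happens last */
    record(WRITE_PTBASE);
}

/* Print one line per recorded step. */
static void dump_trace(void)
{
    static const char *const names[] = {
        "ctxt_switch_from", "ctxt_switch_to", "write_ptbase"
    };
    for (int i = 0; i < n_events; i++)
        printf("%-16s current=%s ptbase=%s\n", names[trace[i].step],
               trace[i].cur->name, trace[i].ptbase->name);
}
```

Calling toy_start() with vCPU A, then toy_context_switch() with vCPU B, then dump_trace() shows both problems at once: the outgoing hook no longer sees A as 'current', and the incoming hook does not yet see B's page tables.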
While looking into this, I noticed two things that I'm not quite clear
on regarding VCPUOP_register_vcpu_info:
1) How does the storing of vcpu_info_mfn in the hypervisor survive
migration or save/restore? The mainline Linux code, which uses this
hypercall, doesn't appear to make any attempt to revert to using the
default location during suspend or to re-setup the alternate location
during resume (but of course I'm not sure that guest is save/restore/
migrate ready in the first place). I would imagine it to be at least
difficult for the guest to manage its state post resume without the
hypervisor having restored the previously established alternative
placement.
2) The implementation in the hypervisor seems to have added yet another
scalability issue (on 32-bit), as this is being carried out using
map_domain_page_global() - if there are sufficiently many guests with
sufficiently many vCPU-s, there just won't be any space left at some
point. This worries me especially in the context of seeing a call to
sh_map_domain_page_global() that is followed by a BUG_ON() checking
whether the call failed.
Jan
* Re: x86's context switch ordering of operations
2008-04-29 12:39 x86's context switch ordering of operations Jan Beulich
@ 2008-04-29 12:50 ` Keir Fraser
2008-04-29 13:39 ` Jan Beulich
2008-04-29 17:03 ` Jeremy Fitzhardinge
1 sibling, 1 reply; 7+ messages in thread
From: Keir Fraser @ 2008-04-29 12:50 UTC
To: Jan Beulich, xen-devel
On 29/4/08 13:39, "Jan Beulich" <jbeulich@novell.com> wrote:
> To do so, I was considering using {un,}map_domain_page() from
> the context switch path, but there are two major problems with the
> ordering of operations:
> - for the outgoing task, 'current' is being changed before the
> ctxt_switch_from() hook is being called
> - for the incoming task, write_ptbase() happens only after the
> ctxt_switch_to() hook was already called
> I'm wondering whether there are hidden dependencies that require
> this particular (somewhat non-natural) ordering.
ctxt_switch_{from,to} exist only in x86 Xen and are called from a single
hook point out from the common scheduler. Thus either they both happen
before, or both happen after, current is changed by the common scheduler. It
took a while for the scheduler interfaces to settle down to something both
x86 and ia64 were happy with so I'm not particularly excited about revisiting
them. I'm not sure why you'd want to map_domain_page() on context switch
anyway. The map_domain_page() 32-bit implementation is inherently per-domain
already.
> 1) How does the storing of vcpu_info_mfn in the hypervisor survive
> migration or save/restore? The mainline Linux code, which uses this
> hypercall, doesn't appear to make any attempt to revert to using the
> default location during suspend or to re-setup the alternate location
> during resume (but of course I'm not sure that guest is save/restore/
> migrate ready in the first place). I would imagine it to be at least
> difficult for the guest to manage its state post resume without the
> hypervisor having restored the previously established alternative
> placement.
I don't see that it would be hard for the guest to do it itself before
bringing back all VCPUs (either by bringing them up or by exiting the
stopmachine state). Is save/restore even supported by pv_ops kernels yet?
> 2) The implementation in the hypervisor seems to have added yet another
> scalability issue (on 32-bit), as this is being carried out using
> map_domain_page_global() - if there are sufficiently many guests with
> sufficiently many vCPU-s, there just won't be any space left at some
> point. This worries me especially in the context of seeing a call to
> sh_map_domain_page_global() that is followed by a BUG_ON() checking
> whether the call failed.
The hypervisor generally assumes that vcpu_info's are permanently and
globally mapped. That obviously places an unavoidable scalability limit for
32-bit Xen. I have no problem with telling people who are concerned about
the limit to use 64-bit Xen instead.
-- Keir
* Re: x86's context switch ordering of operations
2008-04-29 12:50 ` Keir Fraser
@ 2008-04-29 13:39 ` Jan Beulich
2008-04-29 13:58 ` Keir Fraser
0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2008-04-29 13:39 UTC
To: Keir Fraser; +Cc: xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 29.04.08 14:50 >>>
>On 29/4/08 13:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>
>> To do so, I was considering using {un,}map_domain_page() from
>> the context switch path, but there are two major problems with the
>> ordering of operations:
>> - for the outgoing task, 'current' is being changed before the
>> ctxt_switch_from() hook is being called
>> - for the incoming task, write_ptbase() happens only after the
>> ctxt_switch_to() hook was already called
>> I'm wondering whether there are hidden dependencies that require
>> this particular (somewhat non-natural) ordering.
>
>ctxt_switch_{from,to} exist only in x86 Xen and are called from a single
>hook point out from the common scheduler. Thus either they both happen
>before, or both happen after, current is changed by the common scheduler. It
Maybe I'm mistaken (or it is being done twice with no good reason), but
I see a set_current(next) in x86's context_switch() ...
>took a while for the scheduler interfaces to settle down to something both
>x86 and ia64 were happy with so I'm not particularly excited about revisiting
>them. I'm not sure why you'd want to map_domain_page() on context switch
>anyway. The map_domain_page() 32-bit implementation is inherently per-domain
>already.
If pages mapped that way survive context switches, then it would
certainly be possible to map them once and keep them until no longer
needed. Doing this during context switch was more an attempt to
conserve virtual address space (so that other vCPU-s of the same guest
not using this functionality would have less chance of running out
of space). The background is that I think it'll also be necessary
to extend MAX_VIRT_CPUS beyond 32 at some not too distant point
(at least in dom0 for CPU frequency management - or do you have
another scheme in mind for dealing with systems having more than
32 CPU threads?), resulting in more pressure on the address space.
>> 2) The implementation in the hypervisor seems to have added yet another
>> scalability issue (on 32-bit), as this is being carried out using
>> map_domain_page_global() - if there are sufficiently many guests with
>> sufficiently many vCPU-s, there just won't be any space left at some
>> point. This worries me especially in the context of seeing a call to
>> sh_map_domain_page_global() that is followed by a BUG_ON() checking
>> whether the call failed.
>
>The hypervisor generally assumes that vcpu_info's are permanently and
>globally mapped. That obviously places an unavoidable scalability limit for
>32-bit Xen. I have no problem with telling people who are concerned about
>the limit to use 64-bit Xen instead.
I know your position here, but have all 32-on-64 migration/save/restore
issues been resolved in the meantime (that is, can the tools now deal with
domains of either size, no matter whether dom0 is 32- or 64-bit)? If
not, there may be reasons beyond needing vm86 mode that
might force people to stay with 32-bit Xen. (I certainly agree that there
are unavoidable limitations, but obviously there is a big difference
between requiring 64 bytes and 4k per vCPU for this particular
functionality.)
Jan
* Re: x86's context switch ordering of operations
2008-04-29 13:39 ` Jan Beulich
@ 2008-04-29 13:58 ` Keir Fraser
2008-04-29 15:37 ` Jan Beulich
0 siblings, 1 reply; 7+ messages in thread
From: Keir Fraser @ 2008-04-29 13:58 UTC
To: Jan Beulich; +Cc: xen-devel
On 29/4/08 14:39, "Jan Beulich" <jbeulich@novell.com> wrote:
>> ctxt_switch_{from,to} exist only in x86 Xen and are called from a single
>> hook point out from the common scheduler. Thus either they both happen
>> before, or both happen after, current is changed by the common scheduler. It
>
> Maybe I'm mistaken (or it is being done twice with no good reason), but
> I see a set_current(next) in x86's context_switch() ...
Um, good point, I'd forgotten exactly how the code fitted together. Anyhow,
the reason you see ctxt_switch_{from,to} happening after set_current() is
because context_switch() and __context_switch() can actually be decoupled.
When switching to the idle vcpu we run context_switch() but we do not run
__context_switch().
> If pages mapped that way survive context switches, then it would
> certainly be possible to map them once and keep them until no longer
> needed. Doing this during context switch was more as an attempt to
> conserve on virtual address use (so other vCPU-s of the same guest
> not using this functionality would have less chances of running out
> of space). The background is that I think that it'll also be necessary
> to extend MAX_VIRT_CPUS beyond 32 at some not too distant point
> (at least in dom0 for CPU frequency management - or do you have
> another scheme in mind how to deal with systems having more than
> 32 CPU threads), resulting in more pressure on the address space.
I'm hoping that Intel's patches to allow uniproc dom0 to perform multiproc
Cx and Px state management will be acceptable. Apart from that, yes we may
have to increase MAX_VIRT_CPUS.
> I know your position here, but - are all 32-on-64 migration/save/restore
> issues meanwhile resolved (that is, can the tools meanwhile deal with
> either size domains no matter whether using a 32- or 64-bit dom0)? If
> not, there may be reasons beyond that of needing vm86 mode that
> might force people to stay with 32-bit Xen. (I certainly agree that there
> are unavoidable limitations, but obviously there is a big difference
> between requiring 64 bytes and 4k per vCPU for this particular
> functionality.)
I don't really see a few kilobytes of overhead per vcpu as very significant.
Given the limitations of the map_domain_page_global() address space, we're
limiting ourselves to probably around 700-800 vcpus. That's quite a lot imo!
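To put rough numbers on the 64-byte versus 4k figures quoted above, here is a small arithmetic sketch in C. The helper names are hypothetical, and the packed layout is purely a thought experiment for comparison: nothing in this thread says the hypervisor can pack guest-chosen vcpu_info locations this way.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE      4096u  /* the "4k" figure: one mapped page per vCPU */
#define VCPU_INFO_SIZE 64u    /* the "64 bytes" figure from the thread */

/* Virtual address space consumed when every vCPU pins one whole page. */
static size_t va_bytes_page_per_vcpu(size_t nr_vcpus)
{
    return nr_vcpus * PAGE_SIZE;
}

/* Hypothetical comparison: the same structures packed back to back. */
static size_t va_bytes_packed(size_t nr_vcpus)
{
    assert(PAGE_SIZE % VCPU_INFO_SIZE == 0);
    size_t per_page = PAGE_SIZE / VCPU_INFO_SIZE;          /* 64 per page */
    size_t pages = (nr_vcpus + per_page - 1) / per_page;   /* round up */
    return pages * PAGE_SIZE;
}
```

For 750 vCPUs (the middle of the 700-800 estimate), va_bytes_page_per_vcpu() gives 3,072,000 bytes of mapping space, against 49,152 bytes (12 pages) for the packed case.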
I'm not sure on our position regarding 32-on-64 save/restore compatibility.
Tim Deegan made some patches a while ago, but that was mainly focused on
correctly saving 64-bit HVM domUs from a 32-bit dom0. I also know that
Oracle had some patches they floated a while ago. I don't think they ever got
posted for inclusion into xen-unstable though. *However* I do know that I'd
rather we spent time fixing 32-on-64 save/restore compatibility than
fretting about and optimising 32-bit Xen scalability. The former has greater
long-term usefulness.
-- Keir
* Re: x86's context switch ordering of operations
2008-04-29 13:58 ` Keir Fraser
@ 2008-04-29 15:37 ` Jan Beulich
2008-04-29 16:52 ` Keir Fraser
0 siblings, 1 reply; 7+ messages in thread
From: Jan Beulich @ 2008-04-29 15:37 UTC
To: Keir Fraser; +Cc: xen-devel
>>> Keir Fraser <keir.fraser@eu.citrix.com> 29.04.08 15:58 >>>
>Um, good point, I'd forgotten exactly how the code fitted together. Anyhow,
>the reason you see ctxt_switch_{from,to} happening after set_current() is
>because context_switch() and __context_switch() can actually be decoupled.
>When switching to the idle vcpu we run context_switch() but we do not run
>__context_switch().
Okay, that could be easily dealt with by doing set_current() explicitly
in the switch-to-idle case, and moving it into __context_switch() in
the other cases.
Any word on the significance of doing write_ptbase() after calling
ctxt_switch_to()?
Jan
* Re: x86's context switch ordering of operations
2008-04-29 15:37 ` Jan Beulich
@ 2008-04-29 16:52 ` Keir Fraser
0 siblings, 0 replies; 7+ messages in thread
From: Keir Fraser @ 2008-04-29 16:52 UTC
To: Jan Beulich; +Cc: xen-devel
On 29/4/08 16:37, "Jan Beulich" <jbeulich@novell.com> wrote:
> Okay, that could be easily dealt with by doing set_current() explicitly
> in the switch-to-idle case, and moving it into __context_switch() in
> the other cases.
It wouldn't really help you. If you switch from VCPU A to idle and then
to VCPU B, you would still end up calling ctxt_switch_from(A) when
current==idle.
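The A -> idle -> B case can be sketched with a toy C model of the proposed variant (set_current() done explicitly for idle, and inside the full switch otherwise). All names here (toy_start(), toy_full_switch(), state_owner and so on) are illustrative assumptions rather than Xen code; the model only captures the idea that switching to idle leaves the previous vCPU's state loaded.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: none of this is Xen code. */
struct vcpu { const char *name; int is_idle; };

static struct vcpu *cur;          /* stands in for 'current' */
static struct vcpu *state_owner;  /* vCPU whose state is still loaded on the CPU */

static struct vcpu *from_arg;     /* argument of the last ctxt_switch_from() */
static struct vcpu *cur_at_from;  /* value of 'current' during that call */
static int from_calls;

static void toy_start(struct vcpu *running)
{
    cur = state_owner = running;
    from_arg = cur_at_from = NULL;
    from_calls = 0;
}

/* Records what the outgoing hook would observe. */
static void toy_ctxt_switch_from(struct vcpu *v)
{
    from_arg = v;
    cur_at_from = cur;
    from_calls++;
}

/* Models the full switch, with set_current() moved inside as proposed. */
static void toy_full_switch(struct vcpu *next)
{
    if (state_owner != next)
        toy_ctxt_switch_from(state_owner);  /* runs before 'current' changes */
    cur = next;
    state_owner = next;
}

static void toy_context_switch(struct vcpu *next)
{
    if (next->is_idle) {
        cur = next;   /* explicit set_current(); loaded state is left alone */
        return;
    }
    toy_full_switch(next);
}
```

In this model a direct A -> B switch calls the outgoing hook with current == A, but A -> idle -> B defers the hook until the switch to B, by which time current is already idle.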
> Any word on the significance of doing write_ptbase() after calling
> ctxt_switch_to()?
It probably could be done earlier.
-- Keir
* Re: x86's context switch ordering of operations
2008-04-29 12:39 x86's context switch ordering of operations Jan Beulich
2008-04-29 12:50 ` Keir Fraser
@ 2008-04-29 17:03 ` Jeremy Fitzhardinge
1 sibling, 0 replies; 7+ messages in thread
From: Jeremy Fitzhardinge @ 2008-04-29 17:03 UTC
To: Jan Beulich; +Cc: xen-devel
Jan Beulich wrote:
> 1) How does the storing of vcpu_info_mfn in the hypervisor survive
> migration or save/restore? The mainline Linux code, which uses this
> hypercall, doesn't appear to make any attempt to revert to using the
> default location during suspend or to re-setup the alternate location
> during resume (but of course I'm not sure that guest is save/restore/
> migrate ready in the first place). I would imagine it to be at least
> difficult for the guest to manage its state post resume without the
> hypervisor having restored the previously established alternative
> placement.
>
The only kernel which uses it is 32-on-32 pvops, and that doesn't
currently support migration. It would be easy for the guest to restore
that state for itself shortly after resuming.
I still need to add 32-on-64 and 64-on-64 implementations for this.
Just haven't looked at it yet.
> 2) The implementation in the hypervisor seems to have added yet another
> scalability issue (on 32-bit), as this is being carried out using
> map_domain_page_global() - if there are sufficiently many guests with
> sufficiently many vCPU-s, there just won't be any space left at some
> point. This worries me especially in the context of seeing a call to
> sh_map_domain_page_global() that is followed by a BUG_ON() checking
> whether the call failed.
>
Yes, we discussed it, and, erm, don't do that. Guests should be able to
deal with VCPUOP_register_vcpu_info failing, but that doesn't address
overall heap starvation.
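As a rough illustration of a guest coping with VCPUOP_register_vcpu_info failing, here is a sketch in C with the hypercall replaced by a stub. The structures are heavily trimmed stand-ins, and setup_vcpu_info(), register_fn and the two stubs are hypothetical names; this is not the pv_ops implementation, only the fall-back-to-the-default-location idea.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VCPUS 4

/* Heavily trimmed stand-ins for the real shared structures. */
struct vcpu_info { unsigned char evtchn_upcall_pending; };
struct shared_info { struct vcpu_info vcpu_info[MAX_VCPUS]; };

static struct shared_info shared;              /* default location */
static struct vcpu_info relocated[MAX_VCPUS];  /* guest-chosen alternate location */
static struct vcpu_info *active[MAX_VCPUS];    /* what the guest actually uses */

/* Stub type standing in for the registration hypercall; 0 means success. */
typedef int (*register_fn)(unsigned int cpu, struct vcpu_info *where);

static int always_ok(unsigned int cpu, struct vcpu_info *where)
{
    (void)cpu; (void)where;
    return 0;
}

static int always_fail(unsigned int cpu, struct vcpu_info *where)
{
    (void)cpu; (void)where;
    return -1;   /* e.g. the hypervisor ran out of mapping space */
}

/* Try the relocated area first; fall back to shared_info on failure. */
static void setup_vcpu_info(unsigned int cpu, register_fn try_register)
{
    assert(cpu < MAX_VCPUS);
    if (try_register(cpu, &relocated[cpu]) == 0)
        active[cpu] = &relocated[cpu];
    else
        active[cpu] = &shared.vcpu_info[cpu];
}
```

A guest structured like this keeps working when registration fails, though as noted above that does nothing about the mapping space already consumed by other guests.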
J