public inbox for kvm@vger.kernel.org
* kvm vmload/vmsave vs tss.ist
@ 2008-12-25 14:59 Avi Kivity
  2008-12-25 15:17 ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2008-12-25 14:59 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Joerg Roedel, Benjamin Serebrin
  Cc: linux-kernel, kvm, Alexander Graf

kvm performance is largely dependent on the frequency and cost of 
switches between guest and host mode.  The cost of a switch is greatly 
influenced by the amount of state we have to load and save.

One of the optimizations that kvm makes in order to reduce the cost is 
to partition the guest state into two; let's call the two parts kernel 
state and user state.  The kernel state consists of registers that are 
used for general kernel execution, for example the general purpose 
registers.  User state consists of registers that are only used in user 
mode (or in the transition to user mode).  When switching from guest to 
host, we only save and reload the kernel state, delaying reloading of 
user state until we actually need to switch to user mode.  Since many 
exits are satisfied entirely in the kernel, we can avoid switching user 
state entirely.  In effect the host kernel runs with some of the cpu 
registers containing guest values.  The mechanism used for deferring 
state switch is PREEMPT_NOTIFIERS, introduced in 2.6.23 IIRC.

Now, AMD SVM instructions also partition register state into two.  The 
VMRUN instruction, which is used to switch to guest mode, loads and 
saves registers corresponding to kernel state.  The VMLOAD and VMSAVE 
instructions load and save user state registers.

The exact registers managed by VMLOAD and VMSAVE are:

  FS GS TR LDTR
  KernelGSBase
  STAR LSTAR CSTAR SFMASK
  SYSENTER_CS SYSENTER_ESP SYSENTER_EIP

None of these registers are ever touched in 64-bit kernel mode, except 
gs.base (which we can save and restore manually) and TR.  The only parts 
of the TSS (pointed to by TR) used in 64-bit mode are the seven 
Interrupt Stack Table (IST) entries.  These are used to provide 
known-good stacks for critical exceptions.

These critical exceptions are: debug, NMI, double fault, stack fault, 
and machine check.

Because of this one detail, kvm must execute vmload/vmsave on every 
guest/host switch. Hardware architects, give yourself a pat on the back.

The impact is even greater when using nested virtualization, since we 
must trap on two additional instructions on every switch.

I would like to remove this limitation.  I see several ways to go about it:

1. Drop the use of IST

This would reduce the (perceived) reliability of the kernel and would 
probably not be welcomed.

2. Introduce a config item for dropping IST, and have kvm defer 
vmload/vmsave depending on the configuration

This would pose a dilemma for kitchen-sink distro kernels: kvm 
performance or maximum reliability?

3. Switch off IST when the first VM is created, switch it back on when 
the last VM is destroyed

Most likely no other code would need to be modified.  It could be 
made conditional if someone wants to retain IST even while kvm is 
active.  We already have hooks in place and know where the host IST is.  
I favor this option.

4. Some other brilliant idea?

Might be even better than option 3.

hpa/Ingo, any opinions?


-- 
error compiling committee.c: too many arguments to function



* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 14:59 kvm vmload/vmsave vs tss.ist Avi Kivity
@ 2008-12-25 15:17 ` Ingo Molnar
  2008-12-25 15:46   ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-12-25 15:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf


* Avi Kivity <avi@redhat.com> wrote:

> I would like to remove this limitation.  I see several ways to go about 
> it:
>
> 1. Drop the use of IST
>
> This would reduce the (perceived) reliability of the kernel and would 
> probably not be welcomed.

> hpa/Ingo, any opinions?

i think we should actually do #1 unconditionally.

ISTs are bad for the native kernel too. They have various nasty 
complications in the stack walker (and hence they _reduce_ reliability in 
practice), and they are non-preemptible as well. Plus we have the 
maximum-stack-footprint ftrace plugin now, which can dispel any 
misconceptions about how bad the worst-case stack footprint is in practice.

If it ever becomes an issue we could also soft-switch to a larger (per 
CPU) exception stack from the exception handlers themselves. The 
architectural stack footprint of the various critical exceptions is 
calculable and low - so we could switch away and get almost the kind of 
separation that ISTs give. There's no deep reason to actually make use of 
hw-switched ISTs.

So feel free to send a patch that just standardizes the critical 
exceptions to use the regular kernel stack. (I haven't actually tried this 
but it should be relatively simple to implement. Roadblocks are possible.)

	Ingo


* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 15:17 ` Ingo Molnar
@ 2008-12-25 15:46   ` Avi Kivity
  2008-12-25 16:21     ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2008-12-25 15:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf

Ingo Molnar wrote:
> i think we should actually do #1 unconditionally.
>
> ISTs are bad for the native kernel too. They have various nasty 
> complications in the stack walker (and hence they _reduce_ reliability in 
> practice), and they are non-preemptible as well. Plus we have the 
> maximum-stack-footprint ftrace plugin now, which can dispel any 
> misconceptions about how bad the worst-case stack footprint is in practice.
>
> If it ever becomes an issue we could also soft-switch to a larger (per 
> CPU) exception stack from the exception handlers themselves. The 
> architectural stack footprint of the various critical exceptions is 
> calculable and low - so we could switch away and get almost the kind of 
> separation that ISTs give. There's no deep reason to actually make use of 
> hw-switched ISTs.
>
> So feel free to send a patch that just standardizes the critical 
> exceptions to use the regular kernel stack. (I haven't actually tried this 
> but it should be relatively simple to implement. Roadblocks are possible.)
>   

Certainly.  There is provision for a debug stack that can be larger than 
the normal exception stack.  This is used for vectors 1 and 3.  If we 
wish to preserve this, we need to do manual stack switching.

Currently DEBUG_STKSZ is 8K, the same as the normal stack (compared to 
4K for the other exception stacks).  Do we need to implement stack 
switching for debug vectors?

-- 
error compiling committee.c: too many arguments to function



* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 15:46   ` Avi Kivity
@ 2008-12-25 16:21     ` Ingo Molnar
  2008-12-25 16:42       ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-12-25 16:21 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum


* Avi Kivity <avi@redhat.com> wrote:

> Ingo Molnar wrote:
>> i think we should actually do #1 unconditionally.
>>
>> ISTs are bad for the native kernel too. They have various nasty  
>> complications in the stack walker (and hence they _reduce_ reliability 
>> in practice), and they are non-preemptible as well. Plus we have the  
>> maximum-stack-footprint ftrace plugin now, which can dispel any 
>> misconceptions about how bad the worst-case stack footprint is in practice.
>>
>> If it ever becomes an issue we could also soft-switch to a larger (per  
>> CPU) exception stack from the exception handlers themselves. The  
>> architectural stack footprint of the various critical exceptions is  
>> calculable and low - so we could switch away and get almost the kind 
>> of separation that ISTs give. There's no deep reason to actually make 
>> use of hw-switched ISTs.
>>
>> So feel free to send a patch that just standardizes the critical  
>> exceptions to use the regular kernel stack. (I haven't actually tried 
>> this but it should be relatively simple to implement. Roadblocks are 
>> possible.)
>>   
>
> Certainly.  There is provision for a debug stack that can be larger than 
> the normal exception stack.  This is used for vectors 1 and 3.  If we 
> wish to preserve this, we need to do manual stack switching.
>
> Currently DEBUG_STKSZ is 8K, the same as the normal stack (compared to 
> 4K for the other exception stacks).  Do we need to implement stack 
> switching for debug vectors?

i'd suggest to reuse the irq-stacks for this. Right now on 64-bit we've 
got the following stack layout: 8K process stacks, a 16K IRQ stack on each 
CPU, shared by all IRQs. Then we have the IST stacks with weird sizes: 
debug:8K, the others: 4K.

Then all the unnecessary IST complications can be removed. If nesting ever 
becomes an issue, the IRQ stack size can be doubled to 32K.

This way we save some small amount of RAM too (right now the IST stacks 
take up 28K of RAM per CPU), and reduce complexity and fragility quite 
visibly. And help KVM ;-)

	Ingo


* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 16:21     ` Ingo Molnar
@ 2008-12-25 16:42       ` Ingo Molnar
  2008-12-25 17:40         ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-12-25 16:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum


* Ingo Molnar <mingo@elte.hu> wrote:

> i'd suggest to reuse the irq-stacks for this. Right now on 64-bit we've 
> got the following stack layout: 8K process stacks, a 16K IRQ stack on 
> each CPU, shared by all IRQs. Then we have the IST stacks with weird 
> sizes: debug:8K, the others: 4K.

this has to be done carefully though, as there's a subtle detail here: 
right now the pda_irqcount and the pda_irqstackptr logic in entry_64.S is 
not re-entry safe and relies on IRQs being off.

If critical exceptions are moved to the IRQ stack then %rsp switching to 
the IRQ stack has to be done atomically: instead of using the pda_irqcount 
check the %rsp value itself should be checked against pda_irqstackptr - if 
it's within that 16K range then we are already on the IRQ stack and do not 
need to switch to it but can just use the current %rsp.

	Ingo


* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 16:42       ` Ingo Molnar
@ 2008-12-25 17:40         ` Avi Kivity
  2008-12-25 17:58           ` Ingo Molnar
  0 siblings, 1 reply; 10+ messages in thread
From: Avi Kivity @ 2008-12-25 17:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum

Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
>
>   
>> i'd suggest to reuse the irq-stacks for this. Right now on 64-bit we've 
>> got the following stack layout: 8K process stacks, a 16K IRQ stack on 
>> each CPU, shared by all IRQs. Then we have the IST stacks with weird 
>> sizes: debug:8K, the others: 4K.
>>     
>
> this has to be done carefully though, as there's a subtle detail here: 
> right now the pda_irqcount and the pda_irqstackptr logic in entry_64.S is 
> not re-entry safe and relies on IRQs being off.
>
> If critical exceptions are moved to the IRQ stack then %rsp switching to 
> the IRQ stack has to be done atomically: instead of using the pda_irqcount 
> check the %rsp value itself should be checked against pda_irqstackptr - if 
> it's within that 16K range then we are already on the IRQ stack and do not 
> need to switch to it but can just use the current %rsp.
>   

I think it's enough to switch %rsp before incrementing irqcount, no?

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 17:40         ` Avi Kivity
@ 2008-12-25 17:58           ` Ingo Molnar
  2008-12-25 18:12             ` Avi Kivity
  0 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2008-12-25 17:58 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum


* Avi Kivity <avi@redhat.com> wrote:

> Ingo Molnar wrote:
>> * Ingo Molnar <mingo@elte.hu> wrote:
>>
>>   
>>> i'd suggest to reuse the irq-stacks for this. Right now on 64-bit 
>>> we've got the following stack layout: 8K process stacks, a 16K IRQ 
>>> stack on each CPU, shared by all IRQs. Then we have the IST stacks 
>>> with weird sizes: debug:8K, the others: 4K.
>>>     
>>
>> this has to be done carefully though, as there's a subtle detail here:  
>> right now the pda_irqcount and the pda_irqstackptr logic in entry_64.S 
>> is not re-entry safe and relies on IRQs being off.
>>
>> If critical exceptions are moved to the IRQ stack then %rsp switching 
>> to the IRQ stack has to be done atomically: instead of using the 
>> pda_irqcount check the %rsp value itself should be checked against 
>> pda_irqstackptr - if it's within that 16K range then we are already on 
>> the IRQ stack and do not need to switch to it but can just use the 
>> current %rsp.
>>   
>
> I think it's enough to switch %rsp before incrementing irqcount, no?

no - that would introduce a small race: if an exception (say an NMI or 
MCE, or a debug trap) happens in that small window then the exception 
context thinks that it's on the IRQ stack already, and would use the task 
stack.

So if we want to move them to IRQ stacks all the time, we have to check 
that condition atomically - the safest way of which is to check RSP 
against the (static) pda:[irqstackptr-16K+64..irqstackptr] range.

	Ingo


* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 17:58           ` Ingo Molnar
@ 2008-12-25 18:12             ` Avi Kivity
  2008-12-25 18:18               ` Ingo Molnar
  2008-12-25 18:19               ` Avi Kivity
  0 siblings, 2 replies; 10+ messages in thread
From: Avi Kivity @ 2008-12-25 18:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum

Ingo Molnar wrote:
>> I think it's enough to switch %rsp before incrementing irqcount, no?
>>     
>
> no - that would introduce a small race: if an exception (say an NMI or 
> MCE, or a debug trap) happens in that small window then the exception 
> context thinks that it's on the IRQ stack already, and would use the task 
> stack.
>
>   

I'm suggesting

    check irqcount
    if (wasnt_in_irq)
        rsp = irqstack
    ++irqcount

If the NMI happens before the increment, we'll switch the stack 
unconditionally, and if the NMI happens after the increment, then we 
won't switch the stack, but we're guaranteed to be on the irqstack 
anyway.  The window size is negative :)

Similarly, the exit path should be

    oldstack_reg = oldstack;
    --irqcount;
    rsp = oldstack_reg;

to guarantee that by the time we decrement irqcount, we don't need the 
stack anymore.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 18:12             ` Avi Kivity
@ 2008-12-25 18:18               ` Ingo Molnar
  2008-12-25 18:19               ` Avi Kivity
  1 sibling, 0 replies; 10+ messages in thread
From: Ingo Molnar @ 2008-12-25 18:18 UTC (permalink / raw)
  To: Avi Kivity
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum


* Avi Kivity <avi@redhat.com> wrote:

> Ingo Molnar wrote:
>>> I think it's enough to switch %rsp before incrementing irqcount, no?
>>>     
>>
>> no - that would introduce a small race: if an exception (say an NMI or  
>> MCE, or a debug trap) happens in that small window then the exception  
>> context thinks that it's on the IRQ stack already, and would use the 
>> task stack.
>>
>>   
>
> I'm suggesting
>
>    check irqcount
>    if (wasnt_in_irq)
>        rsp = irqstack
>    ++irqcount
>
> If the NMI happens before the increment, we'll switch the stack 
> unconditionally, and if the NMI happens after the increment, then we 
> won't switch the stack, but we're guaranteed to be on the irqstack 
> anyway.  The window size is negative :)
>
> Similarly, the exit path should be
>
>    oldstack_reg = oldstack;
>    --irqcount;
>    rsp = oldstack_reg;
>
> To guarantee that by the time we decrement irqcount, we don't need the  
> stack anymore.

agreed, something like this would work too. My suggestion, to eliminate 
irqcount altogether and just check RSP against the known-irqstack-range, 
could result in slightly smaller (and thus faster) code, but it's a 
marginal difference at best.

	Ingo


* Re: kvm vmload/vmsave vs tss.ist
  2008-12-25 18:12             ` Avi Kivity
  2008-12-25 18:18               ` Ingo Molnar
@ 2008-12-25 18:19               ` Avi Kivity
  1 sibling, 0 replies; 10+ messages in thread
From: Avi Kivity @ 2008-12-25 18:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: H. Peter Anvin, Joerg Roedel, Benjamin Serebrin, linux-kernel,
	kvm, Alexander Graf, Arjan van de Ven, Alexander van Heukelum

Avi Kivity wrote:
>
> I'm suggesting
>
>    check irqcount
>    if (wasnt_in_irq)
>        rsp = irqstack
>    ++irqcount
>
> If the NMI happens before the increment, we'll switch the stack 
> unconditionally, and if the NMI happens after the increment, then we 
> won't switch the stack, but we're guaranteed to be on the irqstack 
> anyway.  The window size is negative :)
>
> Similarly, the exit path should be
>
>    oldstack_reg = oldstack;
>    --irqcount;
>    rsp = oldstack_reg;
>
> To guarantee that by the time we decrement irqcount, we don't need the 
> stack anymore.
>

On the other hand, checking %rsp allows us to drop irqcount completely, 
so maybe it's better.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


