[PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler

xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
@ 2013-11-06  6:41 Zhu Yanhai
  2013-11-06  8:51 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Zhu Yanhai @ 2013-11-06  6:41 UTC (permalink / raw)
  To: xen-devel
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, Wan Jia

As we know Intel X86's CR0.TS is a sticky bit, which means once set
it remains set until cleared by some software routines, in other words,
the exception handler expects the bit is set when it starts to execute.

However xen doesn't simulate this behavior quite well for PV guests -
vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
it's fine since the linux kernel executes the exception handler with
interrupt disabled and a sane #NM handler will clear the bit anyway
before it exits, but there's a catch: if it's the first FPU trap for the process,
the linux kernel must allocate a piece of SLAB memory for it to save
the FPU registers, which opens a schedule window as the memory
allocation might sleep -- and with CR0.TS keeps clear!

[see the code below in linux kernel,

void math_state_restore(void)
{
    struct task_struct *tsk = current;

    if (!tsk_used_math(tsk)) {
        local_irq_enable();
        /*
         * does a slab alloc which can sleep
         */
        if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
            /*
             * ran out of memory!
             */
            do_group_exit(SIGKILL);
            return;
        }
        local_irq_disable();
    }

    __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
                                         after the schedule window

    /*
     * Paranoid restore. send a SIGSEGV if we fail to restore the state.
     */
    if (unlikely(restore_fpu_checking(tsk))) {
        drop_init_fpu(tsk);
        force_sig(SIGSEGV, tsk);
        return;
    }

    tsk->fpu_counter++;
}
]

The check in linux kernel's switch_fpu_prepare() doesn't stts() either because
the current process only gets marked as a FPU user after the schedule window
(the story is a bit different if eagerfpu is enabled, anyway a sane hypervisor
cannot depend on such undetermined things). And then supposing that the new
process scheduled-in wants to touch FPU, nobody will do fxrstor/frstor for it anymore,
conducing to a serious data damage.

Also, The point is everything is fine on linux + baremetal since CR0.TS will
keep set until the interrupted #NM handler got the memory it needs and exits,
so the incomer FPU user will get trapped as it's supposed to be.

The test case is as below,

        buf = malloc(BUF_SIZE);
        if (!buf) {
                fprintf(stderr, "error %s during %s\n",
                        strerror(-err),
                        "malloc");
                return 1;
        }
        memset(buf, IO_PATTERN, BUF_SIZE);
        memset(cmp_buf, IO_PATTERN, BUF_SIZE);

        if (memcmp(buf, cmp_buf, BUF_SIZE)) {
                unsigned long long *ubuf = (unsigned long long *)buf;
                int i;

                for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
                        printf("%d: 0x%llx\n", i, ubuf[i]);

                return 2;
        }

Two shell scripts on each box's dom0 runs above program repeatedly until
the compare fails (so every time the C program is a new fpu user and triggers
memory allocation). we can see the data damage at least once with
xen 4.3 + linux 2.6.32 on ~200 physical machines within two hours.
With xen 4.3 + linux 3.11.6 stable it becomes harder to reproduce
(guess it's because of the eagerfpu feature introduced in linux kernel 3.7)
but it's still possible to come out within about four hours.

The fix here is trying to make xen behave as close to the hardware as possible.

This bug only has effects on PV guests (and including dom0 kernel of course).

Cc: Wan Jia <jia.wanj@alibaba-inc.com>
Cc: Shen Yiben <zituan@taobao.com>
Cc: Charles Wang <muming.wq@taobao.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
---
 xen/arch/x86/traps.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 77c200b..b0321a6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3267,8 +3267,8 @@ void do_device_not_available(struct cpu_user_regs *regs)

     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
     {
+        stts();
         do_guest_trap(TRAP_no_device, regs, 0);
-        curr->arch.pv_vcpu.ctrlreg[0] &= ~X86_CR0_TS;
     }
     else
         TRACE_0D(TRC_PV_MATH_STATE_RESTORE);
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  6:41 [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Zhu Yanhai
@ 2013-11-06  8:51 ` Jan Beulich
  2013-11-06  9:15   ` Zhu Yanhai
  2014-02-27  0:04   ` Matt Wilson
  2014-02-27 12:21 ` George Dunlap
  2015-09-11 16:50 ` Konrad Rzeszutek Wilk
  2 siblings, 2 replies; 11+ messages in thread
From: Jan Beulich @ 2013-11-06  8:51 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, David Vrabel, xen-devel, Boris Ostrovsky,
	Wan Jia

>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.

Since when would that be the case? CR0.TS is entirely unaffected
by exception invocations according to all I know. All that is known
here is that #NM wouldn't have occurred in the first place if CR0.TS
was clear.

> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally 
> speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the 
> process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
> 
> [see the code below in linux kernel,

You're apparently referring to the pvops kernel.

> void math_state_restore(void)
> {
>     struct task_struct *tsk = current;
> 
>     if (!tsk_used_math(tsk)) {
>         local_irq_enable();
>         /*
>          * does a slab alloc which can sleep
>          */
>         if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
>             /*
>              * ran out of memory!
>              */
>             do_group_exit(SIGKILL);
>             return;
>         }
>         local_irq_disable();
>     }
> 
>     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
>                                          after the schedule window
> 
>     /*
>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>      */
>     if (unlikely(restore_fpu_checking(tsk))) {
>         drop_init_fpu(tsk);
>         force_sig(SIGSEGV, tsk);
>         return;
>     }
> 
>     tsk->fpu_counter++;
> }
> ]

May I direct your attention to the XenoLinux one:

asmlinkage void math_state_restore(void)
{
	struct task_struct *me = current;

	/* NB. 'clts' is done for us by Xen during virtual trap. */
	__get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
	if (!used_math())
		init_fpu(me);
	restore_fpu_checking(&me->thread.i387.fxsave);
	task_thread_info(me)->status |= TS_USEDFPU;
}

Note the comment close to the beginning - the fact that CR0.TS
is clear at exception handler entry is actually part of the PV ABI,
i.e. by altering hypervisor behavior here you break all forward
ported kernels.

Nevertheless I agree that there is an issue, but this needs to be
fixed on the Linux side (hence adding the Linux maintainers to Cc);
this issue was introduced way back in 2.6.26 (before that there
was no allocation on that path). It's not clear though whether
using GFP_ATOMIC for the allocation would be preferable over
stts() before calling the allocation function (and clts() if it
succeeded), or whether perhaps to defer the stts() until we
actually know the task is being switched out. It's going to be an
ugly, Xen-specific hack in any event.

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  8:51 ` Jan Beulich
@ 2013-11-06  9:15   ` Zhu Yanhai
  2013-11-06  9:28     ` Jan Beulich
  2014-02-27  0:04   ` Matt Wilson
  1 sibling, 1 reply; 11+ messages in thread
From: Zhu Yanhai @ 2013-11-06  9:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, David Vrabel, xen-devel, Boris Ostrovsky,
	Wan Jia

2013/11/6 Jan Beulich <JBeulich@suse.com>:
>>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
>> As we know Intel X86's CR0.TS is a sticky bit, which means once set
>> it remains set until cleared by some software routines, in other words,
>> the exception handler expects the bit is set when it starts to execute.
>
> Since when would that be the case? CR0.TS is entirely unaffected
> by exception invocations according to all I know. All that is known
> here is that #NM wouldn't have occurred in the first place if CR0.TS
> was clear.

Yes, you are right, nevertheless IMHO no exception handlers in the
real world do not clear this bit.

>
>> However xen doesn't simulate this behavior quite well for PV guests -
>> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
>> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally
>> speaking
>> it's fine since the linux kernel executes the exception handler with
>> interrupt disabled and a sane #NM handler will clear the bit anyway
>> before it exits, but there's a catch: if it's the first FPU trap for the
>> process,
>> the linux kernel must allocate a piece of SLAB memory for it to save
>> the FPU registers, which opens a schedule window as the memory
>> allocation might sleep -- and with CR0.TS keeps clear!
>>
>> [see the code below in linux kernel,
>
> You're apparently referring to the pvops kernel.

Yes, it's from Linus tree with tag 3.11.

>
>> void math_state_restore(void)
>> {
>>     struct task_struct *tsk = current;
>>
>>     if (!tsk_used_math(tsk)) {
>>         local_irq_enable();
>>         /*
>>          * does a slab alloc which can sleep
>>          */
>>         if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
>>             /*
>>              * ran out of memory!
>>              */
>>             do_group_exit(SIGKILL);
>>             return;
>>         }
>>         local_irq_disable();
>>     }
>>
>>     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
>>                                          after the schedule window
>>
>>     /*
>>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>>      */
>>     if (unlikely(restore_fpu_checking(tsk))) {
>>         drop_init_fpu(tsk);
>>         force_sig(SIGSEGV, tsk);
>>         return;
>>     }
>>
>>     tsk->fpu_counter++;
>> }
>> ]
>
> May I direct your attention to the XenoLinux one:
>
> asmlinkage void math_state_restore(void)
> {
>         struct task_struct *me = current;
>
>         /* NB. 'clts' is done for us by Xen during virtual trap. */
>         __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>         if (!used_math())
>                 init_fpu(me);
>         restore_fpu_checking(&me->thread.i387.fxsave);
>         task_thread_info(me)->status |= TS_USEDFPU;
> }
>
> Note the comment close to the beginning - the fact that CR0.TS
> is clear at exception handler entry is actually part of the PV ABI,
> i.e. by altering hypervisor behavior here you break all forward
> ported kernels.

I see, XenoLinux kernel doesn't sleep in init_fpu() so it doesn't have
this issue. But I wonder why PV ABI decide to clear this bit for the
guest kernel, isn't it better for the guest kernel itself to see bit
set? Since it's more similar with the hardware. I know the ABI cannot
be changed, just for curious.

>
> Nevertheless I agree that there is an issue, but this needs to be
> fixed on the Linux side (hence adding the Linux maintainers to Cc);
> this issue was introduced way back in 2.6.26 (before that there
> was no allocation on that path). It's not clear though whether
> using GFP_ATOMIC for the allocation would be preferable over
> stts() before calling the allocation function (and clts() if it
> succeeded), or whether perhaps to defer the stts() until we
> actually know the task is being switched out. It's going to be an
> ugly, Xen-specific hack in any event.

Yes, it also can be fixed at the linux kernel side. I didn't know such
behavior was part of PV ABI before.

--
Thanks,
Zhu Yanhai

>
> Jan
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  9:15   ` Zhu Yanhai
@ 2013-11-06  9:28     ` Jan Beulich
  0 siblings, 0 replies; 11+ messages in thread
From: Jan Beulich @ 2013-11-06  9:28 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, David Vrabel, xen-devel, Boris Ostrovsky,
	Wan Jia

>>> On 06.11.13 at 10:15, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> 2013/11/6 Jan Beulich <JBeulich@suse.com>:
>> May I direct your attention to the XenoLinux one:
>>
>> asmlinkage void math_state_restore(void)
>> {
>>         struct task_struct *me = current;
>>
>>         /* NB. 'clts' is done for us by Xen during virtual trap. */
>>         __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>>         if (!used_math())
>>                 init_fpu(me);
>>         restore_fpu_checking(&me->thread.i387.fxsave);
>>         task_thread_info(me)->status |= TS_USEDFPU;
>> }
>>
>> Note the comment close to the beginning - the fact that CR0.TS
>> is clear at exception handler entry is actually part of the PV ABI,
>> i.e. by altering hypervisor behavior here you break all forward
>> ported kernels.
> 
> I see, XenoLinux kernel doesn't sleep in init_fpu() so it doesn't have
> this issue. But I wonder why PV ABI decide to clear this bit for the
> guest kernel, isn't it better for the guest kernel itself to see bit
> set? Since it's more similar with the hardware. I know the ABI cannot
> be changed, just for curious.

Quite obvious - performance. Since (as you also confirmed) it is
(almost) guaranteed for the handler to want to clear the bit, we
can save it from having to do another hypercall here. In the
forward ported XenoLinux the change is quite trivial (leaving
aside any optimization, as we're on a rarely used path here
anyway - it's being taken only the first time a process accesses
the FPU): stts() before the local_irq_enable(), and clts() after
the local_irq_disable(). But the x86 maintainers probably won't
like this for pvops...

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  8:51 ` Jan Beulich
  2013-11-06  9:15   ` Zhu Yanhai
@ 2014-02-27  0:04   ` Matt Wilson
  2014-02-27  8:00     ` Jan Beulich
  2014-02-27 12:37     ` David Vrabel
  1 sibling, 2 replies; 11+ messages in thread
From: Matt Wilson @ 2014-02-27  0:04 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Zhu Yanhai, Ian Campbell, George Dunlap, Andrew Cooper,
	Charles Wang, Shen Yiben, David Vrabel, xen-devel, Zhu Yanhai,
	Wan Jia, Boris Ostrovsky

On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> > As we know Intel X86's CR0.TS is a sticky bit, which means once set
> > it remains set until cleared by some software routines, in other words,
> > the exception handler expects the bit is set when it starts to execute.
> 
> Since when would that be the case? CR0.TS is entirely unaffected
> by exception invocations according to all I know. All that is known
> here is that #NM wouldn't have occurred in the first place if CR0.TS
> was clear.
> 
> > However xen doesn't simulate this behavior quite well for PV guests -
> > vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> > so the guest kernel's #NM handler runs with CR0.TS cleared. Generally 
> > speaking
> > it's fine since the linux kernel executes the exception handler with
> > interrupt disabled and a sane #NM handler will clear the bit anyway
> > before it exits, but there's a catch: if it's the first FPU trap for the 
> > process,
> > the linux kernel must allocate a piece of SLAB memory for it to save
> > the FPU registers, which opens a schedule window as the memory
> > allocation might sleep -- and with CR0.TS keeps clear!
> > 
> > [see the code below in linux kernel,
> 
> You're apparently referring to the pvops kernel.
> 
> > void math_state_restore(void)
> > {
> >     struct task_struct *tsk = current;
> > 
> >     if (!tsk_used_math(tsk)) {
> >         local_irq_enable();
> >         /*
> >          * does a slab alloc which can sleep
> >          */
> >         if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
> >             /*
> >              * ran out of memory!
> >              */
> >             do_group_exit(SIGKILL);
> >             return;
> >         }
> >         local_irq_disable();
> >     }
> > 
> >     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
> >                                          after the schedule window
> > 
> >     /*
> >      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
> >      */
> >     if (unlikely(restore_fpu_checking(tsk))) {
> >         drop_init_fpu(tsk);
> >         force_sig(SIGSEGV, tsk);
> >         return;
> >     }
> > 
> >     tsk->fpu_counter++;
> > }
> > ]
> 
> May I direct your attention to the XenoLinux one:
> 
> asmlinkage void math_state_restore(void)
> {
> 	struct task_struct *me = current;
> 
> 	/* NB. 'clts' is done for us by Xen during virtual trap. */
> 	__get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
> 	if (!used_math())
> 		init_fpu(me);
> 	restore_fpu_checking(&me->thread.i387.fxsave);
> 	task_thread_info(me)->status |= TS_USEDFPU;
> }
> 
> Note the comment close to the beginning - the fact that CR0.TS
> is clear at exception handler entry is actually part of the PV ABI,
> i.e. by altering hypervisor behavior here you break all forward
> ported kernels.
> 
> Nevertheless I agree that there is an issue, but this needs to be
> fixed on the Linux side (hence adding the Linux maintainers to Cc);
> this issue was introduced way back in 2.6.26 (before that there
> was no allocation on that path). It's not clear though whether
> using GFP_ATOMIC for the allocation would be preferable over
> stts() before calling the allocation function (and clts() if it
> succeeded), or whether perhaps to defer the stts() until we
> actually know the task is being switched out. It's going to be an
> ugly, Xen-specific hack in any event.

Was there ever a resolution to this problem? I never saw a comment
from the Linux Xen PV maintainers.

--msw

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2014-02-27  0:04   ` Matt Wilson
@ 2014-02-27  8:00     ` Jan Beulich
  2014-02-27 12:46       ` George Dunlap
  2014-02-27 12:37     ` David Vrabel
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Beulich @ 2014-02-27  8:00 UTC (permalink / raw)
  To: Matt Wilson
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, David Vrabel, xen-devel, Zhu Yanhai,
	Wan Jia, Boris Ostrovsky

>>> On 27.02.14 at 01:04, Matt Wilson <msw@linux.com> wrote:
> On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
>> Nevertheless I agree that there is an issue, but this needs to be
>> fixed on the Linux side (hence adding the Linux maintainers to Cc);
>> this issue was introduced way back in 2.6.26 (before that there
>> was no allocation on that path). It's not clear though whether
>> using GFP_ATOMIC for the allocation would be preferable over
>> stts() before calling the allocation function (and clts() if it
>> succeeded), or whether perhaps to defer the stts() until we
>> actually know the task is being switched out. It's going to be an
>> ugly, Xen-specific hack in any event.
> 
> Was there ever a resolution to this problem? I never saw a comment
> from the Linux Xen PV maintainers.

Neither did I, so no, I'm not aware of a solution.

Jan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2014-02-27  8:00     ` Jan Beulich
@ 2014-02-27 12:46       ` George Dunlap
  0 siblings, 0 replies; 11+ messages in thread
From: George Dunlap @ 2014-02-27 12:46 UTC (permalink / raw)
  To: Jan Beulich, Matt Wilson
  Cc: Charles Wang, Ian Campbell, Andrew Cooper, Zhu Yanhai, Shen Yiben,
	David Vrabel, xen-devel, Zhu Yanhai, Wan Jia, Boris Ostrovsky

On 02/27/2014 08:00 AM, Jan Beulich wrote:
>>>> On 27.02.14 at 01:04, Matt Wilson <msw@linux.com> wrote:
>> On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
>>> Nevertheless I agree that there is an issue, but this needs to be
>>> fixed on the Linux side (hence adding the Linux maintainers to Cc);
>>> this issue was introduced way back in 2.6.26 (before that there
>>> was no allocation on that path). It's not clear though whether
>>> using GFP_ATOMIC for the allocation would be preferable over
>>> stts() before calling the allocation function (and clts() if it
>>> succeeded), or whether perhaps to defer the stts() until we
>>> actually know the task is being switched out. It's going to be an
>>> ugly, Xen-specific hack in any event.
>> Was there ever a resolution to this problem? I never saw a comment
>> from the Linux Xen PV maintainers.
> Neither did I, so no, I'm not aware of a solution.

Well we basically have two solutions I think:

1. Add a flag to the guest kernel that requests Xen to keep the TS bit 
set (and eat the extra cost of the trap on clearing it).

2. In the uncommon case of the first use, set the TS  bit again 
(incurring the cost of the extra trap) before calling allocate.

  -George

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2014-02-27  0:04   ` Matt Wilson
  2014-02-27  8:00     ` Jan Beulich
@ 2014-02-27 12:37     ` David Vrabel
  1 sibling, 0 replies; 11+ messages in thread
From: David Vrabel @ 2014-02-27 12:37 UTC (permalink / raw)
  To: Matt Wilson
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, David Vrabel, Jan Beulich, xen-devel,
	Zhu Yanhai, Wan Jia, Boris Ostrovsky

On 27/02/14 00:04, Matt Wilson wrote:
> On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote:
>>>>> On 06.11.13 at 07:41, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
>>> As we know Intel X86's CR0.TS is a sticky bit, which means once set
>>> it remains set until cleared by some software routines, in other words,
>>> the exception handler expects the bit is set when it starts to execute.
>>
>> Since when would that be the case? CR0.TS is entirely unaffected
>> by exception invocations according to all I know. All that is known
>> here is that #NM wouldn't have occurred in the first place if CR0.TS
>> was clear.
>>
>>> However xen doesn't simulate this behavior quite well for PV guests -
>>> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
>>> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally 
>>> speaking
>>> it's fine since the linux kernel executes the exception handler with
>>> interrupt disabled and a sane #NM handler will clear the bit anyway
>>> before it exits, but there's a catch: if it's the first FPU trap for the 
>>> process,
>>> the linux kernel must allocate a piece of SLAB memory for it to save
>>> the FPU registers, which opens a schedule window as the memory
>>> allocation might sleep -- and with CR0.TS keeps clear!
>>>
>>> [see the code below in linux kernel,
>>
>> You're apparently referring to the pvops kernel.
>>
>>> void math_state_restore(void)
>>> {
>>>     struct task_struct *tsk = current;
>>>
>>>     if (!tsk_used_math(tsk)) {
>>>         local_irq_enable();
>>>         /*
>>>          * does a slab alloc which can sleep
>>>          */
>>>         if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
>>>             /*
>>>              * ran out of memory!
>>>              */
>>>             do_group_exit(SIGKILL);
>>>             return;
>>>         }
>>>         local_irq_disable();
>>>     }
>>>
>>>     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
>>>                                          after the schedule window
>>>
>>>     /*
>>>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>>>      */
>>>     if (unlikely(restore_fpu_checking(tsk))) {
>>>         drop_init_fpu(tsk);
>>>         force_sig(SIGSEGV, tsk);
>>>         return;
>>>     }
>>>
>>>     tsk->fpu_counter++;
>>> }
>>> ]
>>
>> May I direct your attention to the XenoLinux one:
>>
>> asmlinkage void math_state_restore(void)
>> {
>> 	struct task_struct *me = current;
>>
>> 	/* NB. 'clts' is done for us by Xen during virtual trap. */
>> 	__get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS;
>> 	if (!used_math())
>> 		init_fpu(me);
>> 	restore_fpu_checking(&me->thread.i387.fxsave);
>> 	task_thread_info(me)->status |= TS_USEDFPU;
>> }
>>
>> Note the comment close to the beginning - the fact that CR0.TS
>> is clear at exception handler entry is actually part of the PV ABI,
>> i.e. by altering hypervisor behavior here you break all forward
>> ported kernels.
>>
>> Nevertheless I agree that there is an issue, but this needs to be
>> fixed on the Linux side (hence adding the Linux maintainers to Cc);
>> this issue was introduced way back in 2.6.26 (before that there
>> was no allocation on that path). It's not clear though whether
>> using GFP_ATOMIC for the allocation would be preferable over
>> stts() before calling the allocation function (and clts() if it
>> succeeded), or whether perhaps to defer the stts() until we
>> actually know the task is being switched out. It's going to be an
>> ugly, Xen-specific hack in any event.
> 
> Was there ever a resolution to this problem? I never saw a comment
> from the Linux Xen PV maintainers.

I think allocating on the context switch is mad and the irq
enable/disable just to allow the allocation looks equally mad.

I had vague plans to maintain a mempool for FPU contexts but couldn't
immediately think how we could guarantee that the pool would be kept
sufficiently populated.

David

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  6:41 [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Zhu Yanhai
  2013-11-06  8:51 ` Jan Beulich
@ 2014-02-27 12:21 ` George Dunlap
  2014-02-27 12:30   ` Processed: " xen
  2015-09-11 16:50 ` Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 11+ messages in thread
From: George Dunlap @ 2014-02-27 12:21 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Zhu Yanhai, xen-devel@lists.xensource.com, Ian Campbell,
	Andrew Cooper, Charles Wang, Shen Yiben, Wan Jia

create ^
title it linux pvops: fpu corruption due to incorrect assumptions
about TS bit after exception under Xen
thanks

On Wed, Nov 6, 2013 at 6:41 AM, Zhu Yanhai <zhu.yanhai@gmail.com> wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
>
> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
>
> [see the code below in linux kernel,
>
> void math_state_restore(void)
> {
>     struct task_struct *tsk = current;
>
>     if (!tsk_used_math(tsk)) {
>         local_irq_enable();
>         /*
>          * does a slab alloc which can sleep
>          */
>         if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
>             /*
>              * ran out of memory!
>              */
>             do_group_exit(SIGKILL);
>             return;
>         }
>         local_irq_disable();
>     }
>
>     __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
>                                          after the schedule window
>
>     /*
>      * Paranoid restore. send a SIGSEGV if we fail to restore the state.
>      */
>     if (unlikely(restore_fpu_checking(tsk))) {
>         drop_init_fpu(tsk);
>         force_sig(SIGSEGV, tsk);
>         return;
>     }
>
>     tsk->fpu_counter++;
> }
> ]
>
> The check in linux kernel's switch_fpu_prepare() doesn't stts() either because
> the current process only gets marked as a FPU user after the schedule window
> (the story is a bit different if eagerfpu is enabled, anyway a sane hypervisor
> cannot depend on such undetermined things). And then supposing that the new
> process scheduled-in wants to touch FPU, nobody will do fxrstor/frstor for it anymore,
> conducing to a serious data damage.
>
> Also, The point is everything is fine on linux + baremetal since CR0.TS will
> keep set until the interrupted #NM handler got the memory it needs and exits,
> so the incomer FPU user will get trapped as it's supposed to be.
>
> The test case is as below,
>
>         buf = malloc(BUF_SIZE);
>         if (!buf) {
>                 fprintf(stderr, "error %s during %s\n",
>                         strerror(-err),
>                         "malloc");
>                 return 1;
>         }
>         memset(buf, IO_PATTERN, BUF_SIZE);
>         memset(cmp_buf, IO_PATTERN, BUF_SIZE);
>
>         if (memcmp(buf, cmp_buf, BUF_SIZE)) {
>                 unsigned long long *ubuf = (unsigned long long *)buf;
>                 int i;
>
>                 for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
>                         printf("%d: 0x%llx\n", i, ubuf[i]);
>
>                 return 2;
>         }
>
> Two shell scripts on each box's dom0 runs above program repeatedly until
> the compare fails (so every time the C program is a new fpu user and triggers
> memory allocation). we can see the data damage at least once with
> xen 4.3 + linux 2.6.32 on ~200 physical machines within two hours.
> With xen 4.3 + linux 3.11.6 stable it becomes harder to reproduce
> (guess it's because of the eagerfpu feature introduced in linux kernel 3.7)
> but it's still possible to come out within about four hours.
>
> The fix here is trying to make xen behave as close to the hardware as possible.
>
> This bug only has effects on PV guests (and including dom0 kernel of course).
>
> Cc: Wan Jia <jia.wanj@alibaba-inc.com>
> Cc: Shen Yiben <zituan@taobao.com>
> Cc: Charles Wang <muming.wq@taobao.com>
> Cc: George Dunlap <George.Dunlap@eu.citrix.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
> ---
>  xen/arch/x86/traps.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 77c200b..b0321a6 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -3267,8 +3267,8 @@ void do_device_not_available(struct cpu_user_regs *regs)
>
>      if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
>      {
> +        stts();
>          do_guest_trap(TRAP_no_device, regs, 0);
> -        curr->arch.pv_vcpu.ctrlreg[0] &= ~X86_CR0_TS;
>      }
>      else
>          TRACE_0D(TRC_PV_MATH_STATE_RESTORE);
> --
> 1.7.4.4
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Processed: Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2014-02-27 12:21 ` George Dunlap
@ 2014-02-27 12:30   ` xen
  0 siblings, 0 replies; 11+ messages in thread
From: xen @ 2014-02-27 12:30 UTC (permalink / raw)
  To: George Dunlap, xen-devel

Processing commands for xen@bugs.xenproject.org:

> create ^
Created new bug #40 rooted at `<1383720072-6242-1-git-send-email-gaoyang.zyh@taobao.com>'
Title: `Re: [Xen-devel] [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler'
> title it linux pvops: fpu corruption due to incorrect assumptions
Set title for #40 to `linux pvops: fpu corruption due to incorrect assumptions'
> about TS bit after exception under Xen
Command failed: Unknown command `about'. at /srv/xen-devel-bugs/lib/emesinae/control.pl line 455, <GEN1> line 3.
Stop processing here.

Modified/created Bugs:
 - 40: http://bugs.xenproject.org/xen/bug/40 (new)

---
Xen Hypervisor Bug Tracker
See http://wiki.xen.org/wiki/Reporting_Bugs_against_Xen for information on reporting bugs
Contact xen-bugs-owner@bugs.xenproject.org with any infrastructure issues

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
  2013-11-06  6:41 [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Zhu Yanhai
  2013-11-06  8:51 ` Jan Beulich
  2014-02-27 12:21 ` George Dunlap
@ 2015-09-11 16:50 ` Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-09-11 16:50 UTC (permalink / raw)
  To: Zhu Yanhai
  Cc: Zhu Yanhai, xen-devel, Ian Campbell, George Dunlap, Andrew Cooper,
	xen, Charles Wang, Shen Yiben, Wan Jia

On Wed, Nov 06, 2013 at 02:41:12PM +0800, Zhu Yanhai wrote:
> As we know Intel X86's CR0.TS is a sticky bit, which means once set
> it remains set until cleared by some software routines, in other words,
> the exception handler expects the bit is set when it starts to execute.
> 
> However xen doesn't simulate this behavior quite well for PV guests -
> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
> it's fine since the linux kernel executes the exception handler with
> interrupt disabled and a sane #NM handler will clear the bit anyway
> before it exits, but there's a catch: if it's the first FPU trap for the process,
> the linux kernel must allocate a piece of SLAB memory for it to save
> the FPU registers, which opens a schedule window as the memory
> allocation might sleep -- and with CR0.TS keeps clear!
> 


With the Ingo's FPU rewrite we haven't been able to retrigger this.
(Tests ran for 2 weeks while they would have failed within
two hours).

And when I dug in this I found the reason:

commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
Author: Dave Hansen <dave@sr71.net>
Date:   Fri Jul 17 12:28:11 2015 +0200

    x86/fpu, sched: Dynamically allocate 'struct fpu'
    
    The FPU rewrite removed the dynamic allocations of 'struct fpu'.
    But, this potentially wastes massive amounts of memory (2k per
    task on systems that do not have AVX-512 for instance).
    
    Instead of having a separate slab, this patch just appends the
    space that we need to the 'task_struct' which we dynamically
    allocate already.  This saves from doing an extra slab
    allocation at fork().

And that when the #NM is called ('do_device_not_available')
it does:

 fpu__restore(&current->thread.fpu); /* interrupts still off */          
   |+- fpu__activate_curr (which just inits the already allocated space)
   |     \- memset(state, 0, xstate_size);
   |+- fpregs_activate
         \- stts()

So there is no scheduling window during this time, while in
kernels prior to Linux 4.2 there was.

And it took a bit of time to figure out what exactly the problem was.

I appreciate folks emails (and this giant thread) about this but
without some sort of diagram it was hard to understand this
(at least to me).

So here it is in case somebody is doing code archaeology:

For simplicity we assume the guest/baremetal use the lazy mechanism
not eager. That makes 'switch_fpu_prepare' (called by schedule()) effectively:

if (previous task had PF_USED_MATH set)
   stts (CR0.TS=1)
else
   ;

I am ignoring the case if the task had used the FPU more than
five times - where we do things a bit different.

The time diagram looks great at 132x42.

Anyhow, lets assume that we have two tasks: A and B. Both
haven't used the FPU. This is on PVHVM:


CR0.TS=1                       CR0.TS=1                 CR0.TS=0                   CR0.TS=1                       CR0.TS=0
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |               | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |               |        |                   |       |
op     |  \- fpu_init            |      |                    |      |               |        |                   |       |
       |       \- .. schedule()  |      |                    |      |               |        |                   |       |
       |           [swap task B] |      |                    |      |               |        |                   |       |
       |           [since task A |      |                    |      |               |        |                   |       |
       |            hadn't set   |      |                    |      |               |        |                   |       |
       |            PF_USED_MATH |      |                    |      |               |        |                   |       |
       |            we don't muck|      |                    |      |               |        |                   |       |
       |            with CR0.TS] |      |                    |      |               |        |                   |       |
       |                         |MMX op|                    |      |               |        |                   |       |
       |                         |      |#NM                 |      |               |        |                   |       |
       |                         |      |math_state_restore  |      |               |        |                   |       |
       |                         |      | fpu_init worked    |      |               |        |                   |       |
       |                         |      |  clts()            |      |               |        |                   |       |
       |                         |      |task_B->flags |=    |      |               |        |                   |       |
       |                         |      |  PF_USED_MATH      |      |               |        |                   |       |
       |                         |      |  return;           |      |               |        |                   |       |
       |                         |      |                    |syscall|              |        |                   |       |
       |                         |      |                    |      |schedule()     |        |                   |       |
       |                         |      |                    |      |[swap task A]  |        |                   |       |
       |                         |      |                    |      |[taskB has     |        |                   |       |
       |                         |      |                    |      | PF_USED_MATH] |        |                   |       |
       |                         |      |                    |      |[so CR0.TS=1]  |        |                   |       |
       |                         |      |                    |      |  task A runs  |        |                   |       |
       |                         |      |                    |      |               |MMX op  |                   |       |
       |                         |      |                    |      |               |        |#NM                |       |
       |                         |      |                    |      |               |        | fpu_init works    |       |
       |                         |      |                    |      |               |        | clts()            |       |
       |                         |      |                    |      |               |        |  taskA->flags |=  |       |
       |                         |      |                    |      |               |        |  PF_USED_MATH     |       |
       |                         |      |                    |      |               |        |  return           |       |
       |                         |      |                    |      |               |        |                   |MMX op |



And under Xen PV:



CR0.TS=1                       CR0.TS=0                 CR0.TS=0                   CR0.TS=0                       CR0.TS=0
[but Xen sets it to
CR0.TS=0 and calls
Linux #NM:]
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |               | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |               |        |                   |       |
op     |  \- fpu_init            |      |                    |      |               |        |                   |       |
       |       \- .. schedule()  |      |                    |      |               |        |                   |       |
       |           [swap task B] |      |                    |      |               |        |                   |       |
       |           [since task A |      |                    |      |               |        |                   |       |
       |            hadn't set   |      |                    |      |               |        |                   |       |
       |            PF_USED_MATH |      |                    |      |               |        |                   |       |
       |            we don't muck|      |                    |      |               |        |                   |       |
       |            with CR0.TS] |      |                    |      |               |        |                   |       |
       |                         |MMX op|                    |      |               |        |                   |       |
       |                         |      |[no trap to Linux or|      |               |        |                   |       |
       |                         |      |Xen as CR0.TS=0]    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |                    |syscall|              |        |                   |       |
       |                         |      |                    |      |schedule()     |        |                   |       |
       |                         |      |                    |      |[swap task A]  |        |                   |       |
       |                         |      |                    |      |[with task B]  |        |                   |       |
       |                         |      |                    |      |  task A runs  |        |                   |       |
       |                         |      |                    |      |               |MMX op  |                   |       |
       |                         |      |                    |      |               |        |[again, no trap to |       |
       |                         |      |                    |      |               |        | Xen or Linux b/c  |       |
       |                         |      |                    |      |               |        | CR0.TS=0 *1]      |       |
       |                         |      |                    |      |               |        |                   |MMX op |

And so on - FPU registers are effectively leaking across tasks.

The [*1] refers to the Xen scheduler. If any of the
syscalls that the user application called, ended in the Linux kernel
halt (xen_safe_halt) routine - we would deschedule the guest VCPU.

When that VCPU is re-scheduled, Xen would set CR0.TS=1 back
so the #NM would function again.

Not pretty - and only happening if the fpu_alloc() ends
up calling the schedule().

I followed Jan's recommendation and cobbled this up:

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 324ab52..06b7843 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -837,11 +837,17 @@ asmlinkage __visible void __attribute__((weak)) smp_threshold_interrupt(void)
  * Must be called with kernel preemption disabled (eg with local
  * local interrupts as in the case of do_device_not_available).
  */
+#include <xen/xen.h>
 void math_state_restore(void)
 {
 	struct task_struct *tsk = current;
 
 	if (!tsk_used_math(tsk)) {
+		/*
+		 * See http://bugs.xenproject.org/xen/bug/40
+		 */
+		if (xen_pv_domain())
+			stts();
 		local_irq_enable();
 		/*
 		 * does a slab alloc which can sleep
@@ -854,6 +860,8 @@ void math_state_restore(void)
 			return;
 		}
 		local_irq_disable();
+		if (xen_pv_domain())
+			clts();
 	}
 
 	/* Avoid __kernel_fpu_begin() right after __thread_fpu_begin() */

For the older pvops kernels and trying it out now.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-09-11 16:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-06  6:41 [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Zhu Yanhai
2013-11-06  8:51 ` Jan Beulich
2013-11-06  9:15   ` Zhu Yanhai
2013-11-06  9:28     ` Jan Beulich
2014-02-27  0:04   ` Matt Wilson
2014-02-27  8:00     ` Jan Beulich
2014-02-27 12:46       ` George Dunlap
2014-02-27 12:37     ` David Vrabel
2014-02-27 12:21 ` George Dunlap
2014-02-27 12:30   ` Processed: " xen
2015-09-11 16:50 ` Konrad Rzeszutek Wilk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).