xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler
@ 2013-11-06  6:41 Zhu Yanhai
  2013-11-06  8:51 ` Jan Beulich
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Zhu Yanhai @ 2013-11-06  6:41 UTC (permalink / raw)
  To: xen-devel
  Cc: Charles Wang, Ian Campbell, George Dunlap, Andrew Cooper,
	Zhu Yanhai, Shen Yiben, Wan Jia

As we know Intel X86's CR0.TS is a sticky bit, which means once set
it remains set until cleared by some software routines, in other words,
the exception handler expects the bit is set when it starts to execute.

However xen doesn't simulate this behavior quite well for PV guests -
vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning,
so the guest kernel's #NM handler runs with CR0.TS cleared. Generally speaking
it's fine since the linux kernel executes the exception handler with
interrupt disabled and a sane #NM handler will clear the bit anyway
before it exits, but there's a catch: if it's the first FPU trap for the process,
the linux kernel must allocate a piece of SLAB memory for it to save
the FPU registers, which opens a schedule window as the memory
allocation might sleep -- and with CR0.TS keeps clear!

[see the code below in linux kernel,

void math_state_restore(void)
{
    struct task_struct *tsk = current;

    if (!tsk_used_math(tsk)) {
        local_irq_enable();
        /*
         * does a slab alloc which can sleep
         */
        if (init_fpu(tsk)) {                 <<<< Here it might open a schedule window
            /*
             * ran out of memory!
             */
            do_group_exit(SIGKILL);
            return;
        }
        local_irq_disable();
    }

    __thread_fpu_begin(tsk);    <<<< Here the process gets marked as a 'fpu user'
                                         after the schedule window

    /*
     * Paranoid restore. send a SIGSEGV if we fail to restore the state.
     */
    if (unlikely(restore_fpu_checking(tsk))) {
        drop_init_fpu(tsk);
        force_sig(SIGSEGV, tsk);
        return;
    }

    tsk->fpu_counter++;
}
]

The check in linux kernel's switch_fpu_prepare() doesn't stts() either because
the current process only gets marked as a FPU user after the schedule window
(the story is a bit different if eagerfpu is enabled, anyway a sane hypervisor
cannot depend on such undetermined things). And then supposing that the new
process scheduled-in wants to touch FPU, nobody will do fxrstor/frstor for it anymore,
conducing to a serious data damage.

Also, The point is everything is fine on linux + baremetal since CR0.TS will
keep set until the interrupted #NM handler got the memory it needs and exits,
so the incomer FPU user will get trapped as it's supposed to be.

The test case is as below,

        buf = malloc(BUF_SIZE);
        if (!buf) {
                fprintf(stderr, "error %s during %s\n",
                        strerror(-err),
                        "malloc");
                return 1;
        }
        memset(buf, IO_PATTERN, BUF_SIZE);
        memset(cmp_buf, IO_PATTERN, BUF_SIZE);

        if (memcmp(buf, cmp_buf, BUF_SIZE)) {
                unsigned long long *ubuf = (unsigned long long *)buf;
                int i;

                for (i = 0; i < BUF_SIZE / sizeof(unsigned long long); i++)
                        printf("%d: 0x%llx\n", i, ubuf[i]);

                return 2;
        }

Two shell scripts on each box's dom0 runs above program repeatedly until
the compare fails (so every time the C program is a new fpu user and triggers
memory allocation). we can see the data damage at least once with
xen 4.3 + linux 2.6.32 on ~200 physical machines within two hours.
With xen 4.3 + linux 3.11.6 stable it becomes harder to reproduce
(guess it's because of the eagerfpu feature introduced in linux kernel 3.7)
but it's still possible to come out within about four hours.

The fix here is trying to make xen behave as close to the hardware as possible.

This bug only has effects on PV guests (and including dom0 kernel of course).

Cc: Wan Jia <jia.wanj@alibaba-inc.com>
Cc: Shen Yiben <zituan@taobao.com>
Cc: Charles Wang <muming.wq@taobao.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com>
---
 xen/arch/x86/traps.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 77c200b..b0321a6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3267,8 +3267,8 @@ void do_device_not_available(struct cpu_user_regs *regs)
 
     if ( curr->arch.pv_vcpu.ctrlreg[0] & X86_CR0_TS )
     {
+        stts();
         do_guest_trap(TRAP_no_device, regs, 0);
-        curr->arch.pv_vcpu.ctrlreg[0] &= ~X86_CR0_TS;
     }
     else
         TRACE_0D(TRC_PV_MATH_STATE_RESTORE);
-- 
1.7.4.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-09-11 16:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-11-06  6:41 [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Zhu Yanhai
2013-11-06  8:51 ` Jan Beulich
2013-11-06  9:15   ` Zhu Yanhai
2013-11-06  9:28     ` Jan Beulich
2014-02-27  0:04   ` Matt Wilson
2014-02-27  8:00     ` Jan Beulich
2014-02-27 12:46       ` George Dunlap
2014-02-27 12:37     ` David Vrabel
2014-02-27 12:21 ` George Dunlap
2014-02-27 12:30   ` Processed: " xen
2015-09-11 16:50 ` Konrad Rzeszutek Wilk

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).