From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Vrabel Subject: Re: [PATCH] x86/fpu: CR0.TS should be set before trap into PV guest's #NM exception handler Date: Thu, 27 Feb 2014 12:37:43 +0000 Message-ID: <530F3197.2040403@citrix.com> References: <1383720072-6242-1-git-send-email-gaoyang.zyh@taobao.com> <527A113C02000078000FFF99@nat28.tlf.novell.com> <20140227000405.GA11825@u109add4315675089e695.ant.amazon.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: Received: from mail6.bemta5.messagelabs.com ([195.245.231.135]) by lists.xen.org with esmtp (Exim 4.72) (envelope-from ) id 1WJ0Dk-0005V4-Tt for xen-devel@lists.xenproject.org; Thu, 27 Feb 2014 12:37:49 +0000 In-Reply-To: <20140227000405.GA11825@u109add4315675089e695.ant.amazon.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Matt Wilson Cc: Charles Wang , Ian Campbell , George Dunlap , Andrew Cooper , Zhu Yanhai , Shen Yiben , David Vrabel , Jan Beulich , xen-devel , Zhu Yanhai , Wan Jia , Boris Ostrovsky List-Id: xen-devel@lists.xenproject.org On 27/02/14 00:04, Matt Wilson wrote: > On Wed, Nov 06, 2013 at 08:51:56AM +0000, Jan Beulich wrote: >>>>> On 06.11.13 at 07:41, Zhu Yanhai wrote: >>> As we know Intel X86's CR0.TS is a sticky bit, which means once set >>> it remains set until cleared by some software routines, in other words, >>> the exception handler expects the bit is set when it starts to execute. >> >> Since when would that be the case? CR0.TS is entirely unaffected >> by exception invocations according to all I know. All that is known >> here is that #NM wouldn't have occurred in the first place if CR0.TS >> was clear. >> >>> However xen doesn't simulate this behavior quite well for PV guests - >>> vcpu_restore_fpu_lazy() clears CR0.TS unconditionally in the very beginning, >>> so the guest kernel's #NM handler runs with CR0.TS cleared. Generally >>> speaking >>> it's fine since the linux kernel executes the exception handler with >>> interrupt disabled and a sane #NM handler will clear the bit anyway >>> before it exits, but there's a catch: if it's the first FPU trap for the >>> process, >>> the linux kernel must allocate a piece of SLAB memory for it to save >>> the FPU registers, which opens a schedule window as the memory >>> allocation might sleep -- and with CR0.TS keeps clear! >>> >>> [see the code below in linux kernel, >> >> You're apparently referring to the pvops kernel. >> >>> void math_state_restore(void) >>> { >>> struct task_struct *tsk = current; >>> >>> if (!tsk_used_math(tsk)) { >>> local_irq_enable(); >>> /* >>> * does a slab alloc which can sleep >>> */ >>> if (init_fpu(tsk)) { <<<< Here it might open a schedule window >>> /* >>> * ran out of memory! >>> */ >>> do_group_exit(SIGKILL); >>> return; >>> } >>> local_irq_disable(); >>> } >>> >>> __thread_fpu_begin(tsk); <<<< Here the process gets marked as a 'fpu user' >>> after the schedule window >>> >>> /* >>> * Paranoid restore. send a SIGSEGV if we fail to restore the state. >>> */ >>> if (unlikely(restore_fpu_checking(tsk))) { >>> drop_init_fpu(tsk); >>> force_sig(SIGSEGV, tsk); >>> return; >>> } >>> >>> tsk->fpu_counter++; >>> } >>> ] >> >> May I direct your attention to the XenoLinux one: >> >> asmlinkage void math_state_restore(void) >> { >> struct task_struct *me = current; >> >> /* NB. 'clts' is done for us by Xen during virtual trap. */ >> __get_cpu_var(xen_x86_cr0) &= ~X86_CR0_TS; >> if (!used_math()) >> init_fpu(me); >> restore_fpu_checking(&me->thread.i387.fxsave); >> task_thread_info(me)->status |= TS_USEDFPU; >> } >> >> Note the comment close to the beginning - the fact that CR0.TS >> is clear at exception handler entry is actually part of the PV ABI, >> i.e. by altering hypervisor behavior here you break all forward >> ported kernels. >> >> Nevertheless I agree that there is an issue, but this needs to be >> fixed on the Linux side (hence adding the Linux maintainers to Cc); >> this issue was introduced way back in 2.6.26 (before that there >> was no allocation on that path). It's not clear though whether >> using GFP_ATOMIC for the allocation would be preferable over >> stts() before calling the allocation function (and clts() if it >> succeeded), or whether perhaps to defer the stts() until we >> actually know the task is being switched out. It's going to be an >> ugly, Xen-specific hack in any event. > > Was there ever a resolution to this problem? I never saw a comment > from the Linux Xen PV maintainers. I think allocating on the context switch is mad and the irq enable/disable just to allow the allocation looks equally mad. I had vague plans to maintain a mempool for FPU contexts but couldn't immediately think how we could guarantee that the pool would be kept sufficiently populated. David