From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <5112B51A.6000207@siemens.com>
Date: Wed, 06 Feb 2013 20:55:06 +0100
From: Jan Kiszka <jan.kiszka@siemens.com>
MIME-Version: 1.0
References: <51128CE4.4020303@siemens.com> <51128E3E.808@xenomai.org>
	<511293EB.1080502@siemens.com> <5112945F.8080102@xenomai.org>
	<51129599.3080709@siemens.com> <51129693.1040400@xenomai.org>
	<5112974A.8050008@siemens.com> <5112982B.1020901@xenomai.org>
	<5112A06A.7030809@siemens.com> <5112A175.5010002@xenomai.org>
	<5112A269.40609@siemens.com> <5112A392.3050302@xenomai.org>
	<5112AD78.5080308@siemens.com> <5112AF72.3020201@xenomai.org>
In-Reply-To: <5112AF72.3020201@xenomai.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai] ipipe/x86: do not restore during context switch
List-Id: Discussions about the Xenomai project <xenomai.xenomai.org>
To: Gilles Chanteperdrix <gilles.chanteperdrix@xenomai.org>
Cc: Xenomai <xenomai@xenomai.org>

On 2013-02-06 20:30, Gilles Chanteperdrix wrote:
> On 02/06/2013 08:22 PM, Jan Kiszka wrote:
> 
>> On 2013-02-06 19:40, Gilles Chanteperdrix wrote:
>>> On 02/06/2013 07:35 PM, Jan Kiszka wrote:
>>>
>>>> On 2013-02-06 19:31, Gilles Chanteperdrix wrote:
>>>>> On 02/06/2013 07:26 PM, Jan Kiszka wrote:
>>>>>
>>>>>> On 2013-02-06 18:51, Gilles Chanteperdrix wrote:
>>>>>>> On 02/06/2013 06:47 PM, Jan Kiszka wrote:
>>>>>>>
>>>>>>>> On 2013-02-06 18:44, Gilles Chanteperdrix wrote:
>>>>>>>>> On 02/06/2013 06:40 PM, Jan Kiszka wrote:
>>>>>>>>>
>>>>>>>>>> On 2013-02-06 18:35, Gilles Chanteperdrix wrote:
>>>>>>>>>>> On 02/06/2013 06:33 PM, Jan Kiszka wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 2013-02-06 18:09, Gilles Chanteperdrix wrote:
>>>>>>>>>>>>> On 02/06/2013 06:03 PM, Jan Kiszka wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Gilles,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> do you remember if this core-3.4 change was a performance optimization
>>>>>>>>>>>>>> or a necessary fix? Also, I'm not yet understanding why we need all the
>>>>>>>>>>>>>> #ifdefs except for the first one which forces fpu.preload to 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is a performance optimization, without it, we systematically hit the
>>>>>>>>>>>>> maximum latency when the timer would tick during a context switch which
>>>>>>>>>>>>> restores the FPU. Note that if you change that, you will probably break
>>>>>>>>>>>>> -forge.
>>>>>>>>>>>>
>>>>>>>>>>>> According to the Intel folks who introduced eagerfpu, xsave, or at least
>>>>>>>>>>>> xsaveopt (which I haven't implemented yet) is now faster than serializing
>>>>>>>>>>>> clts/stts. On the other hand, the worst case is a full SSE + AVX restore
>>>>>>>>>>>> while the target RT task is not depending on the FPU.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Without xsave, we never restore fpu if the RT task never used it. This
>>>>>>>>>>> changes with xsave?
>>>>>>>>>>
>>>>>>>>>> This would change with eagerfpu which depends on xsave. The kernel
>>>>>>>>>> sticks with lazy switching in the absence of xsaveopt.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not sure you understand what I mean, so, I am going to reformulate.
>>>>>>>>> Without xsave, Linux uses lazy fpu restore, and Xenomai uses eager fpu
>>>>>>>>> restore. But Xenomai eager fpu restore is a nop if the RT task never
>>>>>>>>> used FPU since its inception (and all the parents from which it is
>>>>>>>>> cloned never used FPU either). Does Linux eager switching mean the same
>>>>>>>>> thing?
>>>>>>>>
>>>>>>>> eagerfpu means: always call xsaveopt/xrstor, it will optimize the case
>>>>>>>> that the FPU was unused by the source/destination. And no fiddling with
>>>>>>>> TS anymore, at no time.
>>>>>>>
>>>>>>>
>>>>>>> I still do not understand this sentence then: "the worst case is a full
>>>>>>> SSE + AVX restore while the target RT task is not depending on the FPU."
>>>>>>> If the RT task does not depend on the FPU, why would xsaveopt/xrstor
>>>>>>> restore SSE and AVX context?
>>>>>>
>>>>>> Switching between two tasks that both use the full state space defines
>>>>>> the maximum latency of the FPU save/restore step. We cannot interrupt
>>>>>> xsave or xrstor instructions, but we couldn't interrupt fxsave either.
>>>>>>
>>>>>> What we can do, though, is to ensure that we have at least a preemption
>>>>>> point between both. Do we have such a thing so far, a chance to handle a
>>>>>> Xenomai IRQ between some FPU save for Linux task A and an FPU restore for
>>>>>> the following task B? If not, the discussion is moot and we are just
>>>>>> shifting probabilities of the very same worst case.
>>>>>
>>>>>
>>>>> We can implement unlocked context switch support on x86 as we do on
>>>>> other platforms. I tried that on atom actually and it did not really
>>>>> improve latencies. You do not answer my question though, why would
>>>>> xsave/xrstor do anything if the RT thread has not used FPU (and all its
>>>>> parents have not used fpu) ?
>>>>
>>>> We first of all would have to wait for the unrelated switch between
>>>> those two Linux tasks before we could handle the IRQ and switch to the
>>>> FPU-free RT task. __switch_to is atomic, also for Linux->Linux, no?
>>>
>>>
>>> Only the *IP and *SP switch need to be atomic, the whole __switch_to can
>>> be split in several atomic sections, this is what I tested on atom. But
>>> as I said, it did not lead to any latency improvement.
>>
>> Ok, so back to the patch about which this discussion started: It
>> enforced that Linux only saves the FPU state on switches, never directly
>> restores it but enforces lazy restoring, right? To ensure that
>> save+restore for Linux tasks is always interruptible in the middle.
>> However, that sounds pretty expensive when applying FPU/SSE/etc. load on
>> Linux.
> 
> 
> To the contrary, the overhead is the cost of the fault (with the
> user/kernel and kernel/user switches), so, the larger the context
> switch, the smaller the overhead in proportion.

Yes, continuously faulting in the FPU state of heavy FPU users on the
Linux side is the problem. That must be changed.

> 
>>
>> Instead of always doing stts for the new task, we could do the restore
>> later, after the hard_local_irq_enable of __ipipe_switch_tail. That
>> should allow the eager model for Linux as well without making
>> save+restore of Linux-Linux switches atomic.
> 
> 
> That could be done, but it is probably simpler to implement unlocked
> context switch, and split __switch_to into several atomic sections.

Yep, indeed.
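
To make the trade-off between the three variants discussed here easier to
compare, below is a rough toy model in C. It is only a sketch: the struct
and function names are made up for illustration, the cycle costs are
placeholders rather than measurements, and it assumes that the lazy restore
in the fault handler runs as its own masked section. It is not kernel or
I-pipe code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-operation costs in cycles; placeholders, not measurements. */
struct costs {
    int save;     /* saving the outgoing task's FPU state            */
    int restore;  /* restoring the incoming task's FPU state         */
    int sw;       /* the IP/SP switch itself, which must stay atomic */
    int fault;    /* trap round trip taken on first FPU use (lazy)   */
};

struct result {
    int worst_masked; /* longest section run with hard IRQs off      */
    int total;        /* total cost when the incoming task uses FPU  */
};

static int max2(int a, int b) { return a > b ? a : b; }

/* Eager restore inside one atomic, __switch_to-like section. */
struct result eager_atomic(const struct costs *c)
{
    struct result r;
    assert(c->save >= 0 && c->restore >= 0 && c->sw >= 0 && c->fault >= 0);
    r.total = c->save + c->sw + c->restore;
    r.worst_masked = r.total;   /* nothing can preempt in between */
    return r;
}

/* Save only at switch time; the restore happens later in the fault
 * handler, assumed here to run as its own masked section. */
struct result lazy_restore(const struct costs *c)
{
    struct result r;
    r.worst_masked = max2(c->save + c->sw, c->restore);
    r.total = c->save + c->sw + c->fault + c->restore;
    return r;
}

/* Eager restore, but only after hard IRQs were re-enabled once,
 * so there is a preemption point between save and restore. */
struct result eager_split(const struct costs *c)
{
    struct result r;
    r.worst_masked = max2(c->save + c->sw, c->restore);
    r.total = c->save + c->sw + c->restore;
    return r;
}

/* Print one model's numbers on a single line. */
void print_result(const char *name, struct result r)
{
    printf("%-13s worst masked: %5d  total: %5d\n",
           name, r.worst_masked, r.total);
}
```

With placeholder costs of save = 200, restore = 200, switch = 50 and
fault = 1000, the model gives a worst masked section of 450 for the atomic
eager case and 250 for both the lazy and the split eager case, while the
total per switch is 1450 for lazy and 450 for both eager variants. In this
model the split variant keeps the latency bound of the lazy scheme without
paying for the fault on every switch of an FPU-heavy Linux task.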

> Anyway, any change in this area will probably break the work done for
> kthreads on -forge, so, can't we postpone this?

For how long? What are the dependencies? I thought unlocked context
switches already exist for other archs.

At least I will need to look into this internally - we are using less
than 10% of our CPUs for RT, the rest wants high performance.

Jan

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux