* FP register corruption in Exynos 4210 (Cortex-A9)
@ 2014-10-07 21:48 Lanchon
2014-10-07 22:15 ` Russell King - ARM Linux
0 siblings, 1 reply; 20+ messages in thread
From: Lanchon @ 2014-10-07 21:48 UTC (permalink / raw)
To: linux-arm-kernel
Hi,
There is a longstanding bug in all the after-market kernels (and maybe
manufacturer's kernels too) for all the Exynos 4210 (Cortex-A9)-based
devices. These include:
Samsung Galaxy S II
Samsung Galaxy Note
Samsung Galaxy Tab 7.0 Plus
...and others.
Under rare conditions which are not easy to reproduce, floating point
registers of userland processes get clobbered.
There is a vital FUSE process in Android 4.4 (called 'sdcard.c') that
mediates access to internal phone storage as an emulated sdcard, and to
external sdcards too. This process, normally compiled using
-mfloat-abi=softfp, calls pread64() after saving the value of a 64-bit
integer variable (called 'unique') in an FPU register (d8). On very rare
occasions, upon return from pread64() the value of the FP register is
corrupted; as a result the process stops responding and the devices
lose access to storage.
There are other instabilities in the platform suspected of having the
same cause. This bug has plagued the platform for years, but only
recently was FP clobbering identified as the culprit.
More context:
This only happens on 4210-based devices. The same kernel tree compiled
for 4212- and 4412-based devices does not exhibit the behavior. (The
4x12 SoCs are a newer iteration of the 4210, with the 'x' corresponding
to the number of cores. See:
http://en.wikipedia.org/wiki/Exynos#List_of_Exynos_SoCs ) This points to
a hardware issue (maybe an erratum workaround missing from the kernel)
or to a driver issue.
Simply busy-spinning in userland waiting for FP corruption does not seem
to trigger the issue. Concurrently accessing storage in another process
while spinning also does not work; power management (sleep, etc) may be
involved.
Compiling 'sdcard.c' using -mfloat-abi=soft solves the issue (for this
vital process) since the 'unique' variable is then saved in regular
registers instead of FP registers.
Objdumping the complete kernel does not show any instructions that
access 'd' registers, except in context switching code, and in the code
that implements traps that old VFP units need to handle some corner
cases. Also, objdumps of *.ko files do not reveal any instructions that
access 'd' registers.
We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in
these kernels; the code is just not there.
Some links:
One of the affected kernel trees:
https://github.com/CyanogenMod/android_kernel_samsung_smdk4412/tree/cm-11.0
First direct observation of corruption:
http://forum.xda-developers.com/showthread.php?p=51237856&highlight=unique
The 'sdcard.c' process:
http://forum.xda-developers.com/showthread.php?p=55787440
Post showing that 'unique' is saved in 'd8':
http://forum.xda-developers.com/showthread.php?p=55783884
A busy-spin FP corruption test (that fails to reproduce the bug):
http://forum.xda-developers.com/showthread.php?p=55861206
Objdumps:
http://forum.xda-developers.com/showthread.php?p=55839635
And finally some questions:
Kernel threads (such as the worker thread of a threaded interrupt)
should guard FPU access in between 'kernel_XXX_begin'/'kernel_XXX_end'
calls (which our kernels do not implement). But what if they did not?
1) What is the FPU enable state while executing a kernel thread in ARM
arch? Which of these answers is correct?
1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending
on the FPU enable state of the userland context that executed before
and/or some other factors.
2) What would happen if a kernel thread executed an FPU instruction
without the 'kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and
the FPU was disabled at the time?
2a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
2b) In the FPU trap the kernel might enable the FPU, load the FPU
context of some userland process and resume the kernel thread.
Of course an ISR should not touch the FPU at all. But what if it did?
3) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was disabled in the context that was interrupted:
3a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
3b) In the FPU trap the kernel would react as if the interrupted context
executed the FPU instruction: If the interrupted context was user mode,
it would restore the userland process' FP context into the FPU. If the
interrupted context was kernel mode, it would react as per the answer to
question 2) above.
4) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was enabled in the context that was interrupted:
4a) The processor would disable the FPU on ISR entry automatically and
thus the system would behave as described in the answer to question 3)
above.
4b) If the driver uses the standard kernel interrupt dispatch
architecture, the kernel would disable the FPU before dispatching the
interrupt to the driver ISR, and so the system would also behave as
described in 3).
4c) The FPU instruction would execute. There is no fail-fast or
detection of this kind of violation by the kernel.
Of course every pointer, idea, or suspicion that might seem relevant to
the case is welcome.
Thank you very much for reading and for your help.
Regards,
Lanchon
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-07 21:48 FP register corruption in Exynos 4210 (Cortex-A9) Lanchon
@ 2014-10-07 22:15 ` Russell King - ARM Linux
2014-10-08 7:58 ` Lanchon
2014-10-08 8:19 ` Lanchon
0 siblings, 2 replies; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-07 22:15 UTC (permalink / raw)
To: linux-arm-kernel
On Tue, Oct 07, 2014 at 06:48:23PM -0300, Lanchon wrote:
> Simply busy-spinning in userland waiting for FP corruption does not seem
> to trigger the issue. Concurrently accessing storage in another process
> while spinning also does not work; power management (sleep, etc) may be
> involved.
You need two processes accessing VFP to cause VFP state to be saved and
restored.
> We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in
> these kernels; the code is just not there.
Which means that the kernel itself must /never/ make use of floating
point itself - if it does, it /will/ corrupt the user state in the way
you are seeing. That's a pretty hard requirement, and something that
we have enforced with mainline kernels by building the kernel in
soft FP mode, thereby preventing the compiler emitting FP instructions.
Hence, the only way to get VFP instructions in the kernel is via
explicit assembly sequences.
The exception to this rule is the VFP support code itself, which
maintains the VFP state on behalf of the hardware and userspace (and
even then, that code is only concerned with reading and writing the
VFP registers, not using FP itself.)
In SMP environments, VFP state is saved each time we context switch
away from a thread. If we resume the thread on the _same_ CPU and
no one else has used the VFP since, we just re-enable access to VFP.
Otherwise, we re-load the VFP state from the previously saved state.
In UP environments, we do something similar, but we don't save until
we need to.
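The SMP policy described above can be sketched as a toy model. This is only an illustration under stated assumptions, not kernel code: the `Thread` and `Cpu` classes, their fields, and the single modelled register `d8` are all invented for the sketch.

```python
class Thread:
    def __init__(self, name):
        self.name = name
        self.saved_d8 = 0      # copy of d8 kept in memory
        self.last_cpu = None   # CPU whose hardware last held this state


class Cpu:
    def __init__(self):
        self.d8 = 0            # the hardware register
        self.enabled = False   # is VFP access enabled for the running thread?
        self.owner = None      # thread whose state currently sits in hardware
        self.reloads = 0       # how often state was loaded back from memory

    def switch_away(self, thread):
        # SMP policy: save whenever we leave a thread that had VFP enabled.
        if self.enabled and self.owner is thread:
            thread.saved_d8 = self.d8
        self.enabled = False

    def first_vfp_access(self, thread):
        # Models the trap taken on the first VFP instruction after a switch.
        still_live = self.owner is thread and thread.last_cpu is self
        if not still_live:
            self.d8 = thread.saved_d8   # reload the previously saved state
            self.reloads += 1
        self.owner = thread
        thread.last_cpu = self
        self.enabled = True
```

In this model, resuming a thread on the same CPU with no intervening VFP user only flips `enabled` back on. Any other path goes through the reload branch.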
However, neon shares the VFP registers, and we have some code (crypto
stuff) which uses neon, and this has appropriate guards to ensure that
userspace does not see any changes. This is only available when
CONFIG_KERNEL_MODE_NEON is enabled (but as you say you don't have
kernel_neon_begin anywhere, you should /never/ execute any neon
instructions in the kernel.)
I hope this helps; I didn't answer your specific questions because it
seemed I would just end up repeating what I've said above.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-07 22:15 ` Russell King - ARM Linux
@ 2014-10-08 7:58 ` Lanchon
2014-10-08 8:19 ` Lanchon
1 sibling, 0 replies; 20+ messages in thread
From: Lanchon @ 2014-10-08 7:58 UTC (permalink / raw)
To: linux-arm-kernel
thank you for your answer, please see comments below.
On 10/07/2014 07:15 PM, Russell King - ARM Linux wrote:
> On Tue, Oct 07, 2014 at 06:48:23PM -0300, Lanchon wrote:
>> Simply busy-spinning in userland waiting for FP corruption does not seem
>> to trigger the issue. Concurrently accessing storage in another process
>> while spinning also does not work; power management (sleep, etc) may be
>> involved.
> You need two processes accessing VFP to cause VFP state to be saved and
> restored.
yes. these are dual core systems so i used 4 simultaneous processes
running the busy-spin.
>> We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in
>> these kernels; the code is just not there.
> Which means that the kernel itself must /never/ make use of floating
> point itself - if it does, it /will/ corrupt the user state in the way
> you are seeing. That's a pretty hard requirement, and something that
> we have enforced with mainline kernels by building the kernel in
> soft FP mode, thereby preventing the compiler emitting FP instructions.
> Hence, the only way to get VFP instructions in the kernel is via
> explicit assembly sequences.
>
> The exception to this rule is the VFP support code itself, which
> maintains the VFP state on behalf of the hardware and userspace (and
> even then, that code is only concerned with reading and writing the
> VFP registers, not using FP itself.)
and also the VFP support trap for corner cases needed in old VFP
implementations (VFP 2?). as i said before, this is consistent with what
i found with objdump: only context switch and old VFP support trap code.
>
> In SMP environments, VFP state is saved each time we context switch
> away from a thread. If we resume the thread on the _same_ CPU and
> no one else has used the VFP since, we just re-enable access to VFP.
> Otherwise, we re-load the VFP state from the previously saved state.
>
> In UP environments, we do something similar, but we don't save until
> we need to.
this is SMP, and i verified that the resulting kernel uses eager FP
state save (as required for SMP) and lazy restore.
>
> However, neon shares the VFP registers, and we have some code (crypto
> stuff) which uses neon, and this has appropriate guards to ensure that
> userspace does not see any changes. This is only available when
> CONFIG_KERNEL_MODE_NEON is enabled (but as you say you don't have
> kernel_neon_begin anywhere, you should /never/ execute any neon
> instructions in the kernel.)
no other neon/vfp instructions found in objdumps. the crypto
acceleration (if the crypto code is in our trees at all) must be
disabled then, for lack of CONFIG_KERNEL_MODE_NEON or some other config.
i am grepping the output of the full kernel and *.ko objdumps (see
previous link) for 'dN' and 'dNN'; i am supposing that any useful
VFP/NEON code that clobbers d8 should refer to some 'd' register by name.
>
> I hope this helps; I didn't answer your specific questions because it
> seemed I would just end up repeating what I've said above.
>
actually no, answers to my very specific questions would help me
understand this: if we had a closed-source driver (ISR or kernel thread)
that touched the FPU, how would the kernel react? would the kernel
fast-fail in every possible instance? if not, where would the code need
to be and under what circumstances would it not cause fast-fail? knowing
this would help me find the offending code (if such code exists; it may
well be hardware error).
thanks again.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-07 22:15 ` Russell King - ARM Linux
2014-10-08 7:58 ` Lanchon
@ 2014-10-08 8:19 ` Lanchon
2014-10-08 8:27 ` Russell King - ARM Linux
2014-10-08 8:35 ` Russell King - ARM Linux
1 sibling, 2 replies; 20+ messages in thread
From: Lanchon @ 2014-10-08 8:19 UTC (permalink / raw)
To: linux-arm-kernel
On 10/07/2014 07:44 PM, Russell King - ARM Linux wrote:
> On Tue, Oct 07, 2014 at 07:35:14PM -0300, Lanchon wrote:
>>> I hope this helps; I didn't answer your specific questions because it
>>> seemed I would just end up repeating what I've said above.
>>>
>> actually no, answers to my very specific questions would help me
>> understand this: if we had a close-source driver (ISR or kernel thread)
>> that touched the FPU, how would the kernel react?
> I already covered this. It would corrupt the VFP state, thereby
> corrupting the VFP state which userspace sees.
>
> Hence why I said:
>
> Which means that the kernel itself must /never/ make use of floating
> point itself - if it does, it /will/ corrupt the user state in the way
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> you are seeing.
> ^^^^^^^^^^^^^^^
>
> How can I make that more clear?
no, actually you did not answer my questions. you stated that the end
result would be corruption of user FP state, which i already know. i am
inquiring as to *how* the process of corruption comes about exactly, not
the end result.
knowing exactly how corruption can happen and how it cannot would help
me decide where to look for the offending code.
for instance, you say that if an ISR uses the FPU it would corrupt user
FP state. fine, but it is not that simple. what if the FPU was disabled
at the time of interrupt? (ie: lazy restore did not yet happen in this
time-slice.) then the ISR FPU instruction would trap, not corrupt
immediately. would the kernel recognize the trap was generated in ISR
code and panic, or just blindly restore the FP context of the
interrupted thread? if the former is true, then i can discount ISRs as
sources of corruptions because i am not seeing panics, so there is no
point in instrumenting ISRs. if the latter is true, ok fine... but what
if the interrupted thread was a kernel thread? where would the restored
FP context come from?
answering these questions requires knowledge of both the architecture of
the linux kernel and of cortex-A, and i know neither of them, which is
why i am asking in this list.
a plausible answer (which i am making up out of the blue) would be:
"each cpu is always working in the context of a 'current' or 'executing'
userland process (which may be the idle process), with the MMU
configured to its virtual address space and all, even when the cpu is
executing a kernel thread. the FPU state and handling is not affected by
user/kernel mode switches, only by userland context switches. this means
that if a kernel thread executes FP instructions, the kernel will trap
if the FPU is disabled and happily restore the context of the current
userland process of the CPU for the kernel thread to corrupt next, never
noticing that the trap originated in kernel mode.
also the arm architecture will not disable the FPU on interrupt
processing, and the kernel will not disable the FPU prior to dispatching
the interrupt to the registered drivers. so the same thing would happen
in an ISR, even if the ISR is interrupting a kernel thread."
another plausible answer would be:
"the kernel always disables the FPU on scheduling a kernel thread. the
cost of fiddling with this is low compared to the safety it provides. if
triggered, the FPU trap will notice that the CPU is in kernel mode and
panic, failing fast. there are no special rules applying to interrupts:
if an ISR issues FP instructions they will be handled as if they had
been issued in the interrupted thread (kernel: panic; user: lazy restore
and/or execution)."
yet another:
"the FPU is effectively disabled on interrupt processing by the arm
architecture. while running in interrupt mode, and independent of the
FPU enable status, all FP instructions will trap to a different FPU
vector which will cause the kernel to panic."
any and all of these hypothetical details would help me determine where
*not* to look for the cause of the problem, where and what type of
instrumentation is worth trying, etc. a simple "state would be
corrupted" sentence does not give me any useful information that helps
me find the source of the problem, but understanding the process of
corruption might. (disregarding the fact that this is probably a
hardware bug, maybe a cache coherence problem or something of the sort,
and there might be no error in the code at all.)
this is why i will close this email with a copy of my questions for
context. maybe someone can provide the answer for some.
thanks again, and in advance to anyone who can help.
regards,
lanchon
--------------------------
Kernel threads (such as the worker thread of a threaded interrupt)
should guard FPU access in between 'kernel_XXX_begin'/'kernel_XXX_end'
calls (which our kernels do not implement). But what if they did not?
1) What is the FPU enable state while executing a kernel thread in ARM
arch? Which of these answers is correct?
1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending
on the FPU enable state of the userland context that executed before
and/or some other factors.
2) What would happen if a kernel thread executed an FPU instruction
without the 'kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and
the FPU was disabled at the time?
2a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
2b) In the FPU trap the kernel might enable the FPU, load the FPU
context of some userland process and resume the kernel thread.
Of course an ISR should not touch the FPU at all. But what if it did?
3) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was disabled in the context that was interrupted:
3a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
3b) In the FPU trap the kernel would react as if the interrupted context
executed the FPU instruction: If the interrupted context was user mode,
it would restore the userland process' FP context into the FPU. If the
interrupted context was kernel mode, it would react as per the answer to
question 2) above.
4) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was enabled in the context that was interrupted:
4a) The processor would disable the FPU on ISR entry automatically and
thus the system would behave as described in the answer to question 3)
above.
4b) If the driver uses the standard kernel interrupt dispatch
architecture, the kernel would disable the FPU before dispatching the
interrupt to the driver ISR, and so the system would also behave as
described in 3).
4c) The FPU instruction would execute. There is no fail-fast or
detection of this kind of violation by the kernel.
Of course every pointer, idea, or suspicion that might seem relevant to
the case is welcome.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 8:19 ` Lanchon
@ 2014-10-08 8:27 ` Russell King - ARM Linux
2014-10-08 8:35 ` Russell King - ARM Linux
1 sibling, 0 replies; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-08 8:27 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>
> On 10/07/2014 07:44 PM, Russell King - ARM Linux wrote:
>> On Tue, Oct 07, 2014 at 07:35:14PM -0300, Lanchon wrote:
>>>> I hope this helps; I didn't answer your specific questions because it
>>>> seemed I would just end up repeating what I've said above.
>>>>
>>> actually no, answers to my very specific questions would help me
>>> understand this: if we had a close-source driver (ISR or kernel thread)
>>> that touched the FPU, how would the kernel react?
>> I already covered this. It would corrupt the VFP state, thereby
>> corrupting the VFP state which userspace sees.
>>
>> Hence why I said:
>>
>> Which means that the kernel itself must /never/ make use of floating
>> point itself - if it does, it /will/ corrupt the user state in the way
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> you are seeing.
>> ^^^^^^^^^^^^^^^
>>
>> How can I make that more clear?
>
> no, actually you did not answer my questions. you stated that the end
> result would be corruption of user FP state, which i already know. i am
> inquiring as to *how* the process of corruption comes about exactly, not
> the end result.
It is really /very/ simple.
1. ISR changes VFP registers.
2. Userspace sees changed VFP registers.
3. Userspace state is corrupted.
For some reason, you think that there's more going on here than that.
There isn't. The kernel sees the very same set of registers as
userspace sees. Any changes which the kernel makes to those registers
will be visible to userspace.
Hence, using VFP instructions in the kernel will result in VFP registers
changing. Userspace will then see the changed VFP registers. The
userspace state will then be corrupted.
Simple. Really.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 8:19 ` Lanchon
2014-10-08 8:27 ` Russell King - ARM Linux
@ 2014-10-08 8:35 ` Russell King - ARM Linux
2014-10-08 8:53 ` Ard Biesheuvel
1 sibling, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-08 8:35 UTC (permalink / raw)
To: linux-arm-kernel
On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
> for instance, you say that if an ISR uses the FPU it would corrupt user
> FP state. fine, but it is not that simple. what if the FPU was disabled
> at the time of interrupt? (ie: lazy restore did not yet happen in this
> time-slice.)
At that point, it depends on which kernel version you are using. Yes,
older kernels will just restore the state. Newer kernels will trap this
and complain.
> a plausible answer (which i am making up out of the blue) would be:
If you want to continue asking questions and getting answers, change
your attitude; I am not a child.
You should also consider *not* writing essays, but instead ask clear,
direct and to the point questions - in other words, short emails. Not
everyone has the time or the patience to read huge long emails, or
huge rambling threads of 50+ pages on web forums.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 8:35 ` Russell King - ARM Linux
@ 2014-10-08 8:53 ` Ard Biesheuvel
2014-10-08 9:22 ` Ard Biesheuvel
2014-10-09 22:20 ` Lanchon
0 siblings, 2 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-10-08 8:53 UTC (permalink / raw)
To: linux-arm-kernel
On 8 October 2014 10:35, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>> for instance, you say that if an ISR uses the FPU it would corrupt user
>> FP state. fine, but it is not that simple. what if the FPU was disabled
>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>> time-slice.)
>
> At that point, it depends on which kernel version you are using. Yes,
> older kernels will just restore the state. Newer kernels will trap this
> and complain.
>
Indeed. As part of the kernel mode NEON support (which landed in 3.12
I think?), the VFP trap handling now checks whether it occurred in
kernel mode or user mode.
Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
"""
	ldr	r3, [sp, #S_PSR]	@ Neither lazy restore nor FP exceptions
	and	r3, r3, #MODE_MASK	@ are supported in kernel mode
	teq	r3, #USR_MODE
	bne	vfp_kmode_exception	@ Returns through lr
"""
Without these lines, the lazy restore machinery may kick in during the
execution of an ISR that uses NEON registers inadvertently, and
overwrite your VFP state with that of the process that happens to be
active when the interrupt is taken.
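For readers less used to ARM assembly, the check quoted above amounts to masking the mode bits of the saved PSR and comparing them with user mode. A minimal Python rendering follows. The constants are the usual ARM mode encodings; the function name is made up for this sketch.

```python
MODE_MASK = 0x1F   # low five bits of the PSR hold the processor mode
USR_MODE = 0x10
IRQ_MODE = 0x12
SVC_MODE = 0x13


def trap_came_from_kernel(saved_psr):
    """Mirror of the and/teq/bne sequence: true for any non-user mode."""
    return (saved_psr & MODE_MASK) != USR_MODE
```

A trap for which this returns true is the case routed to `vfp_kmode_exception` instead of the lazy restore path.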
You should also be aware that q4 is an alias of d8-d9, so grep'ing
your objdump for d8 is not sufficient.
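To make the aliasing concrete: q4 overlaps d8-d9, and the single-precision registers s16 and s17 overlap d8 as well, so a search has to cover all of them plus register ranges such as {d8-d15}. The following is a rough sketch of such a scan, assuming GNU objdump's tab-separated "address, encoding, mnemonic, operands" layout. The function names are invented here, and mnemonics are deliberately skipped because type suffixes such as ".s16" would otherwise produce false hits.

```python
import re

# A register name, optionally followed by a range of the same class (d8-d15).
_REG = re.compile(r"\b([dqs])(\d{1,2})(?:\s*-\s*\1(\d{1,2}))?\b")


def d_registers_touched(operands):
    """Return the set of d-register indices named, directly or via an alias."""
    touched = set()
    for kind, lo, hi in _REG.findall(operands):
        first = int(lo)
        last = int(hi) if hi else first
        for n in range(first, last + 1):
            if kind == "d":
                touched.add(n)
            elif kind == "q":
                touched.update((2 * n, 2 * n + 1))   # qN = d(2N), d(2N+1)
            else:
                touched.add(n // 2)                  # s(2N), s(2N+1) = dN
    return touched


def lines_touching(objdump_text, d_index=8):
    """Return disassembly lines whose operands reference the given d register."""
    hits = []
    for line in objdump_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 4:
            continue                 # no operand field on this line
        if d_index in d_registers_touched(" ".join(fields[3:])):
            hits.append(line)
    return hits
```

Feeding it the output of `objdump -d` for the kernel image and each .ko should list every instruction that can read or write d8 under any of its names.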
--
Ard.
>> a plausible answer (which i am making up out of the blue) would be:
>
> If you want to continue asking questions and getting answers, change
> your attitude; I am not a child.
>
> You should also consider *not* writing essays, but instead ask clear,
> direct and to the point questions - in other words, short emails. Not
> everyone has the time or the patience to read huge long emails, or
> huge rambling threads of 50+ pages on web forums.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 8:53 ` Ard Biesheuvel
@ 2014-10-08 9:22 ` Ard Biesheuvel
2014-10-08 9:55 ` Russell King - ARM Linux
2014-10-09 22:36 ` Lanchon
2014-10-09 22:20 ` Lanchon
1 sibling, 2 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-10-08 9:22 UTC (permalink / raw)
To: linux-arm-kernel
On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 8 October 2014 10:35, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
>> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>>> for instance, you say that if an ISR uses the FPU it would corrupt user
>>> FP state. fine, but it is not that simple. what if the FPU was disabled
>>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>>> time-slice.)
>>
>> At that point, it depends on which kernel version you are using. Yes,
>> older kernels will just restore the state. Newer kernels will trap this
>> and complain.
>>
>
> Indeed. As part of the kernel mode NEON support (which landed in 3.12
> I think?), the VFP trap handling now checks whether it occurred in
> kernel mode or user mode.
> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>
> """
> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
> and r3, r3, #MODE_MASK @ are supported in kernel mode
> teq r3, #USR_MODE
> bne vfp_kmode_exception @ Returns through lr
> """
>
> Without these lines, the lazy restore machinery may kick in during the
> execution of an ISR that uses NEON registers inadvertently, and
> overwrite your VFP state with that of the process that happens to be
> active when the interrupt is taken.
>
Ehm ... maybe this is not entirely true. In order for the userland VFP
state of some process to be clobbered, an ISR being executed while
another process is active (which itself may not use the VFP at all)
would not be sufficient, as it would be /that/ process's VFP state
getting clobbered. So it is more likely (if you suspect the kernel)
that the register is getting clobbered while the storage process has
already 'unlocked' the VFP by accessing it from userland, which seems
to be in agreement with your scenario of a syscall being performed,
i.e., if no task switch occurs, the VFP would be unlocked during the
execution of that syscall.
So the question is, where does the VFP register write come from? Are
there any out of tree modules in use, and if so, can you verify the
CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
may result in GCC emitting NEON instructions when it detects loops it
can vectorize.
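One way to audit the CFLAGS of out-of-tree modules is to work out the effective float ABI of each compile command and flag everything that is not soft-float. The helper below is only a sketch under assumptions: it takes it that the last float-ABI option on the command line wins and that -msoft-float / -mhard-float are shorthands for the corresponding -mfloat-abi= values. Both function names are hypothetical.

```python
import shlex


def effective_float_abi(cflags):
    """Return 'soft', 'softfp', 'hard', or None if no option is given."""
    abi = None
    for token in shlex.split(cflags):
        if token.startswith("-mfloat-abi="):
            abi = token.split("=", 1)[1]
        elif token == "-msoft-float":     # assumed alias for soft
            abi = "soft"
        elif token == "-mhard-float":     # assumed alias for hard
            abi = "hard"
    return abi


def may_emit_fp_instructions(cflags):
    """Treat anything other than an explicit soft-float ABI as suspect."""
    return effective_float_abi(cflags) != "soft"
```

A command line with no float-ABI option at all is reported as suspect because the toolchain default then applies, and that default is not visible from the flags alone.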
--
Ard.
> You should also be aware that q4 is an alias of d8-d9, so grep'ing
> your objdump for d8 is not sufficient.
>
> --
> Ard.
>
>
>>> a plausible answer (which i am making up out of the blue) would be:
>>
>> If you want to continue asking questions and getting answers, change
>> your attitude; I am not a child.
>>
>> You should also consider *not* writing essays, but instead ask clear,
>> direct and to the point questions - in other words, short emails. Not
>> everyone has the time or the patience to read huge long emails, or
>> huge rambling threads of 50+ pages on web forums.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 9:22 ` Ard Biesheuvel
@ 2014-10-08 9:55 ` Russell King - ARM Linux
2014-10-08 10:32 ` Ard Biesheuvel
2014-10-09 22:36 ` Lanchon
1 sibling, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-08 9:55 UTC (permalink / raw)
To: linux-arm-kernel
Ard,
Note that you sent this message To: me, therefore you are addressing
me in your message when you use "you".
On Wed, Oct 08, 2014 at 11:22:32AM +0200, Ard Biesheuvel wrote:
> On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > On 8 October 2014 10:35, Russell King - ARM Linux
> > <linux@arm.linux.org.uk> wrote:
> >> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
> >>> for instance, you say that if an ISR uses the FPU it would corrupt user
> >>> FP state. fine, but it is not that simple. what if the FPU was disabled
> >>> at the time of interrupt? (ie: lazy restore did not yet happen in this
> >>> time-slice.)
> >>
> >> At that point, it depends on which kernel version you are using. Yes,
> >> older kernels will just restore the state. Newer kernels will trap this
> >> and complain.
> >>
> >
> > Indeed. As part of the kernel mode NEON support (which landed in 3.12
> > I think?), the VFP trap handling now checks whether it occurred in
> > kernel mode or user mode.
> > Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
> >
> > """
> > ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
> > and r3, r3, #MODE_MASK @ are supported in kernel mode
> > teq r3, #USR_MODE
> > bne vfp_kmode_exception @ Returns through lr
> > """
> >
> > Without these lines, the lazy restore machinery may kick in during the
> > execution of an ISR that uses NEON registers inadvertently, and
> > overwrite your VFP state with that of the process that happens to be
> > active when the interrupt is taken.
> >
>
> Ehm ... maybe this is not entirely true.
It is true if VFP access was disabled, which is the scenario which
Lanchon was asking about. The answer to that is "it depends on the
kernel version", i.e. on whether it includes your patch (part of the
kernel mode neon support) which traps these.
Sure, if VFP access was enabled, then it is as I have already explained
several times - kernel mode VFP usage will change the VFP state, and
lead to userspace VFP state corruption.
Given the reported scenario, it depends when the VFP access is happening.
If it's happening before any scheduling, then VFP access will still be
enabled and corruption will be silent. If it happens after a scheduling
event, then VFP access will have been disabled, and we should get a trap
into the VFP support code.
It should be noted that if it happens in a separate kernel thread, it
will be independent of userland as it will have its own private and
independent VFP state (but that doesn't mean we permit it.)
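The cases above (access enabled or disabled, older or newer kernel) can be summarised in a small decision model. It is a simplified sketch, not a description of the real code paths: one register stands in for the whole VFP state, `stray_op` represents whatever the offending kernel code does to it, and all names are invented for illustration.

```python
class KernelModeVfpTrap(Exception):
    """Stands in for a newer kernel complaining about kernel-mode VFP use."""


def kernel_touches_d8(vfp_enabled, hw_d8, saved_user_d8, stray_op,
                      kernel_checks_mode):
    """Return the d8 value the user thread will observe afterwards."""
    if not vfp_enabled:
        # Access was disabled by a scheduling event, so the instruction traps.
        if kernel_checks_mode:
            raise KernelModeVfpTrap("VFP instruction executed in kernel mode")
        hw_d8 = saved_user_d8    # older kernel: lazy restore runs silently
    return stray_op(hw_d8)       # the stray instruction then modifies d8
```

Only the combination of disabled access and a mode-checking kernel produces a visible complaint. Every other path ends with userspace seeing a changed register.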
> So the question is, where does the VFP register write come from? Are
> there any out of tree modules in use, and if so, can you verify the
> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
> may result in GCC emitting NEON instructions when it detects loops it
> can vectorize.
I can't :) I assume you mean Lanchon... please address your messages a
bit better!
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 9:55 ` Russell King - ARM Linux
@ 2014-10-08 10:32 ` Ard Biesheuvel
0 siblings, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-10-08 10:32 UTC (permalink / raw)
To: linux-arm-kernel
On 8 October 2014 11:55, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> Ard,
>
> Note that you sent this message To: me, therefore you are addressing
> me in your message when you use "you".
>
My apologies. The Gmail interface obfuscates the To and Cc fields a bit.
I was indeed addressing Lanchon.
--
Ard.
> On Wed, Oct 08, 2014 at 11:22:32AM +0200, Ard Biesheuvel wrote:
>> On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> > On 8 October 2014 10:35, Russell King - ARM Linux
>> > <linux@arm.linux.org.uk> wrote:
>> >> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>> >>> for instance, you say that if an ISR uses the FPU it would corrupt user
>> >>> FP state. fine, but it is not that simple. what if the FPU was disabled
>> >>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>> >>> time-slice.)
>> >>
>> >> At that point, it depends on which kernel version you are using. Yes,
>> >> older kernels will just restore the state. Newer kernels will trap this
>> >> and complain.
>> >>
>> >
>> > Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> > I think?), the VFP trap handling now checks whether it occurred in
>> > kernel mode or user mode.
>> > Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>> >
>> > """
>> > ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> > and r3, r3, #MODE_MASK @ are supported in kernel mode
>> > teq r3, #USR_MODE
>> > bne vfp_kmode_exception @ Returns through lr
>> > """
>> >
>> > Without these lines, the lazy restore machinery may kick in during the
>> > execution of an ISR that uses NEON registers inadvertently, and
>> > overwrite your VFP state with that of the process that happens to be
>> > active when the interrupt is taken.
>> >
>>
>> Ehm ... maybe this is not entirely true.
>
> It is true if VFP access was disabled, which is the scenario which
> Lanchon was asking about. The answer to that is "it depends on the
> kernel version", and whether it has your patch as part of the kernel
> mode neon in place which traps these.
>
> Sure, if VFP access was enabled, then it is as I have already explained
> several times - kernel mode VFP usage will change the VFP state, and
> lead to userspace VFP state corruption.
>
> Given the reported scenario, it depends when the VFP access is happening.
> If it's happening before any scheduling, then VFP access will still be
> enabled and corruption will be silent. If it happens after a scheduling
> event, then VFP access will have been disabled, and we should get a trap
> into the VFP support code.
>
> It should be noted that if it happens in a separate kernel thread, it
> will be independent of userland as it will have its own private and
> independent VFP state (but that doesn't mean we permit it.)
>
>> So the question is, where does the VFP register write come from? Are
>> there any out of tree modules in use, and if so, can you verify the
>> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
>> may result in GCC emitting NEON instructions when it detects loops it
>> can vectorize.
>
> I can't :) I assume you mean Lanchon... please address your messages a
> bit better!
>
> --
> FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
> according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 8:53 ` Ard Biesheuvel
2014-10-08 9:22 ` Ard Biesheuvel
@ 2014-10-09 22:20 ` Lanchon
2014-10-09 22:32 ` Russell King - ARM Linux
1 sibling, 1 reply; 20+ messages in thread
From: Lanchon @ 2014-10-09 22:20 UTC (permalink / raw)
To: linux-arm-kernel
On 10/08/2014 05:53 AM, Ard Biesheuvel wrote:
> On 8 October 2014 10:35, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
>> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>>> for instance, you say that if an ISR uses the FPU it would corrupt user
>>> FP state. fine, but it is not that simple. what if the FPU was disabled
>>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>>> time-slice.)
>> At that point, it depends on which kernel version you are using. Yes,
>> older kernels will just restore the state. Newer kernels will trap this
>> and complain.
>>
> Indeed. As part of the kernel mode NEON support (which landed in 3.12
> I think?), the VFP trap handling now checks whether it occurred in
> kernel mode or user mode.
> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>
> """
> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
> and r3, r3, #MODE_MASK @ are supported in kernel mode
> teq r3, #USR_MODE
> bne vfp_kmode_exception @ Returns through lr
> """
>
> Without these lines, the lazy restore machinery may kick in during the
> execution of an ISR that uses NEON registers inadvertently, and
> overwrite your VFP state with that of the process that happens to be
> active when the interrupt is taken.
thank you for this! just one question. i suppose the 'kernel mode' test
used here will be positive if the trap happens while executing a kernel
thread. it should also be positive if the trap happens while executing
an ISR that interrupted a kernel thread. but what if the trap happens
while executing an ISR that interrupted userland? would this 'kernel
mode' test also be positive?
the 'official' kernels (like the one i linked in my first message) do
not have this feature. but i found this commit in Dorimanx, which is a
fairly widely used alternative kernel:
https://github.com/dorimanx/Dorimanx-SG2-I9100-Kernel/commit/d4f9e67b9395d5f0d7ce2a836f7c9b6edbae0fa0
i will have people retest with this kernel. but AFAIK, people do not
report panics or reboots with Dorimanx. and it is unreasonable to
believe a priori, of course, that every single time an ISR or kernel
thread is about to corrupt FP registers, the FPU just happened to be
enabled. so the fact that this fail-fast mechanism does not trigger
points to the code being ok, and to this being more of a hardware issue.
>
> You should also be aware that q4 is an alias of d8-d9, so grep'ing
> your objdump for d8 is not sufficient.
>
thanks! i have objdumped the kernel and *.ko files again and found no
'qNN' registers mentioned either.
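
for reference, the aliasing rule can be written down as a tiny check. this is a sketch in C with hypothetical helper names, assuming the usual VFPv3/NEON register file layout (q<n> overlaps d<2n> and d<2n+1>; s<2n> and s<2n+1> overlap d<n> for d0-d15):

```c
#include <stdbool.h>

/* True if quad register q<q> overlaps double register d<d>. */
bool q_aliases_d(unsigned q, unsigned d)
{
    return q < 16 && d < 32 && d / 2 == q;
}

/* True if single register s<s> overlaps double register d<d>.
 * Only d0..d15 have single-precision aliases. */
bool s_aliases_d(unsigned s, unsigned d)
{
    return s < 32 && d < 16 && s / 2 == d;
}
```

so a search for writes to d8 would also have to look for q4, s16 and s17.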
---
there is a new piece of information:
the FP corruption seems to only happen in these android devices if the
display is off. the charger may be connected or not, but if the display
is on, the corruption won't happen.
i wonder if the kernel could be turning off the FPU and then back on
without saving the FPU state. i would think corruption would be seen
more often then.
maybe it is restoring state before voltage to the FPU has stabilized.
this could be easily checked by instrumenting the state restore with a
check. but that sounds unreasonable: the delay implied by the lazy restore
mechanism should hide the effects of this 'race condition' of sorts.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-09 22:20 ` Lanchon
@ 2014-10-09 22:32 ` Russell King - ARM Linux
2014-10-10 9:45 ` Arnd Bergmann
0 siblings, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-09 22:32 UTC (permalink / raw)
To: linux-arm-kernel
On Thu, Oct 09, 2014 at 07:20:14PM -0300, Lanchon wrote:
>
> On 10/08/2014 05:53 AM, Ard Biesheuvel wrote:
>> On 8 October 2014 10:35, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>>> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>>>> for instance, you say that if an ISR uses the FPU it would corrupt user
>>>> FP state. fine, but it is not that simple. what if the FPU was disabled
>>>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>>>> time-slice.)
>>> At that point, it depends on which kernel version you are using. Yes,
>>> older kernels will just restore the state. Newer kernels will trap this
>>> and complain.
>>>
>> Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> I think?), the VFP trap handling now checks whether it occurred in
>> kernel mode or user mode.
>> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>>
>> """
>> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> and r3, r3, #MODE_MASK @ are supported in kernel mode
>> teq r3, #USR_MODE
>> bne vfp_kmode_exception @ Returns through lr
>> """
>>
>> Without these lines, the lazy restore machinery may kick in during the
>> execution of an ISR that uses NEON registers inadvertently, and
>> overwrite your VFP state with that of the process that happens to be
>> active when the interrupt is taken.
>
> thank you for this! just one question. i suppose the 'kernel mode' test
> used here will be positive if the trap happens while executing a kernel
> thread. it should also be positive if the trap happens while executing
> an ISR that interrupted a kernel thread. but what if the trap happens
> while executing an ISR that interrupted userland? would this 'kernel
> mode' test also be positive?
Yes.
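
To illustrate why: the quoted assembly tests the PSR saved at the time of the VFP trap, which describes the code that executed the trapping instruction, not whatever the ISR originally interrupted. A sketch of the same test in C follows. The function name is hypothetical; the mode constants are the standard ARM CPSR mode encodings. It assumes the handler body runs in SVC mode, so the saved PSR carries SVC mode even when the interrupt was taken from userland.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODE_MASK 0x1fu  /* CPSR mode field */
#define USR_MODE  0x10u
#define IRQ_MODE  0x12u
#define SVC_MODE  0x13u

/* Mirrors the ldr/and/teq/bne sequence: anything other than USR mode in the
 * saved PSR is treated as a kernel-mode VFP trap. */
bool trap_taken_in_kernel_mode(uint32_t saved_psr)
{
    return (saved_psr & MODE_MASK) != USR_MODE;
}
```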
> there is a new piece of information:
> the FP corruption seems to only happen in these android devices if the
> display is off. the charger may be connected or not, but if the display
> is on, the corruption won't happen.
>
> i wonder if the kernel could be turning off the FPU and then back on
> without saving the FPU state. i would think corruption would be seen
> more often then.
No. We don't "turn off" the VFP. We disable and enable access to VFP
via the coprocessor access register. If the VFP access is disabled and
then re-enabled, all state is preserved.
The only time which state would be lost is if (eg) we hot-unplug the
entire CPU, but that first requires a context switch which implies that
the state will already be saved.
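
As a toy illustration of the distinction between gating access and removing power (a model only, with hypothetical names; it does not touch any real register):

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_DREGS 32

/* Toy register file: an access gate plus the register contents. */
struct toy_vfp {
    bool access_enabled;
    uint64_t d[TOY_NUM_DREGS];
};

/* Changing the gate never touches d[], so contents survive off/on cycles. */
void toy_vfp_set_access(struct toy_vfp *v, bool enabled)
{
    v->access_enabled = enabled;
}

/* Returns false where the real hardware would trap instead. */
bool toy_vfp_write(struct toy_vfp *v, unsigned reg, uint64_t val)
{
    if (!v->access_enabled || reg >= TOY_NUM_DREGS)
        return false;
    v->d[reg] = val;
    return true;
}

/* Returns false where the real hardware would trap instead. */
bool toy_vfp_read(const struct toy_vfp *v, unsigned reg, uint64_t *out)
{
    if (!v->access_enabled || reg >= TOY_NUM_DREGS)
        return false;
    *out = v->d[reg];
    return true;
}
```

Disabling and re-enabling access in this model leaves d8 exactly as it was; only an event that wipes d[] itself would lose state.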
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-08 9:22 ` Ard Biesheuvel
2014-10-08 9:55 ` Russell King - ARM Linux
@ 2014-10-09 22:36 ` Lanchon
1 sibling, 0 replies; 20+ messages in thread
From: Lanchon @ 2014-10-09 22:36 UTC (permalink / raw)
To: linux-arm-kernel
On 10/08/2014 06:22 AM, Ard Biesheuvel wrote:
> On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>
>> Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> I think?), the VFP trap handling now checks whether it occurred in
>> kernel mode or user mode.
>> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>>
>> """
>> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> and r3, r3, #MODE_MASK @ are supported in kernel mode
>> teq r3, #USR_MODE
>> bne vfp_kmode_exception @ Returns through lr
>> """
>>
>> Without these lines, the lazy restore machinery may kick in during the
>> execution of an ISR that uses NEON registers inadvertently, and
>> overwrite your VFP state with that of the process that happens to be
>> active when the interrupt is taken.
>>
> Ehm ... maybe this is not entirely true. In order for the userland VFP
> state of some process to be clobbered, an ISR being executed while
> another process is active (which itself may not use the VFP at all)
> would not be sufficient, as it would be /that/ process's VFP state
> getting clobbered. So it is more likely (if you suspect the kernel)
> that the register is getting clobbered while the storage process has
> already 'unlocked' the VFP by accessing it from userland, which seems
> to be in agreement with your scenario of a syscall being performed,
> i.e., if no task switch occurs, the VFP would be unlocked during the
> execution of that syscall.
absolutely. this is the kind of fine detail i was asking about in my
first message. yes, 99% of the pread64s in question would happen with
the FPU enabled. this is at the heart of question 1) that i originally
asked, for which i still didn't get a straight answer:
1) What is the FPU enable state while executing a kernel thread in ARM
arch? Which of these answers is correct?
1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending
on the FPU enable state of the userland context that executed before
and/or some other factors.
(maybe i should have used 'kernel mode' instead.)
you are clearly assuming 1b) in your text. (maybe because you know 1a)
to be false or maybe because you don't have the information.)
>
> So the question is, where does the VFP register write come from? Are
> there any out of tree modules in use, and if so, can you verify the
> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
> may result in GCC emitting NEON instructions when it detects loops it
> can vectorize.
>
the flags are ok and the kernel works fine on other SoCs. there are
several KOs but i can't find FPU instructions in them.
-mfloat-abi=softfp lets GCC use the FPU at its leisure. -mfloat-abi=soft
is used everywhere.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-09 22:32 ` Russell King - ARM Linux
@ 2014-10-10 9:45 ` Arnd Bergmann
2014-10-10 10:01 ` Russell King - ARM Linux
0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2014-10-10 9:45 UTC (permalink / raw)
To: linux-arm-kernel
On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
> > there is a new piece of information:
> > the FP corruption seems to only happen in these android devices if the
> > display is off. the charger may be connected or not, but if the display
> > is on, the corruption won't happen.
> >
> > i wonder if the kernel could be turning off the FPU and then back on
> > without saving the FPU state. i would think corruption would be seen
> > more often then.
>
> No. We don't "turn off" the VFP. We disable and enable access to VFP
> via the coprocessor access register. If the VFP access is disabled and
> then re-enabled, all state is preserved.
>
> The only time which state would be lost is if (eg) we hot-unplug the
> entire CPU, but that first requires a context switch which implies that
> the state will already be saved.
Could the problem be caused by a bug in the exynos CPU suspend/resume
path then? E.g. if we go to sleep with VFP access disabled but it
comes back with VFP access enabled (or vice versa) that could lead
to the wrong register state being seen by the user space application.
Arnd
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-10 9:45 ` Arnd Bergmann
@ 2014-10-10 10:01 ` Russell King - ARM Linux
2014-12-22 22:46 ` Lanchon
0 siblings, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-10 10:01 UTC (permalink / raw)
To: linux-arm-kernel
On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
> > > there is a new piece of information:
> > > the FP corruption seems to only happen in these android devices if the
> > > display is off. the charger may be connected or not, but if the display
> > > is on, the corruption won't happen.
> > >
> > > i wonder if the kernel could be turning off the FPU and then back on
> > > without saving the FPU state. i would think corruption would be seen
> > > more often then.
> >
> > No. We don't "turn off" the VFP. We disable and enable access to VFP
> > via the coprocessor access register. If the VFP access is disabled and
> > then re-enabled, all state is preserved.
> >
> > The only time which state would be lost is if (eg) we hot-unplug the
> > entire CPU, but that first requires a context switch which implies that
> > the state will already be saved.
>
> Could the problem be caused by a bug in the exynos CPU suspend/resume
> path then? E.g. if we go to sleep with VFP access disabled but it
> comes back with VFP access enabled (or vice versa) that could lead
> to the wrong register state being seen by the user space application.
Well, an interesting test would be to save out the entire VFP state
both before and after the pread64 call, and then inspect that to
determine whether it is a single register or multiple registers
which are being corrupted.
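
A sketch of the comparison step, in C. It assumes the two snapshots have already been captured into arrays of 32 doubleword registers (the capture itself needs architecture-specific code and is not shown); the names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define VFP_NUM_DREGS 32

/* Under the AAPCS VFP variant, d8-d15 are callee-saved, so a change in any
 * of them across a call is unexpected from the caller's point of view. */
#define VFP_CALLEE_SAVED_MASK 0x0000ff00u

/* Returns a bitmask with bit i set when d<i> differs between snapshots. */
uint32_t vfp_snapshot_diff(const uint64_t before[VFP_NUM_DREGS],
                           const uint64_t after[VFP_NUM_DREGS])
{
    uint32_t changed = 0;

    for (size_t i = 0; i < VFP_NUM_DREGS; i++) {
        if (before[i] != after[i])
            changed |= (uint32_t)1 << i;
    }
    return changed;
}
```

A result with a single bit set would point at one register being hit; many bits set would suggest that the whole register file was replaced.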
However, looking at the mainline code, we do the right thing with the
CPU PM infrastructure, and that is called appropriately by the exynos
CPU idle driver.
So, another possible test for Lanchon would be to see whether disabling
CPU idle support fixes the problem.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-10-10 10:01 ` Russell King - ARM Linux
@ 2014-12-22 22:46 ` Lanchon
2014-12-22 23:29 ` Russell King - ARM Linux
2014-12-23 8:45 ` Ard Biesheuvel
0 siblings, 2 replies; 20+ messages in thread
From: Lanchon @ 2014-12-22 22:46 UTC (permalink / raw)
To: linux-arm-kernel
On 10/10/2014 07:01 AM, Russell King - ARM Linux wrote:
> On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
>> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
>>>> there is a new piece of information:
>>>> the FP corruption seems to only happen in these android devices if the
>>>> display is off. the charger may be connected or not, but if the display
>>>> is on, the corruption won't happen.
>>>>
>>>> i wonder if the kernel could be turning off the FPU and then back on
>>>> without saving the FPU state. i would think corruption would be seen
>>>> more often then.
>>> No. We don't "turn off" the VFP. We disable and enable access to VFP
>>> via the coprocessor access register. If the VFP access is disabled and
>>> then re-enabled, all state is preserved.
>>>
>>> The only time which state would be lost is if (eg) we hot-unplug the
>>> entire CPU, but that first requires a context switch which implies that
>>> the state will already be saved.
>> Could the problem be caused by a bug in the exynos CPU suspend/resume
>> path then? E.g. if we go to sleep with VFP access disabled but it
>> comes back with VFP access enabled (or vice versa) that could lead
>> to the wrong register state being seen by the user space application.
> Well, an interesting test would be to save out the entire VFP state
> both before and after the pread64 call, and then inspect that to
> determine whether it is a single register or multiple registers
> which are being corrupted.
>
> However, looking at the mainline code, we do the right thing with the
> CPU PM infrastructure, and that is called appropriately by the exynos
> CPU idle driver.
>
> So, another possible test for Lanchon would be to see whether disabling
> CPU idle support fixes the problem.
>
hi again! thank you all for your help. i sort of disappeared, i'm very
sorry about that.
i never mentioned it here, but the fact was that i didn't have a device
to test on. so all i could do was post test code and ask users for their
help. at some point no one was helping; i waited for test results but
they never happened, so i got frustrated and abandoned the project.
but recently interest built up again and we were able to progress and
finally fix this, so i'm writing to let you know how it turned out.
so remember there was random userland VFP register corruption. the VFP
state was not being corrupted in the registers nor in the saved state in
ram. what happened was: the kernel tracks the leftover state in the VFP
once the eager state save is done. in the lazy restore trap, the kernel
optimizes away the state load and instead only enables the VFP if it can
prove that the leftover state in the VFP hardware matches the process
state saved in ram.
however under some circumstances the kernel did the wrong thing: it
didn't reload the registers even though it was needed, probably because
the hardware had been powered down and had lost state without the
tracking code getting word of it. just disabling the optimization made
the kernel solid.
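
to make the failure mode concrete, here is a toy model in C. it is a sketch of the mechanism as i described it, not the kernel's code: all names are hypothetical, and the 'notify' flag stands in for whether the tracking code gets word of the power-down.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NREGS 32

struct task_vfp {
    uint64_t saved[NREGS];            /* process state saved in ram */
};

struct toy_cpu {
    uint64_t hw[NREGS];               /* live hardware registers */
    const struct task_vfp *hw_owner;  /* whose saved state hw[] is believed to match */
};

/* eager save: copy the registers out and remember what is left in hardware. */
void toy_save(struct toy_cpu *c, struct task_vfp *t)
{
    memcpy(t->saved, c->hw, sizeof c->hw);
    c->hw_owner = t;
}

/* lazy restore trap: skip the reload if the leftover state is believed to
 * match. returns true if a reload was actually performed. */
bool toy_lazy_restore(struct toy_cpu *c, const struct task_vfp *t)
{
    if (c->hw_owner == t)
        return false;
    memcpy(c->hw, t->saved, sizeof c->hw);
    c->hw_owner = t;
    return true;
}

/* power-down: the hardware loses its contents. with notify == false the
 * tracking pointer goes stale, which is the bug. */
void toy_power_cycle(struct toy_cpu *c, bool notify)
{
    memset(c->hw, 0xff, sizeof c->hw);
    if (notify)
        c->hw_owner = NULL;
}
```

with notify == false, the next lazy restore skips the reload and the process resumes with garbage in its registers; with notify == true the reload happens and the saved values come back.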
a couple of days later the root cause seems to have been identified and
fixed. i describe the whole thing here:
http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
once again, thank you for all your help.
kind regards
Lanchon
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-12-22 22:46 ` Lanchon
@ 2014-12-22 23:29 ` Russell King - ARM Linux
2014-12-22 23:42 ` Lanchon
2014-12-23 8:45 ` Ard Biesheuvel
1 sibling, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-12-22 23:29 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
> however under some circumstances the kernel did the wrong thing: it didn't
> reload the registers even though it was needed, probably because the
> hardware had been powered down and had lost state without the tracking code
> getting word of it. just disabling the optimization made the kernel solid.
Right, so mainline kernels don't exhibit the behaviour...
> a couple of days later the root cause seems to have been identified and
> fixed. i describe the whole thing here:
> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
... because it's a local issue with cpuidle not calling the appropriate
CPU PM functions, and that means there's no patches that we need to deal
with for mainline kernels, right?
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-12-22 23:29 ` Russell King - ARM Linux
@ 2014-12-22 23:42 ` Lanchon
2014-12-22 23:50 ` Russell King - ARM Linux
0 siblings, 1 reply; 20+ messages in thread
From: Lanchon @ 2014-12-22 23:42 UTC (permalink / raw)
To: linux-arm-kernel
On 12/22/2014 08:29 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
>> however under some circumstances the kernel did the wrong thing: it didn't
>> reload the registers even though it was needed, probably because the
>> hardware had been powered down and had lost state without the tracking code
>> getting word of it. just disabling the optimization made the kernel solid.
> Right, so mainline kernels don't exhibit the behaviour...
>
>> a couple of days later the root cause seems to have been identified and
>> fixed. i describe the whole thing here:
>> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
> ... because it's a local issue with cpuidle not calling the appropriate
> CPU PM functions, and that means there's no patches that we need to deal
> with for mainline kernels, right?
>
that's what i think, yes. i only got back here on the list to thank you
and let you know what was wrong since you helped me a couple of months ago.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-12-22 23:42 ` Lanchon
@ 2014-12-22 23:50 ` Russell King - ARM Linux
0 siblings, 0 replies; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-12-22 23:50 UTC (permalink / raw)
To: linux-arm-kernel
On Mon, Dec 22, 2014 at 08:42:02PM -0300, Lanchon wrote:
>
> On 12/22/2014 08:29 PM, Russell King - ARM Linux wrote:
> >On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
> >>however under some circumstances the kernel did the wrong thing: it didn't
> >>reload the registers even though it was needed, probably because the
> >>hardware had been powered down and had lost state without the tracking code
> >>getting word of it. just disabling the optimization made the kernel solid.
> >Right, so mainline kernels don't exhibit the behaviour...
> >
> >>a couple of days later the root cause seems to have been identified and
> >>fixed. i describe the whole thing here:
> >>http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
> >... because it's a local issue with cpuidle not calling the appropriate
> >CPU PM functions, and that means there's no patches that we need to deal
> >with for mainline kernels, right?
> >
>
> that's what i think, yes. i only got back here on the list to thank you and
> let you know what was wrong since you helped me a couple of months ago.
Please let us know if anything changes, thanks.
And have a Merry Christmas.
--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
2014-12-22 22:46 ` Lanchon
2014-12-22 23:29 ` Russell King - ARM Linux
@ 2014-12-23 8:45 ` Ard Biesheuvel
1 sibling, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-12-23 8:45 UTC (permalink / raw)
To: linux-arm-kernel
On 22 December 2014 at 22:46, Lanchon <lanchon@gmail.com> wrote:
>
> On 10/10/2014 07:01 AM, Russell King - ARM Linux wrote:
>>
>> On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
>>>
>>> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
>>>>>
>>>>> there is a new piece of information:
>>>>> the FP corruption seems to only happen in these android devices if the
>>>>> display is off. the charger may be connected or not, but if the display
>>>>> is on, the corruption won't happen.
>>>>>
>>>>> i wonder if the kernel could be turning off the FPU and then back on
>>>>> without saving the FPU state. i would think corruption would be seen
>>>>> more often then.
>>>>
>>>> No. We don't "turn off" the VFP. We disable and enable access to VFP
>>>> via the coprocessor access register. If the VFP access is disabled and
>>>> then re-enabled, all state is preserved.
>>>>
>>>> The only time which state would be lost is if (eg) we hot-unplug the
>>>> entire CPU, but that first requires a context switch which implies that
>>>> the state will already be saved.
>>>
>>> Could the problem be caused by a bug in the exynos CPU suspend/resume
>>> path then? E.g. if we go to sleep with VFP access disabled but it
>>> comes back with VFP access enabled (or vice versa) that could lead
>>> to the wrong register state being seen by the user space application.
>>
>> Well, an interesting test would be to save out the entire VFP state
>> both before and after the pread64 call, and then inspect that to
>> determine whether it is a single register or multiple registers
>> which are being corrupted.
>>
>> However, looking at the mainline code, we do the right thing with the
>> CPU PM infrastructure, and that is called appropriately by the exynos
>> CPU idle driver.
>>
>> So, another possible test for Lanchon would be to see whether disabling
>> CPU idle support fixes the problem.
>>
>
> hi again! thank you all for your help. i sort of disappeared, i'm very sorry
> about that.
>
> i never mentioned it here, but the fact was that i didn't have a device to
> test on. so all i could do was post test code and ask users for their help.
> at some point no one was helping; i waited for test results but they never
> happened, so i got frustrated and abandoned the project.
>
> but recently interest built up again and we were able to progress and
> finally fix this, so i'm writing to let you know how it turned out.
>
> so remember there was random userland VFP register corruption. the VFP state
> was not being corrupted in the registers nor in the saved state in ram. what
> happened was: the kernel tracks the leftover state in the VFP once the eager
> state save is done. in the lazy restore trap, the kernel optimizes away the
> state load and instead only enables the VFP if it can prove that the
> leftover state in the VFP hardware matches the process state saved in ram.
>
> however under some circumstances the kernel did the wrong thing: it didn't
> reload the registers even though it was needed, probably because the
> hardware had been powered down and had lost state without the tracking code
> getting word of it. just disabling the optimization made the kernel solid.
>
> a couple of days later the root cause seems to have been identified and
> fixed. i describe the whole thing here:
> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
>
> once again, thank you for all your help.
>
Nice work! Seems like quite an adventure you guys had there.
--
Ard.
Thread overview: 20+ messages
2014-10-07 21:48 FP register corruption in Exynos 4210 (Cortex-A9) Lanchon
2014-10-07 22:15 ` Russell King - ARM Linux
2014-10-08 7:58 ` Lanchon
2014-10-08 8:19 ` Lanchon
2014-10-08 8:27 ` Russell King - ARM Linux
2014-10-08 8:35 ` Russell King - ARM Linux
2014-10-08 8:53 ` Ard Biesheuvel
2014-10-08 9:22 ` Ard Biesheuvel
2014-10-08 9:55 ` Russell King - ARM Linux
2014-10-08 10:32 ` Ard Biesheuvel
2014-10-09 22:36 ` Lanchon
2014-10-09 22:20 ` Lanchon
2014-10-09 22:32 ` Russell King - ARM Linux
2014-10-10 9:45 ` Arnd Bergmann
2014-10-10 10:01 ` Russell King - ARM Linux
2014-12-22 22:46 ` Lanchon
2014-12-22 23:29 ` Russell King - ARM Linux
2014-12-22 23:42 ` Lanchon
2014-12-22 23:50 ` Russell King - ARM Linux
2014-12-23 8:45 ` Ard Biesheuvel