* FP register corruption in Exynos 4210 (Cortex-A9)
From: Lanchon @ 2014-10-07 21:48 UTC
To: linux-arm-kernel

Hi,

There is a longstanding bug in all the after-market kernels (and maybe the manufacturer's kernels too) for all the Exynos 4210 (Cortex-A9)-based devices. These include:

Samsung Galaxy S II
Samsung Galaxy Note
Samsung Galaxy Tab 7.0 Plus
...and others.

Under rare conditions which are not easy to reproduce, floating point registers of userland processes get clobbered.

There is a vital FUSE process in Android 4.4 (called 'sdcard.c') that mediates access to internal phone storage as an emulated sdcard, and to external sdcards too. This process, normally compiled using -mfloat-abi=softfp, calls pread64() after saving the value of a 64-bit integer variable (called 'unique') in an FPU register (d8). On very rare occasions, upon return from pread64() the value of the FP register is corrupted; as a result the process stops responding and the device loses access to storage. There are other instabilities in the platform suspected of having the same cause. This bug has plagued the platform for years, but only recently was FP clobbering identified as the culprit.

More context:

This only happens on 4210-based devices. The same kernel tree compiled for 4212- and 4412-based devices does not exhibit the behavior. (The 4x12 SoCs are a newer iteration of the 4210, with the 'x' corresponding to the number of cores. See: http://en.wikipedia.org/wiki/Exynos#List_of_Exynos_SoCs ) This points to a hardware issue, maybe an erratum workaround missing from the kernel, or to a driver issue.

Simply busy-spinning in userland waiting for FP corruption does not seem to trigger the issue. Concurrently accessing storage in another process while spinning does not trigger it either; power management (sleep, etc.) may be involved.

Compiling 'sdcard.c' using -mfloat-abi=soft solves the issue (for this vital process), since the 'unique' variable is then saved in a regular register instead of an FP register.

Objdumping the complete kernel does not show any instructions that access 'd' registers, except in the context switching code and in the code that implements the traps that old VFP units need to handle some corner cases. Objdumps of the *.ko files do not reveal any instructions that access 'd' registers either.

We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in these kernels; the code is just not there.

Some links:

One of the affected kernel trees:
https://github.com/CyanogenMod/android_kernel_samsung_smdk4412/tree/cm-11.0

First direct observation of corruption:
http://forum.xda-developers.com/showthread.php?p=51237856&highlight=unique

The 'sdcard.c' process:
http://forum.xda-developers.com/showthread.php?p=55787440

Post showing that 'unique' is saved in 'd8':
http://forum.xda-developers.com/showthread.php?p=55783884

A busy-spin FP corruption test (that fails to reproduce the bug):
http://forum.xda-developers.com/showthread.php?p=55861206

Objdumps:
http://forum.xda-developers.com/showthread.php?p=55839635

And finally some questions:

Kernel threads (such as the worker thread of a threaded interrupt) should guard FPU access in between 'kernel_XXX_begin'/'kernel_XXX_end' calls (which our kernels do not implement). But what if one did not?

1) What is the FPU enable state while executing a kernel thread in ARM arch? Which of these answers is correct?
1a) The FPU is always disabled in kernel threads.
1b) The FPU might be enabled or disabled in a kernel thread, depending on the FPU enable state of the userland context that executed before and/or some other factors.

2) What would happen if a kernel thread executed an FPU instruction without the 'kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and the FPU was disabled at the time?
2a) In the FPU trap the kernel would always detect the issue and panic or oops or something.
2b) In the FPU trap the kernel might enable the FPU, load the FPU context of some userland process and resume the kernel thread.

Of course an ISR should not touch the FPU at all. But what if it did?

3) What would happen if an ISR executed an FPU instruction in ARM arch and the FPU was disabled in the context that was interrupted?
3a) In the FPU trap the kernel would always detect the issue and panic or oops or something.
3b) In the FPU trap the kernel would react as if the interrupted context executed the FPU instruction: if the interrupted context was user mode, it would restore the userland process' FP context into the FPU; if the interrupted context was kernel mode, it would react as per the answer to question 2) above.

4) What would happen if an ISR executed an FPU instruction in ARM arch and the FPU was enabled in the context that was interrupted?
4a) The processor would disable the FPU on ISR entry automatically and thus the system would behave as described in the answer to question 3) above.
4b) If the driver uses the standard kernel interrupt dispatch architecture, the kernel would disable the FPU before dispatching the interrupt to the driver ISR, and so the system would also behave as described in 3).
4c) The FPU instruction would execute. There is no fail-fast or detection of this kind of violation by the kernel.

Of course every pointer, idea, or suspicion that might seem relevant to the case is welcome.

Thank you very much for reading and for your help.

Regards,
Lanchon

* FP register corruption in Exynos 4210 (Cortex-A9)
From: Russell King - ARM Linux @ 2014-10-07 22:15 UTC
To: linux-arm-kernel

On Tue, Oct 07, 2014 at 06:48:23PM -0300, Lanchon wrote:
> Simply busy-spinning in userland waiting for FP corruption does not seem
> to trigger the issue. Concurrently accessing storage in another process
> while spinning also does not work; power management (sleep, etc) may be
> involved.

You need two processes accessing VFP to cause VFP state to be saved and restored.

> We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in
> these kernels; the code is just not there.

Which means that the kernel itself must /never/ make use of floating point itself - if it does, it /will/ corrupt the user state in the way you are seeing. That's a pretty hard requirement, and something that we have enforced with mainline kernels by building the kernel in soft FP mode, thereby preventing the compiler emitting FP instructions. Hence, the only way to get VFP instructions in the kernel is via explicit assembly sequences.

The exception to this rule is the VFP support code itself, which maintains the VFP state on behalf of the hardware and userspace (and even then, that code is only concerned with reading and writing the VFP registers, not using FP itself.)

In SMP environments, VFP state is saved each time we context switch away from a thread. If we resume the thread on the _same_ CPU and no one else has used the VFP since, we just re-enable access to VFP. Otherwise, we re-load the VFP state from the previously saved state.

In UP environments, we do something similar, but we don't save until we need to.

However, neon shares the VFP registers, and we have some code (crypto stuff) which uses neon, and this has appropriate guards to ensure that userspace does not see any changes. This is only available when CONFIG_KERNEL_MODE_NEON is enabled (but as you say you don't have kernel_neon_begin anywhere, you should /never/ execute any neon instructions in the kernel.)

I hope this helps; I didn't answer your specific questions because it seemed I would just end up repeating what I've said above.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
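
The SMP save/restore policy described in this message can be sketched as a toy C model. This is not kernel code: the structures and the names `switch_away` and `vfp_trap` are hypothetical, the register file is shrunk to four entries, and the model assumes that the "just re-enable access" decision is taken lazily, on the first VFP instruction after the thread resumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_NREGS 4            /* toy register file; real hardware has 16 or 32 d registers */

/* Hypothetical per-thread save area for VFP registers. */
struct thread_vfp {
    uint64_t saved[MODEL_NREGS];
    int last_cpu;                /* CPU whose hardware last held this state, -1 if none */
};

/* Hypothetical model of one CPU's VFP hardware. */
struct cpu_vfp {
    int id;
    uint64_t d[MODEL_NREGS];     /* live registers, the same ones for kernel and user mode */
    bool enabled;                /* when false, any VFP instruction traps */
    struct thread_vfp *owner;    /* thread whose values are currently in d[] */
};

/* Context switch away from a thread: eager save (the SMP case), then disable access. */
static void switch_away(struct cpu_vfp *cpu, struct thread_vfp *t)
{
    assert(cpu != NULL && t != NULL);
    if (cpu->enabled && cpu->owner == t)
        memcpy(t->saved, cpu->d, sizeof(t->saved));
    cpu->enabled = false;
}

/*
 * First VFP instruction after a thread resumes: access is disabled, so it traps here.
 * Returns true if the registers had to be reloaded from the save area, false if the
 * hardware still held this thread's values and access was simply re-enabled.
 */
static bool vfp_trap(struct cpu_vfp *cpu, struct thread_vfp *t)
{
    bool reload;

    assert(cpu != NULL && t != NULL);
    reload = !(cpu->owner == t && t->last_cpu == cpu->id);
    if (reload) {
        memcpy(cpu->d, t->saved, sizeof(cpu->d));
        cpu->owner = t;
        t->last_cpu = cpu->id;
    }
    cpu->enabled = true;
    return reload;
}
```

Anything that writes `d[]` while `enabled` is true, without going through these two paths, ends up in the owner's save area at the next `switch_away`; in this model that is the only way user-visible state changes behind a thread's back.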
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-07 22:15 ` Russell King - ARM Linux @ 2014-10-08 7:58 ` Lanchon 2014-10-08 8:19 ` Lanchon 1 sibling, 0 replies; 20+ messages in thread From: Lanchon @ 2014-10-08 7:58 UTC (permalink / raw) To: linux-arm-kernel thank you for your answer, please see comments below. On 10/07/2014 07:15 PM, Russell King - ARM Linux wrote: > On Tue, Oct 07, 2014 at 06:48:23PM -0300, Lanchon wrote: >> Simply busy-spinning in userland waiting for FP corruption does not seem >> to trigger the issue. Concurrently accessing storage in another process >> while spinning also does not work; power management (sleep, etc) may be >> involved. > You need two processes accessing VFP to cause VFP state to be saved and > restored. yes. these are dual core systems so i used 4 simultaneous processes running the busy-spin. >> We do not have 'kernel_neon_begin' nor 'kernel_vfp_begin' support in >> these kernels; the code is just not there. > Which means that the kernel itself must /never/ make use of floating > point itself - if it does, it /will/ corrupt the user state in the way > you are seeing. That's a pretty hard requirement, and something that > we have enforced with mainline kernels by building the kernel in > soft FP mode, thereby preventing the compiler emitting FP instructions. > Hence, the only way to get VFP instructions in the kernel is via > explicit assembly sequences. > > The exception to this rule is the VFP support code itself, which > maintains the VFP state on behalf of the hardware and userspace (and > even then, that code is only concerned with reading and writing the > VFP registers, not using FP itself.) and also the VFP support trap for corner cases needed in old VFP implementations (VFP 2?). as i said before, this is consistent with what i found with objdump: only context switch and old VFP support trap code. > > In SMP environments, VFP state is saved each time we context switch > away from a thread. 

> If we resume the thread on the _same_ CPU and
> no one else has used the VFP since, we just re-enable access to VFP.
> Otherwise, we re-load the VFP state from the previously saved state.
>
> In UP environments, we do something similar, but we don't save until
> we need to.

this is SMP, and i verified that the resulting kernel uses eager FP state save (as required for SMP) and lazy restore.

> However, neon shares the VFP registers, and we have some code (crypto
> stuff) which uses neon, and this has appropriate guards to ensure that
> userspace does not see any changes. This is only available when
> CONFIG_KERNEL_MODE_NEON is enabled (but as you say you don't have
> kernel_neon_begin anywhere, you should /never/ execute any neon
> instructions in the kernel.)

no other neon/vfp instructions found in objdumps. the crypto acceleration (if the crypto code is in our trees at all) must be disabled then, for lack of CONFIG_KERNEL_MODE_NEON or some other config.

i am grepping the output of the full kernel and *.ko objdumps (see previous link) for 'dN' and 'dNN'; i am assuming that any useful VFP/NEON code that clobbers d8 would refer to some 'd' register by name.

> I hope this helps; I didn't answer your specific questions because it
> seemed I would just end up repeating what I've said above.

actually no, answers to my very specific questions would help me understand this: if we had a closed-source driver (ISR or kernel thread) that touched the FPU, how would the kernel react? would the kernel fail fast in every possible instance? if not, where would the code need to be and under what circumstances would it not cause a fast failure? knowing this would help me find the offending code (if such code exists; it may well be a hardware error).

thanks again.
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-07 22:15 ` Russell King - ARM Linux 2014-10-08 7:58 ` Lanchon @ 2014-10-08 8:19 ` Lanchon 2014-10-08 8:27 ` Russell King - ARM Linux 2014-10-08 8:35 ` Russell King - ARM Linux 1 sibling, 2 replies; 20+ messages in thread From: Lanchon @ 2014-10-08 8:19 UTC (permalink / raw) To: linux-arm-kernel On 10/07/2014 07:44 PM, Russell King - ARM Linux wrote: > On Tue, Oct 07, 2014 at 07:35:14PM -0300, Lanchon wrote: >>> I hope this helps; I didn't answer your specific questions because it >>> seemed I would just end up repeating what I've said above. >>> >> actually no, answers to my very specific questions would help me >> understand this: if we had a close-source driver (ISR or kernel thread) >> that touched the FPU, how would the kernel react? > I already covered this. It would corrupt the VFP state, thereby > corrupting the VFP state which userspace sees. > > Hence why I said: > > Which means that the kernel itself must /never/ make use of floating > point itself - if it does, it /will/ corrupt the user state in the way > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > you are seeing. > ^^^^^^^^^^^^^^^ > > How can I make that more clear? no, actually you did not answer my questions. you stated that the end result would be corruption of user FP state, which i already know. i am inquiring as to *how* the process of corruption comes about exactly, not the end result. knowing exactly how corruption can happen and how it cannot would help me decide where to look for the offending code. for instance, you say that if an ISR uses the FPU it would corrupt user FP state. fine, but it is not that simple. what if the FPU was disabled at the time of interrupt? (ie: lazy restore did not yet happen in this time-slice.) then the ISR FPU instruction would trap, not corrupt immediately. 
would the kernel recognize the trap was generated in ISR code and panic, or just blindly restore the FP context of the interrupted thread? if the former is true, then i can discount ISRs as sources of corruptions because i am not seeing panics, so there is no point in instrumenting ISRs. if the latter is true, ok fine... but what if the interrupted thread was a kernel thread? where would the restored FP context come from? answering these questions require both knowledge of the architecture of the linux kernel and of cortex-A, and i know neither of them, which is why i am asking in this list. a plausible answer (which i am making up out of the blue) would be: "each cpu is always working in the context of a 'current' or 'executing' userland process (which may be the idle process), with the MMU configured to its virtual address space and all, even when the cpu is executing a kernel thread. the FPU state and handling is not affected by user/kernel mode switches, only by userland context switches. this means that if a kernel thread executes FP instructions, the kernel will trap if the FPU is disabled and happily restore the context of the current userland process of the CPU for the kernel thread to corrupt next, never noticing that the trap originated in kernel mode. also the arm architecture will not disable the FPU on interrupt processing, and the kernel will not disable the FPU prior to dispatching the interrupt to the registered drivers. so the same thing would happen in an ISR, even if the ISR is interrupting a kernel thread." another plausible answer would be: "the kernel always disables the FPU on scheduling a kernel thread. the cost of fiddling with this is low compared to the safety it provides. if triggered, the FPU trap will notice that the CPU is in kernel mode and panic, failing fast. 
there are no special rules applying to interrupts: if an ISR issues FP instructions they will be handled as if they had been issued in the interrupted thread (kernel: panic; user: lazy restore and/or execution)." yet another: "the FPU is effectively disabled on interrupt processing by the arm architecture. while running in interrupt mode, and independent of the FPU enable status, all FP instructions will trap to a different FPU vector which will cause the kernel to panic." any and all of these hypothetical details would help me determine where *not* to look for the cause of the problem, where and what type of instrumentation is worth trying, etc. a simple "state would be corrupted" sentence does not give me any useful information that helps me find the source of the problem, but understanding the process of corruption might. (disregarding the fact that this is probably a hardware bug, maybe a cache coherence problem or something of the sort, and there might be no error in the code at all.) this is why i will close this email with a copy my questions for context. maybe someone can provide the answer for some. thanks again, and in advance to anyone who can help. regards, lanchon -------------------------- Kernel threads (such as the worker thread of a threaded interrupt) should guard FPU access in between 'kernel_XXX_begin'/'kernel_XXX_end' calls (which our kernels do not implement). But what if it did not? 1) What is the FPU enable state while executing a kernel thread in ARM arch? Which of these answers is correct? 1a) the FPU is always disabled in kernel threads. 1b) the FPU might be enabled or disabled in a kernel thread, depending on the FPU enable state of the userland context that executed before and/or some other factors. 2) What would happen if a kernel thread executed an FPU instruction without the kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and the FPU was disabled at the time? 
2a) In the FPU trap the kernel would always detect the issue and panic or oops or something. 2b) In the FPU trap the kernel might enable the FPU, load the FPU context of some userland process and resume the kernel thread. Of course an ISR should not touch the FPU at all. But what if it did? 3) What would happen if an ISR executed an FPU instruction in ARM arch and the FPU was disabled in the context that was interrupted: 3a) In the FPU trap the kernel would always detect the issue and panic or oops or something. 3b) In the FPU trap the kernel would react as if the interrupted context executed the FPU instruction: If the interrupted context was user mode, it would restore the userland process' FP context into the FPU. If the interrupted context was kernel mode, it would react as per the answer to question 2) above. 4) What would happen if an ISR executed an FPU instruction in ARM arch and the FPU was enabled in the context that was interrupted: 4a) The processor would disable the FPU on ISR entry automatically and thus the system would behave as described in the answer to question 3) above. 4b) If the driver uses the standard kernel interrupt dispatch architecture, the kernel would disable the FPU before dispatching the interrupt to the driver ISR, and so the system would also behave as described in 3). 4c) The FPU instruction would execute. There is no fail-fast or detection of this kind of violation by the kernel. Of course every pointer, idea, or suspicion that might seem relevant to the case is welcome. ^ permalink raw reply [flat|nested] 20+ messages in thread
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-08 8:19 ` Lanchon @ 2014-10-08 8:27 ` Russell King - ARM Linux 2014-10-08 8:35 ` Russell King - ARM Linux 1 sibling, 0 replies; 20+ messages in thread From: Russell King - ARM Linux @ 2014-10-08 8:27 UTC (permalink / raw) To: linux-arm-kernel On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote: > > On 10/07/2014 07:44 PM, Russell King - ARM Linux wrote: >> On Tue, Oct 07, 2014 at 07:35:14PM -0300, Lanchon wrote: >>>> I hope this helps; I didn't answer your specific questions because it >>>> seemed I would just end up repeating what I've said above. >>>> >>> actually no, answers to my very specific questions would help me >>> understand this: if we had a close-source driver (ISR or kernel thread) >>> that touched the FPU, how would the kernel react? >> I already covered this. It would corrupt the VFP state, thereby >> corrupting the VFP state which userspace sees. >> >> Hence why I said: >> >> Which means that the kernel itself must /never/ make use of floating >> point itself - if it does, it /will/ corrupt the user state in the way >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> you are seeing. >> ^^^^^^^^^^^^^^^ >> >> How can I make that more clear? > > no, actually you did not answer my questions. you stated that the end > result would be corruption of user FP state, which i already know. i am > inquiring as to *how* the process of corruption comes about exactly, not > the end result. It is really /very/ simple. 1. ISR changes VFP registers. 2. Userspace sees changed VFP registers. 3. Userspace state is corrupted. For some reason, you think that there's more going on here than that. There isn't. The kernel sees the very same set of registers as userspace sees. Any changes which the kernel makes to those registers will be visible to userspace. Hence, using VFP instructions in the kernel will result in VFP registers changing. Userspace will then see the changed VFP registers. 
The userspace state will then be corrupted. Simple. Really.

* FP register corruption in Exynos 4210 (Cortex-A9)
From: Russell King - ARM Linux @ 2014-10-08 8:35 UTC
To: linux-arm-kernel

On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
> for instance, you say that if an ISR uses the FPU it would corrupt user
> FP state. fine, but it is not that simple. what if the FPU was disabled
> at the time of interrupt? (ie: lazy restore did not yet happen in this
> time-slice.)

At that point, it depends on which kernel version you are using. Yes, older kernels will just restore the state. Newer kernels will trap this and complain.

> a plausible answer (which i am making up out of the blue) would be:

If you want to continue asking questions and getting answers, change your attitude; I am not a child.

You should also consider *not* writing essays, but instead ask clear, direct and to the point questions - in other words, short emails. Not everyone has the time or the patience to read huge long emails, or huge rambling threads of 50+ pages on web forums.
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-08 8:35 ` Russell King - ARM Linux @ 2014-10-08 8:53 ` Ard Biesheuvel 2014-10-08 9:22 ` Ard Biesheuvel 2014-10-09 22:20 ` Lanchon 0 siblings, 2 replies; 20+ messages in thread From: Ard Biesheuvel @ 2014-10-08 8:53 UTC (permalink / raw) To: linux-arm-kernel On 8 October 2014 10:35, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote: >> for instance, you say that if an ISR uses the FPU it would corrupt user >> FP state. fine, but it is not that simple. what if the FPU was disabled >> at the time of interrupt? (ie: lazy restore did not yet happen in this >> time-slice.) > > At that point, it depends on which kernel version you are using. Yes, > older kernels will just restore the state. Newer kernels will trap this > and complain. > Indeed. As part of the kernel mode NEON support (which landed in 3.12 I think?), the VFP trap handling now checks whether it occurred in kernel mode or user mode. Check arch/arm/vfp/vfphw.S:84 in your kernel tree for """ ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions and r3, r3, #MODE_MASK @ are supported in kernel mode teq r3, #USR_MODE bne vfp_kmode_exception @ Returns through lr """ Without these lines, the lazy restore machinery may kick in during the execution of an ISR that uses NEON registers inadvertently, and overwrite your VFP state with that of the process that happens to be active when the interrupt is taken. You should also be aware that q4 is an alias of d8-d9, so grep'ing your objdump for d8 is not sufficient. -- Ard. >> a plausible answer (which i am making up out of the blue) would be: > > If you want to continue asking questions and getting answers, change > your attitude; I am not a child. > > You should also consider *not* writing essays, but instead ask clear, > direct and to the point questions - in other words, short emails. 
Not > everyone has the time or the patience to read huge long emails, or > huge rambling threads of 50+ pages on web forums.
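
The aliasing point in this message matters when searching a disassembly: d8 can be written through other register names. As a sketch, the hypothetical helper below (`reg_overlaps_d8` does not come from any existing tool) applies the standard VFP/NEON register banking, where dN overlays s(2N) and s(2N+1) for N < 16 and qN overlays d(2N) and d(2N+1), to decide whether a single register name overlaps d8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Returns true if the register named by `name` (for example "s17", "d8" or "q4")
 * shares storage with d8. Unknown or malformed names return false.
 */
static bool reg_overlaps_d8(const char *name)
{
    char *end;
    long n;

    assert(name != NULL);
    /* Expect one letter followed by at least one decimal digit. */
    if (name[0] == '\0' || name[1] < '0' || name[1] > '9')
        return false;
    n = strtol(name + 1, &end, 10);
    if (*end != '\0')
        return false;

    switch (name[0]) {
    case 's': return n <= 31 && n / 2 == 8;   /* s16 and s17 form d8 */
    case 'd': return n == 8;
    case 'q': return n <= 15 && n == 8 / 2;   /* q4 is d8:d9 */
    default:  return false;
    }
}
```

Note that this only classifies one register name at a time. Register lists printed as ranges, such as `{d0-d15}` in a load or store multiple, cover d8 without naming it and would need separate handling.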
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-08 8:53 ` Ard Biesheuvel @ 2014-10-08 9:22 ` Ard Biesheuvel 2014-10-08 9:55 ` Russell King - ARM Linux 2014-10-09 22:36 ` Lanchon 2014-10-09 22:20 ` Lanchon 1 sibling, 2 replies; 20+ messages in thread From: Ard Biesheuvel @ 2014-10-08 9:22 UTC (permalink / raw) To: linux-arm-kernel On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 8 October 2014 10:35, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: >> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote: >>> for instance, you say that if an ISR uses the FPU it would corrupt user >>> FP state. fine, but it is not that simple. what if the FPU was disabled >>> at the time of interrupt? (ie: lazy restore did not yet happen in this >>> time-slice.) >> >> At that point, it depends on which kernel version you are using. Yes, >> older kernels will just restore the state. Newer kernels will trap this >> and complain. >> > > Indeed. As part of the kernel mode NEON support (which landed in 3.12 > I think?), the VFP trap handling now checks whether it occurred in > kernel mode or user mode. > Check arch/arm/vfp/vfphw.S:84 in your kernel tree for > > """ > ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions > and r3, r3, #MODE_MASK @ are supported in kernel mode > teq r3, #USR_MODE > bne vfp_kmode_exception @ Returns through lr > """ > > Without these lines, the lazy restore machinery may kick in during the > execution of an ISR that uses NEON registers inadvertently, and > overwrite your VFP state with that of the process that happens to be > active when the interrupt is taken. > Ehm ... maybe this is not entirely true. In order for the userland VFP state of some process to be clobbered, an ISR being executed while another process is active (which itself may not use the VFP at all) would not be sufficient, as it would be /that/ process's VFP state getting clobbered. 

So it is more likely (if you suspect the kernel) that the register is getting clobbered while the storage process has already 'unlocked' the VFP by accessing it from userland. That seems to agree with your scenario of a syscall being performed: if no task switch occurs, the VFP stays unlocked during the execution of that syscall.

So the question is, where does the VFP register write come from? Are there any out of tree modules in use, and if so, can you verify the CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp may result in GCC emitting NEON instructions when it detects loops it can vectorize.

-- 
Ard.

> You should also be aware that q4 is an alias of d8-d9, so grep'ing
> your objdump for d8 is not sufficient.
>
> -- 
> Ard.
>
>>> a plausible answer (which i am making up out of the blue) would be:
>>
>> If you want to continue asking questions and getting answers, change
>> your attitude; I am not a child.
>>
>> You should also consider *not* writing essays, but instead ask clear,
>> direct and to the point questions - in other words, short emails. Not
>> everyone has the time or the patience to read huge long emails, or
>> huge rambling threads of 50+ pages on web forums.
* FP register corruption in Exynos 4210 (Cortex-A9) 2014-10-08 9:22 ` Ard Biesheuvel @ 2014-10-08 9:55 ` Russell King - ARM Linux 2014-10-08 10:32 ` Ard Biesheuvel 2014-10-09 22:36 ` Lanchon 1 sibling, 1 reply; 20+ messages in thread From: Russell King - ARM Linux @ 2014-10-08 9:55 UTC (permalink / raw) To: linux-arm-kernel Ard, Note that you sent this message To: me, therefore you are addressing me in your message when you use "you". On Wed, Oct 08, 2014 at 11:22:32AM +0200, Ard Biesheuvel wrote: > On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > > On 8 October 2014 10:35, Russell King - ARM Linux > > <linux@arm.linux.org.uk> wrote: > >> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote: > >>> for instance, you say that if an ISR uses the FPU it would corrupt user > >>> FP state. fine, but it is not that simple. what if the FPU was disabled > >>> at the time of interrupt? (ie: lazy restore did not yet happen in this > >>> time-slice.) > >> > >> At that point, it depends on which kernel version you are using. Yes, > >> older kernels will just restore the state. Newer kernels will trap this > >> and complain. > >> > > > > Indeed. As part of the kernel mode NEON support (which landed in 3.12 > > I think?), the VFP trap handling now checks whether it occurred in > > kernel mode or user mode. > > Check arch/arm/vfp/vfphw.S:84 in your kernel tree for > > > > """ > > ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions > > and r3, r3, #MODE_MASK @ are supported in kernel mode > > teq r3, #USR_MODE > > bne vfp_kmode_exception @ Returns through lr > > """ > > > > Without these lines, the lazy restore machinery may kick in during the > > execution of an ISR that uses NEON registers inadvertently, and > > overwrite your VFP state with that of the process that happens to be > > active when the interrupt is taken. > > > > Ehm ... maybe this is not entirely true. 

It is true if VFP access was disabled, which is the scenario Lanchon was asking about. The answer to that is "it depends on the kernel version", and on whether it has your patch (part of the kernel mode neon support) in place to trap these.

Sure, if VFP access was enabled, then it is as I have already explained several times - kernel mode VFP usage will change the VFP state, and lead to userspace VFP state corruption.

Given the reported scenario, it depends on when the VFP access is happening. If it's happening before any scheduling, then VFP access will still be enabled and corruption will be silent. If it happens after a scheduling event, then VFP access will have been disabled, and we should get a trap into the VFP support code.

It should be noted that if it happens in a separate kernel thread, it will be independent of userland as it will have its own private and independent VFP state (but that doesn't mean we permit it.)

> So the question is, where does the VFP register write come from? Are
> there any out of tree modules in use, and if so, can you verify the
> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
> may result in GCC emitting NEON instructions when it detects loops it
> can vectorize.

I can't :) I assume you mean Lanchon... please address your messages a bit better!
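
The cases distinguished in this message can be condensed into a small decision function. This is only a sketch of the reasoning in the thread, not kernel code: the enum and function names are hypothetical, `has_kmode_check` stands for the presence of the `vfp_kmode_exception` branch quoted earlier in the thread, and the `MODE_MASK`, `USR_MODE` and `SVC_MODE` values are the usual ARM PSR mode-field encodings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODE_MASK 0x1fu   /* low five bits of the PSR select the processor mode */
#define USR_MODE  0x10u
#define SVC_MODE  0x13u

enum fp_outcome {
    FP_EXECUTES,          /* access enabled: the instruction runs on the live registers */
    FP_LAZY_RESTORE,      /* trap handler loads the current thread's saved state */
    FP_KMODE_EXCEPTION    /* trap handler sees a kernel-mode PSR and complains */
};

/* What happens to an FP instruction issued in the mode recorded in `psr`. */
static enum fp_outcome fp_insn_outcome(uint32_t psr, bool vfp_enabled,
                                       bool has_kmode_check)
{
    if (vfp_enabled)
        return FP_EXECUTES;                 /* no trap, so nothing can be checked */
    if (has_kmode_check && (psr & MODE_MASK) != USR_MODE)
        return FP_KMODE_EXCEPTION;          /* newer kernels */
    return FP_LAZY_RESTORE;                 /* user mode, or older kernels in any mode */
}

/* True when kernel-mode code can alter user-visible FP state without any complaint. */
static bool silent_corruption_possible(uint32_t psr, bool vfp_enabled,
                                       bool has_kmode_check)
{
    return (psr & MODE_MASK) != USR_MODE &&
           fp_insn_outcome(psr, vfp_enabled, has_kmode_check) != FP_KMODE_EXCEPTION;
}
```

Under these assumptions, a kernel without the mode check is silent in both kernel-mode cases, while a kernel with the check is silent only when VFP access is already enabled, for example during a syscall made by a process that has used the VFP in its current time slice.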
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-08 9:55 ` Russell King - ARM Linux
@ 2014-10-08 10:32 ` Ard Biesheuvel
  0 siblings, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-10-08 10:32 UTC
To: linux-arm-kernel

On 8 October 2014 11:55, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> Ard,
>
> Note that you sent this message To: me, therefore you are addressing
> me in your message when you use "you".
>

My apologies. The Gmail interface obfuscates the To and Cc fields a
bit. I was indeed addressing Lanchon.

--
Ard.

> On Wed, Oct 08, 2014 at 11:22:32AM +0200, Ard Biesheuvel wrote:
>> On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>> > On 8 October 2014 10:35, Russell King - ARM Linux
>> > <linux@arm.linux.org.uk> wrote:
>> >> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>> >>> for instance, you say that if an ISR uses the FPU it would corrupt user
>> >>> FP state. fine, but it is not that simple. what if the FPU was disabled
>> >>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>> >>> time-slice.)
>> >>
>> >> At that point, it depends on which kernel version you are using. Yes,
>> >> older kernels will just restore the state. Newer kernels will trap this
>> >> and complain.
>> >>
>> >
>> > Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> > I think?), the VFP trap handling now checks whether it occurred in
>> > kernel mode or user mode.
>> > Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>> >
>> > """
>> > ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> > and r3, r3, #MODE_MASK @ are supported in kernel mode
>> > teq r3, #USR_MODE
>> > bne vfp_kmode_exception @ Returns through lr
>> > """
>> >
>> > Without these lines, the lazy restore machinery may kick in during the
>> > execution of an ISR that uses NEON registers inadvertently, and
>> > overwrite your VFP state with that of the process that happens to be
>> > active when the interrupt is taken.
>> >
>>
>> Ehm ... maybe this is not entirely true.
>
> It is true if VFP access was disabled, which is the scenario which
> Lanchon was asking about. The answer to that is "it depends on the
> kernel version", and whether it has your patch as part of the kernel
> mode neon in place which traps these.
>
> Sure, if VFP access was enabled, then it is as I have already explained
> several times - kernel mode VFP usage will change the VFP state, and
> lead to userspace VFP state corruption.
>
> Given the reported scenario, it depends when the VFP access is happening.
> If it's happening before any scheduling, then VFP access will still be
> enabled and corruption will be silent. If it happens after a scheduling
> event, then VFP access will have been disabled, and we should get a trap
> into the VFP support code.
>
> It should be noted that if it happens in a separate kernel thread, it
> will be independent of userland as it will have its own private and
> independent VFP state (but that doesn't mean we permit it.)
>
>> So the question is, where does the VFP register write come from? Are
>> there any out of tree modules in use, and if so, can you verify the
>> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
>> may result in GCC emitting NEON instructions when it detects loops it
>> can vectorize.
>
> I can't :) I assume you mean Lanchon... please address your messages a
> bit better!
>
> --
> FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
> according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-08 9:22 ` Ard Biesheuvel
  2014-10-08 9:55 ` Russell King - ARM Linux
@ 2014-10-09 22:36 ` Lanchon
  1 sibling, 0 replies; 20+ messages in thread
From: Lanchon @ 2014-10-09 22:36 UTC
To: linux-arm-kernel

On 10/08/2014 06:22 AM, Ard Biesheuvel wrote:
> On 8 October 2014 10:53, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>>
>> Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> I think?), the VFP trap handling now checks whether it occurred in
>> kernel mode or user mode.
>> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>>
>> """
>> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> and r3, r3, #MODE_MASK @ are supported in kernel mode
>> teq r3, #USR_MODE
>> bne vfp_kmode_exception @ Returns through lr
>> """
>>
>> Without these lines, the lazy restore machinery may kick in during the
>> execution of an ISR that uses NEON registers inadvertently, and
>> overwrite your VFP state with that of the process that happens to be
>> active when the interrupt is taken.
>>
> Ehm ... maybe this is not entirely true. In order for the userland VFP
> state of some process to be clobbered, an ISR being executed while
> another process is active (which itself may not use the VFP at all)
> would not be sufficient, as it would be /that/ process's VFP state
> getting clobbered. So it is more likely (if you suspect the kernel)
> that the register is getting clobbered while the storage process has
> already 'unlocked' the VFP by accessing it from userland, which seems
> to be in agreement with your scenario of a syscall being performed,
> i.e., if no task switch occurs, the VFP would be unlocked during the
> execution of that syscall.

absolutely. this is the kind of fine detail i was asking about in my
first message. yes, 99% of the pread64s in question would happen with
the FPU enabled. this is at the heart of question 1) i originally asked,
for which i still didn't get a straight answer:

1) What is the FPU enable state while executing a kernel thread in ARM
arch? Which of these answers is correct?
1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending
on the FPU enable state of the userland context that executed before
and/or some other factors.

(maybe i should have used 'kernel mode' instead.)

you are clearly assuming 1b) in your text. (maybe because you know 1a)
to be false or maybe because you don't have the information.)

>
> So the question is, where does the VFP register write come from? Are
> there any out of tree modules in use, and if so, can you verify the
> CFLAGS? Note that merely using -O3 combined with -mfloat-abi=softfp
> may result in GCC emitting NEON instructions when it detects loops it
> can vectorize.
>

the flags are ok and the kernel works fine on other SoCs. there are
several KOs but i can't find FPU instructions in them.
-mfloat-abi=softfp lets GCC use the FPU at its leisure.
-mfloat-abi=soft is used everywhere.
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-08 8:53 ` Ard Biesheuvel
  2014-10-08 9:22 ` Ard Biesheuvel
@ 2014-10-09 22:20 ` Lanchon
  2014-10-09 22:32 ` Russell King - ARM Linux
  1 sibling, 1 reply; 20+ messages in thread
From: Lanchon @ 2014-10-09 22:20 UTC
To: linux-arm-kernel

On 10/08/2014 05:53 AM, Ard Biesheuvel wrote:
> On 8 October 2014 10:35, Russell King - ARM Linux
> <linux@arm.linux.org.uk> wrote:
>> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>>> for instance, you say that if an ISR uses the FPU it would corrupt user
>>> FP state. fine, but it is not that simple. what if the FPU was disabled
>>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>>> time-slice.)
>> At that point, it depends on which kernel version you are using. Yes,
>> older kernels will just restore the state. Newer kernels will trap this
>> and complain.
>>
> Indeed. As part of the kernel mode NEON support (which landed in 3.12
> I think?), the VFP trap handling now checks whether it occurred in
> kernel mode or user mode.
> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>
> """
> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
> and r3, r3, #MODE_MASK @ are supported in kernel mode
> teq r3, #USR_MODE
> bne vfp_kmode_exception @ Returns through lr
> """
>
> Without these lines, the lazy restore machinery may kick in during the
> execution of an ISR that uses NEON registers inadvertently, and
> overwrite your VFP state with that of the process that happens to be
> active when the interrupt is taken.

thank you for this! just one question. i suppose the 'kernel mode' test
used here will be positive if the trap happens while executing a kernel
thread. it should also be positive if the trap happens while executing
an ISR that interrupted a kernel thread. but what if the trap happens
while executing an ISR that interrupted userland? would this 'kernel
mode' test also be positive?

the 'official' kernels (like the one i linked in my first message) do
not have this feature. but i found this commit in Dorimanx, which is a
fairly widely used alternative kernel:
https://github.com/dorimanx/Dorimanx-SG2-I9100-Kernel/commit/d4f9e67b9395d5f0d7ce2a836f7c9b6edbae0fa0

i will have people retest with this kernel. but AFAIK, people do not
report panics or reboots with Dorimanx. and it is unreasonable to
believe a priori, of course, that every single time an ISR or kernel
thread is about to corrupt FPs, the FPU just happened to be enabled. so
this fail-fast mechanism not triggering points to the code being ok, and
this being more of a hardware issue.

>
> You should also be aware that q4 is an alias of d8-d9, so grep'ing
> your objdump for d8 is not sufficient.
>

thanks! i have objdumped the kernel and *.ko files again and found no
'qNN' registers mentioned either.

---

there is a new piece of information:
the FP corruption seems to only happen in these android devices if the
display is off. the charger may be connected or not, but if the display
is on, the corruption won't happen.

i wonder if the kernel could be turning off the FPU and then back on
without saving the FPU state. i would think corruption would be seen
more often then.

maybe it is restoring state before voltage to the FPU has stabilized.
this could be easily checked by instrumenting the state restore with a
check. but that sounds unreasonable: the delay implied by the lazy
restore mechanism should hide the effects of this 'race condition' of
sorts.
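
The register aliasing mentioned in this message can be checked mechanically rather than by grepping for one name at a time. The following is a minimal sketch, not part of the original thread, assuming the input is the operand text of `objdump -d` output. The helper names `d_span` and `touches_d` are hypothetical. It relies on the architectural overlay of the VFP/NEON register bank: qN covers d(2N) and d(2N+1), and sN covers one half of d(N // 2).

```python
import re

# Match s/d/q register names, optionally as a range such as d0-d15.
# The lookbehind rejects data-type suffixes such as ".s16" and hex text.
REG = re.compile(r"(?<![\w.])([sdq])([1-9]?\d)(?:-[sdq]([1-9]?\d))?\b")

def d_span(kind, lo, hi):
    """Map a register (or register range) onto d-register indices."""
    if kind == "d":
        return lo, hi
    if kind == "q":             # qN overlays d(2N) and d(2N+1)
        return 2 * lo, 2 * hi + 1
    return lo // 2, hi // 2     # sN overlays one half of d(N // 2)

def touches_d(line, n):
    """True if a disassembly line names any register overlapping d<n>."""
    for kind, lo, hi in REG.findall(line):
        lo = int(lo)
        hi = int(hi) if hi else lo
        first, last = d_span(kind, lo, hi)
        if first <= n <= last:
            return True
    return False
```

Filtering a disassembly with `touches_d(line, 8)` would flag uses of d8, q4, s16, s17 and any register list that spans them. Address and opcode columns can still cause false positives, so the result is a list of candidates to inspect by hand, not a proof.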
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-09 22:20 ` Lanchon
@ 2014-10-09 22:32 ` Russell King - ARM Linux
  2014-10-10 9:45 ` Arnd Bergmann
  0 siblings, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-09 22:32 UTC
To: linux-arm-kernel

On Thu, Oct 09, 2014 at 07:20:14PM -0300, Lanchon wrote:
>
> On 10/08/2014 05:53 AM, Ard Biesheuvel wrote:
>> On 8 October 2014 10:35, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>>> On Wed, Oct 08, 2014 at 05:19:19AM -0300, Lanchon wrote:
>>>> for instance, you say that if an ISR uses the FPU it would corrupt user
>>>> FP state. fine, but it is not that simple. what if the FPU was disabled
>>>> at the time of interrupt? (ie: lazy restore did not yet happen in this
>>>> time-slice.)
>>> At that point, it depends on which kernel version you are using. Yes,
>>> older kernels will just restore the state. Newer kernels will trap this
>>> and complain.
>>>
>> Indeed. As part of the kernel mode NEON support (which landed in 3.12
>> I think?), the VFP trap handling now checks whether it occurred in
>> kernel mode or user mode.
>> Check arch/arm/vfp/vfphw.S:84 in your kernel tree for
>>
>> """
>> ldr r3, [sp, #S_PSR] @ Neither lazy restore nor FP exceptions
>> and r3, r3, #MODE_MASK @ are supported in kernel mode
>> teq r3, #USR_MODE
>> bne vfp_kmode_exception @ Returns through lr
>> """
>>
>> Without these lines, the lazy restore machinery may kick in during the
>> execution of an ISR that uses NEON registers inadvertently, and
>> overwrite your VFP state with that of the process that happens to be
>> active when the interrupt is taken.
>
> thank you for this! just one question. i suppose the 'kernel mode' test
> used here will be positive if the trap happens while executing a kernel
> thread. it should also be positive if the trap happens while executing
> an ISR that interrupted a kernel thread. but what if the trap happens
> while executing an ISR that interrupted userland? would this 'kernel
> mode' test also be positive?

Yes.

> there is a new piece of information:
> the FP corruption seems to only happen in these android devices if the
> display is off. the charger may be connected or not, but if the display
> is on, the corruption won't happen.
>
> i wonder if the kernel could be turning off the FPU and then back on
> without saving the FPU state. i would think corruption would be seen
> more often then.

No. We don't "turn off" the VFP. We disable and enable access to VFP
via the coprocessor access register. If the VFP access is disabled and
then re-enabled, all state is preserved.

The only time which state would be lost is if (eg) we hot-unplug the
entire CPU, but that first requires a context switch which implies that
the state will already be saved.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
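
The "Yes" above follows from what the quoted assembly actually tests. A minimal sketch, not part of the original thread, of that test in Python: the constants are the ARM processor mode encodings, and the function name `trap_is_kernel_mode` is hypothetical. The assumption that matters is that the PSR saved at the undefined-instruction trap describes the mode in which the faulting VFP instruction ran, not the mode of whatever the ISR had interrupted earlier.

```python
MODE_MASK = 0x1F   # low five PSR bits select the processor mode
USR_MODE = 0x10
IRQ_MODE = 0x12
SVC_MODE = 0x13

def trap_is_kernel_mode(saved_psr):
    """Mirror of the quoted check: any mode other than USR is 'kernel'."""
    return (saved_psr & MODE_MASK) != USR_MODE
```

Under that assumption, a VFP instruction executed by an ISR traps with a privileged mode in the saved PSR even when the ISR interrupted userland, so `trap_is_kernel_mode` returns True and the trap is reported instead of triggering a lazy restore.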
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-09 22:32 ` Russell King - ARM Linux
@ 2014-10-10 9:45 ` Arnd Bergmann
  2014-10-10 10:01 ` Russell King - ARM Linux
  0 siblings, 1 reply; 20+ messages in thread
From: Arnd Bergmann @ 2014-10-10 9:45 UTC
To: linux-arm-kernel

On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
> > there is a new piece of information:
> > the FP corruption seems to only happen in these android devices if the
> > display is off. the charger may be connected or not, but if the display
> > is on, the corruption won't happen.
> >
> > i wonder if the kernel could be turning off the FPU and then back on
> > without saving the FPU state. i would think corruption would be seen
> > more often then.
>
> No. We don't "turn off" the VFP. We disable and enable access to VFP
> via the coprocessor access register. If the VFP access is disabled and
> then re-enabled, all state is preserved.
>
> The only time which state would be lost is if (eg) we hot-unplug the
> entire CPU, but that first requires a context switch which implies that
> the state will already be saved.

Could the problem be caused by a bug in the exynos CPU suspend/resume
path then? E.g. if we go to sleep with VFP access disabled but it
comes back with VFP access enabled (or vice versa) that could lead
to the wrong register state being seen by the user space application.

Arnd
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-10 9:45 ` Arnd Bergmann
@ 2014-10-10 10:01 ` Russell King - ARM Linux
  2014-12-22 22:46 ` Lanchon
  0 siblings, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-10-10 10:01 UTC
To: linux-arm-kernel

On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
> > > there is a new piece of information:
> > > the FP corruption seems to only happen in these android devices if the
> > > display is off. the charger may be connected or not, but if the display
> > > is on, the corruption won't happen.
> > >
> > > i wonder if the kernel could be turning off the FPU and then back on
> > > without saving the FPU state. i would think corruption would be seen
> > > more often then.
> >
> > No. We don't "turn off" the VFP. We disable and enable access to VFP
> > via the coprocessor access register. If the VFP access is disabled and
> > then re-enabled, all state is preserved.
> >
> > The only time which state would be lost is if (eg) we hot-unplug the
> > entire CPU, but that first requires a context switch which implies that
> > the state will already be saved.
>
> Could the problem be caused by a bug in the exynos CPU suspend/resume
> path then? E.g. if we go to sleep with VFP access disabled but it
> comes back with VFP access enabled (or vice versa) that could lead
> to the wrong register state being seen by the user space application.

Well, an interesting test would be to save out the entire VFP state
both before and after the pread64 call, and then inspect that to
determine whether it is a single register or multiple registers
which are being corrupted.

However, looking at the mainline code, we do the right thing with the
CPU PM infrastructure, and that is called appropriately by the exynos
CPU idle driver.

So, another possible test for Lanchon would be to see whether disabling
CPU idle support fixes the problem.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
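
The first test suggested above, comparing the whole VFP state before and after the call, reduces to a register-by-register diff once the two snapshots exist. A minimal sketch, not part of the original thread, assuming each snapshot is a sequence of 64-bit integers holding d0, d1, ... in order. How the snapshots are captured on the device is outside this sketch, and the name `diff_vfp_snapshots` is hypothetical.

```python
def diff_vfp_snapshots(before, after):
    """Return {register_name: (old, new)} for every d register that changed."""
    if len(before) != len(after):
        raise ValueError("snapshots must cover the same registers")
    return {
        f"d{i}": (old, new)
        for i, (old, new) in enumerate(zip(before, after))
        if old != new
    }
```

A result with a single entry would point at one register being written, while a result covering most of the bank would suggest that the whole register file was lost or replaced with another context's state.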
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-10-10 10:01 ` Russell King - ARM Linux
@ 2014-12-22 22:46 ` Lanchon
  2014-12-22 23:29 ` Russell King - ARM Linux
  2014-12-23 8:45 ` Ard Biesheuvel
  0 siblings, 2 replies; 20+ messages in thread
From: Lanchon @ 2014-12-22 22:46 UTC
To: linux-arm-kernel

On 10/10/2014 07:01 AM, Russell King - ARM Linux wrote:
> On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
>> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
>>>> there is a new piece of information:
>>>> the FP corruption seems to only happen in these android devices if the
>>>> display is off. the charger may be connected or not, but if the display
>>>> is on, the corruption won't happen.
>>>>
>>>> i wonder if the kernel could be turning off the FPU and then back on
>>>> without saving the FPU state. i would think corruption would be seen
>>>> more often then.
>>> No. We don't "turn off" the VFP. We disable and enable access to VFP
>>> via the coprocessor access register. If the VFP access is disabled and
>>> then re-enabled, all state is preserved.
>>>
>>> The only time which state would be lost is if (eg) we hot-unplug the
>>> entire CPU, but that first requires a context switch which implies that
>>> the state will already be saved.
>> Could the problem be caused by a bug in the exynos CPU suspend/resume
>> path then? E.g. if we go to sleep with VFP access disabled but it
>> comes back with VFP access enabled (or vice versa) that could lead
>> to the wrong register state being seen by the user space application.
> Well, an interesting test would be to save out the entire VFP state
> both before and after the pread64 call, and then inspect that to
> determine whether it is a single register or multiple registers
> which are being corrupted.
>
> However, looking at the mainline code, we do the right thing with the
> CPU PM infrastructure, and that is called appropriately by the exynos
> CPU idle driver.
>
> So, another possible test for Lanchon would be to see whether disabling
> CPU idle support fixes the problem.
>

hi again! thank you all for your help. i sort of disappeared, i'm very sorry
about that.

i never mentioned it here, but the fact was that i didn't have a device to
test on. so all i could do was post test code and ask users for their help.
at some point no one was helping; i waited for test results but they never
happened, so i got frustrated and abandoned the project.

but recently interest built up again and we were able to progress and
finally fix this, so i'm writing to let you know how it turned out.

so remember there was random userland VFP register corruption. the VFP state
was not being corrupted in the registers nor in the saved state in ram. what
happened was: the kernel tracks the leftover state in the VFP once the eager
state save is done. in the lazy restore trap, the kernel optimizes away the
state load and instead only enables the VFP if it can prove that the
leftover state in the VFP hardware matches the process state saved in ram.

however under some circumstances the kernel did the wrong thing: it didn't
reload the registers even though it was needed, probably because the
hardware had been powered down and had lost state without the tracking code
getting word of it. just disabling the optimization made the kernel solid.

a couple of days later the root cause seems to have been identified and
fixed. i describe the whole thing here:
http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088

once again, thank you for all your help.

kind regards
Lanchon
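
The failure described in this message can be reproduced in a toy model. The following is a minimal sketch, not part of the original thread and not the kernel's code: the classes `Task` and `Cpu`, the function `run`, and the `notify_pm` flag are all hypothetical names. It models only the three ingredients named above: an eager save on switch-out, a lazy restore that skips the reload when the tracked owner matches, and a power-down that either does or does not tell the tracking code that the hardware state is gone.

```python
class Task:
    def __init__(self, saved):
        self.saved = saved            # VFP state saved in RAM


class Cpu:
    """Toy model of one core's VFP bank plus the ownership tracking."""

    def __init__(self):
        self.hw_regs = None           # what is physically in the registers
        self.last_owner = None        # task believed to own the hardware state
        self.access_enabled = False

    def switch_out(self, task):
        """Eager save: copy hardware state to RAM, then disable access."""
        if self.access_enabled:
            task.saved = self.hw_regs
        self.access_enabled = False

    def lazy_restore_trap(self, task):
        """First VFP use after a switch: reload only if tracking says so."""
        if self.last_owner is not task:
            self.hw_regs = task.saved
            self.last_owner = task
        self.access_enabled = True

    def power_cycle(self, notify_pm):
        """The core loses power and the registers come back undefined."""
        self.hw_regs = "LOST"
        if notify_pm:                 # what a CPU PM notification would do
            self.last_owner = None


def run(notify_pm):
    cpu, task = Cpu(), Task(saved=42)
    cpu.lazy_restore_trap(task)       # task starts using the VFP
    cpu.switch_out(task)              # state saved, access disabled
    cpu.power_cycle(notify_pm)        # idle path powers the core down
    cpu.lazy_restore_trap(task)       # task touches the VFP again
    return cpu.hw_regs
```

In this model `run(notify_pm=False)` returns the lost state, because the tracking still claims the task owns the hardware registers and the reload is skipped, while `run(notify_pm=True)` returns the saved value. Always reloading in `lazy_restore_trap` corresponds to the "disabling the optimization" workaround mentioned above.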
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-12-22 22:46 ` Lanchon
@ 2014-12-22 23:29 ` Russell King - ARM Linux
  2014-12-22 23:42 ` Lanchon
  2014-12-23 8:45 ` Ard Biesheuvel
  1 sibling, 1 reply; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-12-22 23:29 UTC
To: linux-arm-kernel

On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
> however under some circumstances the kernel did the wrong thing: it didn't
> reload the registers even though it was needed, probably because the
> hardware had been powered down and had lost state without the tracking code
> getting word of it. just disabling the optimization made the kernel solid.

Right, so mainline kernels don't exhibit the behaviour...

> a couple of days later the root cause seems to have been identified and
> fixed. i describe the whole thing here:
> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088

... because it's a local issue with cpuidle not calling the appropriate
CPU PM functions, and that means there are no patches that we need to deal
with for mainline kernels, right?

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-12-22 23:29 ` Russell King - ARM Linux
@ 2014-12-22 23:42 ` Lanchon
  2014-12-22 23:50 ` Russell King - ARM Linux
  0 siblings, 1 reply; 20+ messages in thread
From: Lanchon @ 2014-12-22 23:42 UTC
To: linux-arm-kernel

On 12/22/2014 08:29 PM, Russell King - ARM Linux wrote:
> On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
>> however under some circumstances the kernel did the wrong thing: it didn't
>> reload the registers even though it was needed, probably because the
>> hardware had been powered down and had lost state without the tracking code
>> getting word of it. just disabling the optimization made the kernel solid.
> Right, so mainline kernel's don't exhibit the behaviour...
>
>> a couple of days later the root cause seems to have been identified and
>> fixed. i describe the whole thing here:
>> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
> ... because it's a local issue with cpuidle not calling the appropriate
> CPU PM functions, and that means there's no patches that we need to deal
> with for mainline kernels, right?
>

that's what i think, yes. i only got back here on the list to thank you and
let you know what was wrong since you helped me a couple of months ago.
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-12-22 23:42 ` Lanchon
@ 2014-12-22 23:50 ` Russell King - ARM Linux
  0 siblings, 0 replies; 20+ messages in thread
From: Russell King - ARM Linux @ 2014-12-22 23:50 UTC
To: linux-arm-kernel

On Mon, Dec 22, 2014 at 08:42:02PM -0300, Lanchon wrote:
>
> On 12/22/2014 08:29 PM, Russell King - ARM Linux wrote:
> >On Mon, Dec 22, 2014 at 07:46:27PM -0300, Lanchon wrote:
> >>however under some circumstances the kernel did the wrong thing: it didn't
> >>reload the registers even though it was needed, probably because the
> >>hardware had been powered down and had lost state without the tracking code
> >>getting word of it. just disabling the optimization made the kernel solid.
> >Right, so mainline kernel's don't exhibit the behaviour...
> >
> >>a couple of days later the root cause seems to have been identified and
> >>fixed. i describe the whole thing here:
> >>http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
> >... because it's a local issue with cpuidle not calling the appropriate
> >CPU PM functions, and that means there's no patches that we need to deal
> >with for mainline kernels, right?
> >
>
> that's what i think, yes. i only got back here on the list to thank you and
> let you know what was wrong since you helped me a couple of months ago.

Please let us know if anything changes, thanks. And have a Merry
Christmas.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
* FP register corruption in Exynos 4210 (Cortex-A9)
  2014-12-22 22:46 ` Lanchon
  2014-12-22 23:29 ` Russell King - ARM Linux
@ 2014-12-23 8:45 ` Ard Biesheuvel
  1 sibling, 0 replies; 20+ messages in thread
From: Ard Biesheuvel @ 2014-12-23 8:45 UTC
To: linux-arm-kernel

On 22 December 2014 at 22:46, Lanchon <lanchon@gmail.com> wrote:
>
> On 10/10/2014 07:01 AM, Russell King - ARM Linux wrote:
>>
>> On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
>>>
>>> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
>>>>>
>>>>> there is a new piece of information:
>>>>> the FP corruption seems to only happen in these android devices if the
>>>>> display is off. the charger may be connected or not, but if the display
>>>>> is on, the corruption won't happen.
>>>>>
>>>>> i wonder if the kernel could be turning off the FPU and then back on
>>>>> without saving the FPU state. i would think corruption would be seen
>>>>> more often then.
>>>>
>>>> No. We don't "turn off" the VFP. We disable and enable access to VFP
>>>> via the coprocessor access register. If the VFP access is disabled and
>>>> then re-enabled, all state is preserved.
>>>>
>>>> The only time which state would be lost is if (eg) we hot-unplug the
>>>> entire CPU, but that first requires a context switch which implies that
>>>> the state will already be saved.
>>>
>>> Could the problem be caused by a bug in the exynos CPU suspend/resume
>>> path then? E.g. if we go to sleep with VFP access disabled but it
>>> comes back with VFP access enabled (or vice versa) that could lead
>>> to the wrong register state being seen by the user space application.
>>
>> Well, an interesting test would be to save out the entire VFP state
>> both before and after the pread64 call, and then inspect that to
>> determine whether it is a single register or multiple registers
>> which are being corrupted.
>>
>> However, looking at the mainline code, we do the right thing with the
>> CPU PM infrastructure, and that is called appropriately by the exynos
>> CPU idle driver.
>>
>> So, another possible test for Lanchon would be to see whether disabling
>> CPU idle support fixes the problem.
>>
>
> hi again! thank you all for your help. i sort of disappeared, i'm very sorry
> about that.
>
> i never mentioned it here, but the fact was that i didn't have a device to
> test on. so all i could do was post test code and ask users for their help.
> at some point no one was helping; i waited for test results but they never
> happened, so i got frustrated and abandoned the project.
>
> but recently interest built up again and we were able to progress and
> finally fix this, so i'm writing to let you know how it turned out.
>
> so remember there was random userland VFP register corruption. the VFP state
> was not being corrupted in the registers nor in the saved state in ram. what
> happened was: the kernel tracks the leftover state in the VFP once the eager
> state save is done. in the lazy restore trap, the kernel optimizes away the
> state load and instead only enables the VFP if it can prove that the
> leftover state in the VFP hardware matches the process state saved in ram.
>
> however under some circumstances the kernel did the wrong thing: it didn't
> reload the registers even though it was needed, probably because the
> hardware had been powered down and had lost state without the tracking code
> getting word of it. just disabling the optimization made the kernel solid.
>
> a couple of days later the root cause seems to have been identified and
> fixed. i describe the whole thing here:
> http://forum.xda-developers.com/galaxy-s2/development-derivatives/kernel-fpbug-stable-4-x-kernel-galaxy-t2978088
>
> once again, thank you for all your help.
>

Nice work! Seems like quite an adventure you guys had there.

--
Ard.