* crash after receiving SIGCHLD during system call @ 2017-05-17 16:04 David Mosberger 2017-05-17 17:09 ` Russell King - ARM Linux 0 siblings, 1 reply; 10+ messages in thread From: David Mosberger @ 2017-05-17 16:04 UTC (permalink / raw) To: linux-arm-kernel I have been seeing occasional random crashes of lighttpd for a few months now. I was hoping it was just a temporary glitch, but the problem has persisted across many different kernel versions (currently at 4.9.28) and BUILDROOT versions (currently at 2017.02.2). The issue sounds a lot like this bug report: http://lists.infradead.org/pipermail/linux-arm-kernel/2014-March/241421.html We use BUILDROOT with glibc v2.24 on a Cortex A5 processor (with hard-float). At this point, I'm mostly wondering whether we're the only ones seeing this problem. strace indicates that lighttpd always crashes after receiving SIGCHLD (i.e., no further system calls until the SIGSEGV or SIGILL hits). I added some instrumentation code to the SIGCHLD handler of lighttpd v1.4.45 and I have seen crashes after the SIGCHLD handler interrupted __libc_fork() and close(). The one constant so far is that the pc register in the signal handler machine-context (mcontext_t) points to the instruction after the "svc 0" instruction. Anyone have seen anything similar? --david ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 16:04 crash after receiving SIGCHLD during system call David Mosberger @ 2017-05-17 17:09 ` Russell King - ARM Linux 2017-05-17 22:28 ` David Mosberger 0 siblings, 1 reply; 10+ messages in thread From: Russell King - ARM Linux @ 2017-05-17 17:09 UTC (permalink / raw) To: linux-arm-kernel On Wed, May 17, 2017 at 10:04:32AM -0600, David Mosberger wrote: > I added some instrumentation code to the SIGCHLD handler of lighttpd > v1.4.45 and I have seen crashes after the SIGCHLD handler > interrupted __libc_fork() and close(). The one constant so far is > that the pc register in the signal handler machine-context > (mcontext_t) points to the instruction after the "svc 0" instruction. I don't think there's much to be read into that as signal delivery to a process always occurs on the thread's exit path from kernel mode to user mode. This happens after: - completion of an interrupt handler - completion of a kernel syscall - completion of a page fault What would help is to enable CONFIG_DEBUG_USER and pass 'user_debug=31' to the kernel. It should be more verbose about the cause of the fault from the kernel point of view. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 17:09 ` Russell King - ARM Linux @ 2017-05-17 22:28 ` David Mosberger 2017-05-17 22:46 ` David Mosberger 2017-05-17 23:02 ` Russell King - ARM Linux 0 siblings, 2 replies; 10+ messages in thread From: David Mosberger @ 2017-05-17 22:28 UTC (permalink / raw) To: linux-arm-kernel On Wed, May 17, 2017 at 11:09 AM, Russell King - ARM Linux <linux@armlinux.org.uk> wrote: > > On Wed, May 17, 2017 at 10:04:32AM -0600, David Mosberger wrote: > > I added some instrumentation code to the SIGCHLD handler of lighttpd > > v1.4.45 and I have seen crashes after the SIGCHLD handler > > interrupted __libc_fork() and close(). The one constant so far is > > that the pc register in the signal handler machine-context > > (mcontext_t) points to the instruction after the "svc 0" instruction. > > I don't think there's much to be read into that as signal delivery to > a process always occurs on the thread's exit path from kernel mode to > user mode. > > This happens after: > > - completion of an interrupt handler > - completion of a kernel syscall > - completion of a page fault You may well be right. I added some instrumentation and it looks like about 4 out of 5 signals are delivered during return from system call. I have certainly watched many more than 5 of these failures and all where delivered during syscall return, but of course that doesn't prove that the problem doesn't happen on non-syscall kernel return. > What would help is to enable CONFIG_DEBUG_USER and pass 'user_debug=31' > to the kernel. It should be more verbose about the cause of the fault > from the kernel point of view. OK, since I see various faults, including SIGILL, I used user_debug=63. Here is one example: 2017-05-17 22:12:20: (log.c.217) server started [ 129.810000] pgd = cf0b4000 [ 129.810000] [00000073] *pgd=2f903831, *pte=00000000, *ppte=00000000 [ 129.820000] CPU: 0 PID: 701 Comm: lighttpd Not tainted 4.9.28+ #58 [ 129.820000] Hardware name: Atmel SAMA5 [ 129.830000] task: cecfcd80 task.stack: cf102000 [ 129.830000] PC is at 0x18af4 <-- points to "movle r6, r3" instruction [ 129.830000] LR is@0xb6c04510 [ 129.840000] pc : [<00018af4>] lr : [<b6c04510>] psr: 00070030 [ 129.840000] sp : bee098ec ip : ffffffff fp : 01ee4740 [ 129.850000] r10: 00000008 r9 : 00000000 r8 : b6d12c40 [ 129.850000] r7 : 00034684 r6 : 00000062 r5 : ffffffff r4 : 00000000 [ 129.860000] r3 : ff000000 r2 : bee09978 r1 : bee098f8 r0 : 00000073 [ 129.870000] Flags: nzcv IRQs on FIQs on Mode USER_32 ISA Thumb Segment user [ 129.870000] Control: 10c53c7d Table: 2f0b4059 DAC: 00000055 [ 129.880000] CPU: 0 PID: 701 Comm: lighttpd Not tainted 4.9.28+ #58 [ 129.890000] Hardware name: Atmel SAMA5 [ 129.890000] Backtrace: [ 129.890000] [<c010a640>] (dump_backtrace) from [<c010a890>] (show_stack+0x18/0x1c) [ 129.900000] r7:00000817 r6:00000073 r5:0000000b r4:cecfcd80 [ 129.910000] [<c010a878>] (show_stack) from [<c0244e6c>] (dump_stack+0x20/0x28) [ 129.910000] [<c0244e4c>] (dump_stack) from [<c01079d0>] (show_regs+0x14/0x18) [ 129.920000] [<c01079bc>] (show_regs) from [<c011023c>] (__do_user_fault+0x7c/0xc4) [ 129.930000] [<c01101c0>] (__do_user_fault) from [<c01104ec>] (do_page_fault+0x268/0x310) [ 129.940000] r7:00000055 r6:00030001 r5:00000073 r4:cf103fb0 [ 129.940000] [<c0110284>] (do_page_fault) from [<c010123c>] (do_DataAbort+0x3c/0xbc) [ 129.950000] r10:00000008 r9:00000000 r8:10c53c7d r7:cf103fb0 r6:c060679c r5:00000073 [ 129.960000] r4:00000817 [ 129.960000] [<c0101200>] (do_DataAbort) from [<c010b620>] (__dabt_usr+0x40/0x60) [ 129.970000] Exception stack(0xcf103fb0 to 0xcf103ff8) [ 129.970000] 3fa0: 00000073 bee098f8 bee09978 ff000000 [ 129.980000] 3fc0: 00000000 ffffffff 00000062 00034684 b6d12c40 00000000 00000008 01ee4740 [ 129.990000] 3fe0: ffffffff bee098ec b6c04510 00018af4 00070030 ffffffff [ 130.000000] r7:10c53c7d r6:ffffffff r5:00070030 r4:00018af4 Program received signal SIGSEGV, Segmentation fault. I'm not very good at reading ARM tombstones but if I read this right, the kernel got a page fault due to a data access but a "movle r6, r3" instruction doesn't access data memory. Are we dealing with a instruction cache issue? And it says we're in "Thumb" mode? That shouldn't be the case. For the record, during the last SIGCHLD signal before the above fault, my instrumentation code captures this machine-context: (gdb) p/x sigtrace.mctx[sigtrace.idx] $4 = { trap_no = 0x6, error_code = 0x0, oldmask = 0x0, arm_r0 = 0x0, arm_r1 = 0x6, arm_r2 = 0x3, arm_r3 = 0xbee05aa4, arm_r4 = 0x4, arm_r5 = 0x0, arm_r6 = 0x1ea4718, arm_r7 = 0x126, arm_r8 = 0x40000, arm_r9 = 0x0, arm_r10 = 0x0, arm_fp = 0x0, arm_ip = 0x4aafc, arm_sp = 0xbee05a90, arm_lr = 0x1e344, <--- network_write_chunkqueue() function in lighttpd arm_pc = 0xb6ca9b70, <---- first instruction after setsockopt syscall arm_cpsr = 0x200b0010, fault_address = 0x0 } Also for the record, using an alternate signal stack doesn't help. --david ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 22:28 ` David Mosberger @ 2017-05-17 22:46 ` David Mosberger 2017-05-17 23:02 ` Russell King - ARM Linux 1 sibling, 0 replies; 10+ messages in thread From: David Mosberger @ 2017-05-17 22:46 UTC (permalink / raw) To: linux-arm-kernel Hmmh, sure enough: the "movle r6, r3" instruction has hex code 0xd1a06003. If we decode the lower-half of that (0x6003) as Thumb, it'd be a "store with immediate offset" instruction, with r3 as destination, r0 as base, and 0 offset. The tombstone shows r3=ff000000. I'm not sure where to look for the fault-triggering address in the tombstone, but it seems a store to 0xff00000 was attempted and that caused the segfault. Correct? ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 22:28 ` David Mosberger 2017-05-17 22:46 ` David Mosberger @ 2017-05-17 23:02 ` Russell King - ARM Linux 2017-05-17 23:27 ` David Mosberger 2017-05-19 19:42 ` David Mosberger 1 sibling, 2 replies; 10+ messages in thread From: Russell King - ARM Linux @ 2017-05-17 23:02 UTC (permalink / raw) To: linux-arm-kernel On Wed, May 17, 2017 at 04:28:44PM -0600, David Mosberger wrote: > OK, since I see various faults, including SIGILL, I used > user_debug=63. Here is one example: > > 2017-05-17 22:12:20: (log.c.217) server started > [ 129.810000] pgd = cf0b4000 > [ 129.810000] [00000073] *pgd=2f903831, *pte=00000000, *ppte=00000000 > [ 129.820000] CPU: 0 PID: 701 Comm: lighttpd Not tainted 4.9.28+ #58 > [ 129.820000] Hardware name: Atmel SAMA5 > [ 129.830000] task: cecfcd80 task.stack: cf102000 > [ 129.830000] PC is at 0x18af4 <-- points to "movle r6, r3" instruction > [ 129.830000] LR is at 0xb6c04510 > [ 129.840000] pc : [<00018af4>] lr : [<b6c04510>] psr: 00070030 > [ 129.840000] sp : bee098ec ip : ffffffff fp : 01ee4740 > [ 129.850000] r10: 00000008 r9 : 00000000 r8 : b6d12c40 > [ 129.850000] r7 : 00034684 r6 : 00000062 r5 : ffffffff r4 : 00000000 > [ 129.860000] r3 : ff000000 r2 : bee09978 r1 : bee098f8 r0 : 00000073 > [ 129.870000] Flags: nzcv IRQs on FIQs on Mode USER_32 ISA Thumb > Segment user ... > Program received signal SIGSEGV, Segmentation fault. > > I'm not very good at reading ARM tombstones but if I read this right, > the kernel got a page fault due to a data access but a "movle r6, r3" > instruction doesn't access data memory. Are we dealing with a > instruction cache issue? > > And it says we're in "Thumb" mode? That shouldn't be the case. That does appear to be the case - the PSR value confirms it. The segfault is at address 0x73, and an ARM "movle r6, r3" instruction assembles to 0xd1a06003, which would correspond with Thumb: 0: d1a0 bne.n ffffff44 <.text+0xffffff44> 2: 6003 str r3, [r0, #0] Since r0 is 0x00000073, this ties up. So, the problem seems to be the T bit in the PSR is somehow getting set. The kernel signal handling merely saves the PSR value it got on entry (in the pt_regs structure) onto the userspace stack as part of the mcontext (see setup_sigframe in arch/arm/kernel/signal.c). I think you've confirmed that the saved information looks correct. The question then becomes what happens after the signal handler returns. If there is no sigreturn or rt_sigreturn syscall, then the return is being done entirely by userspace, which means userspace is responsible for unstacking the mcontext, including switching to the correct ISA. If there is a sigreturn syscall, the kernel will unstack the mcontext, (see sys_*sigreturn in arch/arm/kernel/signal.c) replacing the syscall's pt_regs with the saved mcontext registers. The resulting state is validated (to prevent userspace gaining privileged modes) before returning. So the T bit should be restored, unless something in userspace decided to set it. The validation will fix up the CPSR state if it looks bad (as a belt and braces) before returning zero to indicate illegal state, which will result in a forced SIGSEGV being delivered to the program. -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 23:02 ` Russell King - ARM Linux @ 2017-05-17 23:27 ` David Mosberger 2017-05-18 5:34 ` David Mosberger 2017-05-19 19:42 ` David Mosberger 1 sibling, 1 reply; 10+ messages in thread From: David Mosberger @ 2017-05-17 23:27 UTC (permalink / raw) To: linux-arm-kernel Ah, yes, that makes sense now. Thanks for decoding this properly! It does look like rt_sigreturn is being used: --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=4909, si_uid=33, si_status=0, si_utime=1, si_stime=2} --- rt_sigreturn() = 0 Rather disconcerting this problem. I'll keep digging... --david ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 23:27 ` David Mosberger @ 2017-05-18 5:34 ` David Mosberger 2017-05-18 15:28 ` David Mosberger 0 siblings, 1 reply; 10+ messages in thread From: David Mosberger @ 2017-05-18 5:34 UTC (permalink / raw) To: linux-arm-kernel So even with turning on all Errata workarounds, disabling the D-cache and I-cache, and disabling branch prediction the problem still occurs. I also confirmed that at the end of sys_rt_sigreturn(), the T bit in regs->ARM_cpsr is NEVER set. --david ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-18 5:34 ` David Mosberger @ 2017-05-18 15:28 ` David Mosberger 2017-05-18 16:11 ` Russell King - ARM Linux 0 siblings, 1 reply; 10+ messages in thread From: David Mosberger @ 2017-05-18 15:28 UTC (permalink / raw) To: linux-arm-kernel On Wed, May 17, 2017 at 11:34 PM, David Mosberger <davidm@egauge.net> wrote: > I also confirmed that at the end of sys_rt_sigreturn(), the T bit in > regs->ARM_cpsr is NEVER set. Similarly, regs->ARM_lr does not have bit 0 set (expect when it's obviously used as a scratch registers, with values of 0x1 or 0x10d). To me, it looks like bit 0 of a return address gets corrupted (possibly a register state corruption or an on-stack corruption). That would in turn cause the "bx rl" instruction to turn on Thumb mode and things go south from there. It's interesting that only one bit get corrupted since the address overall looks plausible. --david ^ permalink raw reply [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-18 15:28 ` David Mosberger @ 2017-05-18 16:11 ` Russell King - ARM Linux 0 siblings, 0 replies; 10+ messages in thread From: Russell King - ARM Linux @ 2017-05-18 16:11 UTC (permalink / raw) To: linux-arm-kernel On Thu, May 18, 2017 at 09:28:00AM -0600, David Mosberger wrote: > On Wed, May 17, 2017 at 11:34 PM, David Mosberger <davidm@egauge.net> wrote: > > > I also confirmed that at the end of sys_rt_sigreturn(), the T bit in > > regs->ARM_cpsr is NEVER set. > > Similarly, regs->ARM_lr does not have bit 0 set (expect when it's > obviously used as a scratch registers, with values of 0x1 or 0x10d). Yea, that'll possibly matter for when the function returns, rather than the execution of the function itself. > To me, it looks like bit 0 of a return address gets corrupted > (possibly a register state corruption or an on-stack corruption). > That would in turn cause the "bx rl" instruction to turn on Thumb mode > and things go south from there. It's interesting that only one bit > get corrupted since the address overall looks plausible. The bit 0 value of the mcontext PC shouldn't matter, as we're restoring the full system state including the PSR value, rather than performing a normal function return. The return sequence is: mov r2, sp @ save current svc stack pointer ldr r1, [r2, #\offset + S_PSR] @ get calling cpsr ldr lr, [r2, #\offset + S_PC]! @ get pc msr spsr_cxsf, r1 @ save in spsr_svc ldmdb r2, {r0 - lr}^ @ get calling r0 - lr add sp, sp, #\offset + PT_REGS_SIZE @ balance svc stack movs pc, lr @ return & move spsr_svc into cpsr The only possible issue there is that ARMv7 deprecates "movs pc, lr" preferring "subs pc, lr, #0" instead. May be worth seeing whether that makes any difference: arch/arm/kernel/entry-header.S | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S index 6391728c8f03..9ee5825b4507 100644 --- a/arch/arm/kernel/entry-header.S +++ b/arch/arm/kernel/entry-header.S @@ -312,7 +312,7 @@ mov r0, r0 @ ARMv5T and earlier require a nop @ after ldm {}^ add sp, sp, #\offset + PT_REGS_SIZE - movs pc, lr @ return & move spsr_svc into cpsr + subs pc, lr, #0 @ return & move spsr_svc into cpsr #elif defined(CONFIG_CPU_V7M) @ V7M restore. @ Note that we don't need to do clrex here as clearing the local @@ -339,7 +339,7 @@ ldmdb sp, {r0 - r12} @ get calling r0 - r12 .endif add sp, sp, #PT_REGS_SIZE - S_SP - movs pc, lr @ return & move spsr_svc into cpsr + subs pc, lr, #0 @ return & move spsr_svc into cpsr #endif /* !CONFIG_THUMB2_KERNEL */ .endm -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up according to speedtest.net. ^ permalink raw reply related [flat|nested] 10+ messages in thread
* crash after receiving SIGCHLD during system call 2017-05-17 23:02 ` Russell King - ARM Linux 2017-05-17 23:27 ` David Mosberger @ 2017-05-19 19:42 ` David Mosberger 1 sibling, 0 replies; 10+ messages in thread From: David Mosberger @ 2017-05-19 19:42 UTC (permalink / raw) To: linux-arm-kernel Just to close on this thread: the reason for the crashes came from the kernel not having Thumb enabled. The user-level code wasn't supposed to use Thumb, but it turns out that glibc-2.24 will use Thumb instructions in the strlen(), strcmp(), and memcmp() functions even when you don't ask for it (i.e., there is no CONFIG_THUMB or some such enabled anywhere). So, what happened was that user-level used Thumb and occasionally an interrupt would happen in the middle of the Thumb code. Now, if the kernel ended up needing to send a signal (SIGCHLD in my case), it would set up the signal frame but NOT clear the Thumb-mode bit (since Thumb-support wasn't enabled). Thus, when resuming user-level execution, the CPU would try to execute the signal handler in Thumb mode, which would then lead to the crash. Mystery solved. The code-size increase due to enabling Thumb support in the kernel and since there is no way to stop user-level from using Thumb, it's really necessary to always enable Thumb support in the kernel. I do think it'd be a good idea to have a BUG_ON() for this case (i.e., attempting to deliver signal with Thumb-mode bit on, but Thumb-support disabled), but that's up to the arm kernel maintainers, of course. --david ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2017-05-19 19:42 UTC | newest] Thread overview: 10+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-05-17 16:04 crash after receiving SIGCHLD during system call David Mosberger 2017-05-17 17:09 ` Russell King - ARM Linux 2017-05-17 22:28 ` David Mosberger 2017-05-17 22:46 ` David Mosberger 2017-05-17 23:02 ` Russell King - ARM Linux 2017-05-17 23:27 ` David Mosberger 2017-05-18 5:34 ` David Mosberger 2017-05-18 15:28 ` David Mosberger 2017-05-18 16:11 ` Russell King - ARM Linux 2017-05-19 19:42 ` David Mosberger
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).