From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from sog-mx-4.v43.ch3.sourceforge.com ([172.29.43.194]
    helo=mx.sourceforge.net)
    by sfs-ml-4.v29.ch3.sourceforge.com with esmtp (Exim 4.76)
    (envelope-from )
    id 1TfXXJ-0005Lu-0Z
    for ltp-list@lists.sourceforge.net; Mon, 03 Dec 2012 15:02:21 +0000
Received: from [222.73.24.84] (helo=song.cn.fujitsu.com)
    by sog-mx-4.v43.ch3.sourceforge.com with esmtp (Exim 4.76)
    id 1TfXXC-0004O4-BT
    for ltp-list@lists.sourceforge.net; Mon, 03 Dec 2012 15:02:20 +0000
Message-ID: <50BCBE9D.7060506@cn.fujitsu.com>
Date: Mon, 03 Dec 2012 23:00:45 +0800
From: Wanlong Gao
MIME-Version: 1.0
References: <1835266843.9449106.1354527440428.JavaMail.root@redhat.com>
    <50BCA312.7070508@st.com>
In-Reply-To: <50BCA312.7070508@st.com>
Subject: Re: [LTP] clone03/06 randomly crashing
Reply-To: gaowanlong@cn.fujitsu.com
List-Id: Linux Test Project General Discussions
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ltp-list-bounces@lists.sourceforge.net
To: Carmelo AMOROSO
Cc: ltp-list@lists.sourceforge.net

On 12/03/2012 09:03 PM, Carmelo AMOROSO wrote:
> On 03/12/2012 10.37, Jan Stancek wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jan Stancek"
>>> To: ltp-list@lists.sourceforge.net
>>> Cc: "Jeffrey Burke"
>>> Sent: Friday, 30 November, 2012 3:37:03 PM
>>> Subject: [LTP] clone03/06 randomly crashing
>>>
>>> Hi,
>>>
>>> I'm occasionally getting core files from clone03/clone06 testcases.
>>> The testcase itself gives PASS, it is the child which is randomly
>>> crashing.
>>> It seems to occur more on single cpu systems.
>>>
>>> For example:
>>> Core was generated by `clone03'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0 0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>>>    ttype=2,
>>>    tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>>    at tst_res.c:412
>>> 412 {
>>> (gdb) bt
>>> #0 0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>>>    ttype=2,
>>>    tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>>    at tst_res.c:412
>>> #1 0x00000000004031be in tst_res (ttype=2, fname=<value optimized out>,
>>>    arg_fmt=) at tst_res.c:316
>>> #2 0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0
>>>    , arg_fmt=) at tst_res.c:640
>>> #3 0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 ,
>>>    arg_fmt=) at tst_res.c:698
>>> #4 0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
>>> #5
>>> #6 0x00000037940db650 in __write_nocancel () at
>>>    ../sysdeps/unix/syscall-template.S:82
>>> #7 0x000000000040169e in child_fn () at clone03.c:208
>>> #8 0x00000037940e890d in clone () at
>>>    ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>
>>> Dump of assembler code for function tst_print:
>>>    0x0000000000402bd0 <+0>: mov %rbx,-0x30(%rsp)
>>>    0x0000000000402bd5 <+5>: mov %rbp,-0x28(%rsp)
>>>    0x0000000000402bda <+10>: mov %edx,%ebx
>>>    0x0000000000402bdc <+12>: mov %r12,-0x20(%rsp)
>>>    0x0000000000402be1 <+17>: mov %r13,-0x18(%rsp)
>>>    0x0000000000402be6 <+22>: mov %rdi,%r12
>>>    0x0000000000402be9 <+25>: mov %r14,-0x10(%rsp)
>>>    0x0000000000402bee <+30>: mov %r15,-0x8(%rsp)
>>>    0x0000000000402bf3 <+35>: sub $0x2858,%rsp
>>>    0x0000000000402bfa <+42>: mov %esi,%r14d
>>> => 0x0000000000402bfd <+45>: mov %rcx,0x18(%rsp)
>>>
>>> (gdb) p $rsp
>>> $1 = (void *) 0x14c3800
>>> (gdb) x/1x $rsp
>>> 0x14c3800: Cannot access memory at address 0x14c3800
>>>
>>> It looks like it receives SIGTERM and while handling SIGTERM it hits
>>> SIGSEGV.
>>> I don't know what is source of that SIGTERM. I was looking into the
>>> second part
>>> and looks like the stack for child is not large enough.
>>>
>>> I modified clone03.c (see attached clone03_poison.patch) to get some
>>> extra
>>> empty buffer before the child's stack, which was set to pattern 0xDE.
>>>
>>> Before:
>>>                       |-------------------------------|
>>>                       child_stack
>>>                            child_stack+CHILD_STACK_SIZE
>>> After:
>>> |---------------------|-------------------------------|
>>> poison_start          child_stack
>>>                            child_stack+CHILD_STACK_SIZE
>>>
>>> Now if I start clone03 and kill it I can randomly reproduce the
>>> SIGSEGV (attached clone03_kill.sh).
>>> The backtrace usually looks like:
>>> ... (random place)
>>> #5 0x000000000040324e in tst_res (ttype=2, fname=<value optimized out>,
>>>    arg_fmt=) at tst_res.c:316
>>> #6 0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420
>>>    , arg_fmt=) at tst_res.c:640
>>> #7 0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 ,
>>>    arg_fmt=) at tst_res.c:698
>>> #8 0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
>>> #9
>>> #10 0x0000003327cdb650 in __write_nocancel () at
>>>    ../sysdeps/unix/syscall-template.S:82
>>> #11 0x000000000040172e in child_fn () at clone03.c:212
>>> #12 0x0000003327ce890d in clone () at
>>>    ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>
>>> (gdb) p poison_start
>>> $1 = (void *) 0xa02010
>>> (gdb) p child_stack
>>> $2 = (void *) 0xa03010
>>>
>>> (gdb) x/16x poison_start
>>> 0xa02010: 0xdededede 0xdededede 0xdededede 0xdededede
>>> 0xa02020: 0xdededede 0xdededede 0xdededede 0xdededede
>>> 0xa02030: 0xdededede 0xdededede 0xdededede 0xdededede
>>> 0xa02040: 0xdededede 0xdededede 0xdededede 0xdededede
>>> ...
>>> (gdb)
>>> 0xa02490: 0xdededede 0xdededede 0xdededede 0xdededede
>>> 0xa024a0: 0x00000018 0x00000030 0x00a02800 0x00000000
>>> 0xa024b0: 0x00a02740 0x00000000 0xdededede 0xdededede
>>> 0xa024c0: 0xdededede 0xdededede 0x27409296 0x00000033
>>>
>>> The above shows that 0xDE pattern has been overwritten.
>>>
>>> Extending child stack helps with the second part: SIGSEGV
>>> #define CHILD_STACK_SIZE 16384*4
>>> but I have no idea, where is that first SIGTERM coming from. Any
>>> ideas?
>>
>> It appears to be ltp-pan, which sees the child as orphan.
>> When I added "-d 511", I've got some additional output:
>>
>> <<>>
>> initiation_status="ok"
>> duration=0 termination_type=exited termination_id=0 corefile=no
>> cutime=0 cstime=0
>> <<>>
>> pids still running:
>> orphans still running: -26125
>> clone03 1 TBROK : unexpected signal 15 received (pid = 26126).
>> clone03 2 TBROK : Remaining cases broken
>>
>> pan was signaled with sig 2...
>> propagating sig 2 to orphaned pgrp -26125
>> orphans still running:
>>
>> I'll send a patch, that adds wait() to parent.
>>
>> Regards,
>> Jan
>
> Hi Jan,
> I think you're right. We have hit similar problems with setrlimit01, and
> few other tests.
>
> Unfortunately we did not upstream these patches as we are still working
> with an older LTP.
>
> I'll try to rebase it and share some other pending patches we are using
> in our project.

Sounds great, thank you very much.

Regards,
Wanlong Gao

>
> Regards,
> Carmelo
>
>>
>> ------------------------------------------------------------------------------
>> Keep yourself connected to Go Parallel:
>> BUILD Helping you discover the best ways to construct your parallel projects.
>> http://goparallel.sourceforge.net
>> _______________________________________________
>> Ltp-list mailing list
>> Ltp-list@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/ltp-list
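
For reference, the poison-buffer check and the wait() in the parent discussed in this thread could be sketched roughly as below. This is only an illustration under stated assumptions, not the attached clone03_poison.patch and not the patch Jan offers to send: POISON_SIZE, count_overwritten() and run_poisoned_child() are hypothetical names, and CLONE_VM is added here so that the parent can inspect the guard area directly (in the thread the pattern was examined in the child's core file with gdb instead).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

#define POISON_BYTE      0xDE         /* pattern mentioned in the thread */
#define POISON_SIZE      4096         /* assumption: illustrative guard size */
#define CHILD_STACK_SIZE (16384 * 4)  /* enlarged size proposed in the thread */

/* Hypothetical child: does almost nothing, so it needs very little stack. */
int child_fn(void *arg)
{
	(void)arg;
	return 0;
}

/* Count bytes in the guard area that no longer hold the poison pattern. */
size_t count_overwritten(const unsigned char *area, size_t len)
{
	size_t i, n = 0;

	assert(area != NULL);
	for (i = 0; i < len; i++)
		if (area[i] != POISON_BYTE)
			n++;
	return n;
}

/*
 * Run child_fn on a stack preceded by a poisoned guard area.
 * Returns the number of overwritten guard bytes, or -1 on error.
 */
long run_poisoned_child(void)
{
	unsigned char *poison_start, *child_stack;
	long overwritten;
	int status;
	pid_t pid;

	poison_start = malloc(POISON_SIZE + CHILD_STACK_SIZE);
	if (poison_start == NULL)
		return -1;
	memset(poison_start, POISON_BYTE, POISON_SIZE);
	child_stack = poison_start + POISON_SIZE;

	/*
	 * The stack grows down on x86, so pass the highest address.
	 * CLONE_VM is an assumption of this sketch: it lets the parent
	 * see what the child wrote below child_stack.
	 */
	pid = clone(child_fn, child_stack + CHILD_STACK_SIZE,
		    CLONE_VM | SIGCHLD, NULL);
	if (pid == -1) {
		free(poison_start);
		return -1;
	}

	/* Reap the child before returning so it is never left as an orphan. */
	if (waitpid(pid, &status, 0) == -1)
		return -1; /* not freed: the child may still be using the stack */

	overwritten = (long)count_overwritten(poison_start, POISON_SIZE);
	free(poison_start);
	return overwritten;
}
```

Calling run_poisoned_child() from a small main() should return 0 for this trivial child, since it never gets near the bottom of its stack. A non-zero result would mean the child wrote below child_stack, which is the situation the gdb dump of poison_start shows.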