public inbox for ltp@lists.linux.it
 help / color / mirror / Atom feed
* [LTP] clone03/06 randomly crashing
@ 2012-11-30 14:37 Jan Stancek
  2012-12-03  9:37 ` Jan Stancek
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Stancek @ 2012-11-30 14:37 UTC (permalink / raw)
  To: ltp-list; +Cc: Jeffrey Burke

[-- Attachment #1: Type: text/plain, Size: 4316 bytes --]

Hi,

I'm occasionally getting core files from clone03/clone06 testcases.
The testcase itself gives PASS, it is the child which is randomly crashing.
It seems to occur more on single cpu systems.

For example:
Core was generated by `clone03'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1, ttype=2, 
    tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).") at tst_res.c:412
412	{
(gdb) bt
#0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1, ttype=2, 
    tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).") at tst_res.c:412
#1  0x00000000004031be in tst_res (ttype=2, fname=<value optimized out>, arg_fmt=<value optimized out>) at tst_res.c:316
#2  0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0 <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
#3  0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:698
#4  0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
#5  <signal handler called>
#6  0x00000037940db650 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#7  0x000000000040169e in child_fn () at clone03.c:208
#8  0x00000037940e890d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

Dump of assembler code for function tst_print:
   0x0000000000402bd0 <+0>:	mov    %rbx,-0x30(%rsp)
   0x0000000000402bd5 <+5>:	mov    %rbp,-0x28(%rsp)
   0x0000000000402bda <+10>:	mov    %edx,%ebx
   0x0000000000402bdc <+12>:	mov    %r12,-0x20(%rsp)
   0x0000000000402be1 <+17>:	mov    %r13,-0x18(%rsp)
   0x0000000000402be6 <+22>:	mov    %rdi,%r12
   0x0000000000402be9 <+25>:	mov    %r14,-0x10(%rsp)
   0x0000000000402bee <+30>:	mov    %r15,-0x8(%rsp)
   0x0000000000402bf3 <+35>:	sub    $0x2858,%rsp
   0x0000000000402bfa <+42>:	mov    %esi,%r14d
=> 0x0000000000402bfd <+45>:	mov    %rcx,0x18(%rsp)

(gdb) p $rsp
$1 = (void *) 0x14c3800
(gdb) x/1x $rsp
0x14c3800:	Cannot access memory at address 0x14c3800

It looks like it receives SIGTERM and while handling SIGTERM it hits SIGSEGV.
I don't know what is source of that SIGTERM. I was looking into the second part
and looks like the stack for child is not large enough.

I modified clone03.c (see attached clone03_poison.patch) to get some extra
empty buffer before the child's stack, which was set to pattern 0xDE.

Before:
                          |-------------------------------|
                     child_stack             child_stack+CHILD_STACK_SIZE
After:
    |---------------------|-------------------------------|
poision_start        child_stack             child_stack+CHILD_STACK_SIZE

Now if I start clone03 and kill it I can randomly reproduce the SIGSEGV (attached clone03_kill.sh).
The backtrace usually looks like:
... (random place)
#5  0x000000000040324e in tst_res (ttype=2, fname=<value optimized out>, arg_fmt=<value optimized out>) at tst_res.c:316
#6  0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420 <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
#7  0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:698
#8  0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
#9  <signal handler called>
#10 0x0000003327cdb650 in __write_nocancel () at ../sysdeps/unix/syscall-template.S:82
#11 0x000000000040172e in child_fn () at clone03.c:212
#12 0x0000003327ce890d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

(gdb) p poison_start 
$1 = (void *) 0xa02010
(gdb) p child_stack 
$2 = (void *) 0xa03010

(gdb) x/16x poison_start
0xa02010:	0xdededede	0xdededede	0xdededede	0xdededede
0xa02020:	0xdededede	0xdededede	0xdededede	0xdededede
0xa02030:	0xdededede	0xdededede	0xdededede	0xdededede
0xa02040:	0xdededede	0xdededede	0xdededede	0xdededede
...
(gdb) 
0xa02490:	0xdededede	0xdededede	0xdededede	0xdededede
0xa024a0:	0x00000018	0x00000030	0x00a02800	0x00000000
0xa024b0:	0x00a02740	0x00000000	0xdededede	0xdededede
0xa024c0:	0xdededede	0xdededede	0x27409296	0x00000033

The above shows that 0xDE pattern has been overwritten. 

Extending child stack helps with the second part: SIGSEGV
#define CHILD_STACK_SIZE 16384*4
but I have no idea, where is that first SIGTERM coming from. Any ideas?

Regards,
Jan

[-- Attachment #2: clone03_kill.sh --]
[-- Type: application/x-sh, Size: 174 bytes --]

[-- Attachment #3: clone03_poison.patch --]
[-- Type: text/x-patch, Size: 1229 bytes --]

diff --git a/testcases/kernel/syscalls/clone/clone03.c b/testcases/kernel/syscalls/clone/clone03.c
index 24ee8e6..dada00c 100644
--- a/testcases/kernel/syscalls/clone/clone03.c
+++ b/testcases/kernel/syscalls/clone/clone03.c
@@ -87,13 +87,15 @@ static int pfd[2];
 
 char *TCID = "clone03";		/* Test program identifier.    */
 int TST_TOTAL = 1;		/* Total number of test cases. */
+void *poison_start;		/* stack for child */
+void *child_stack;	/* stack for child */
+#define POISON_SIZE getpagesize()
 
 int main(int ac, char **av)
 {
 
 	int lc;
 	char *msg;
-	void *child_stack;	/* stack for child */
 	char buff[10];
 	int child_pid;
 
@@ -104,10 +106,13 @@ int main(int ac, char **av)
 	setup();
 
 	/* Allocate stack for child */
-	if ((child_stack = (void *)malloc(CHILD_STACK_SIZE)) == NULL) {
+	if ((poison_start = (void *)malloc(POISON_SIZE+CHILD_STACK_SIZE)) == NULL) {
 		tst_brkm(TBROK, cleanup, "Cannot allocate stack for child");
 	}
 
+	memset(poison_start, 0xDE, POISON_SIZE);
+	child_stack = poison_start + POISON_SIZE;
+
 	for (lc = 0; TEST_LOOPING(lc); lc++) {
 
 		Tst_count = 0;
@@ -154,7 +159,7 @@ int main(int ac, char **av)
 
 	}
 
-	free(child_stack);
+	free(poison_start);
 
 	cleanup();
 	tst_exit();

[-- Attachment #4: Type: text/plain, Size: 214 bytes --]

------------------------------------------------------------------------------
Keep yourself connected to Go Parallel: 
TUNE You got it built. Now make it sing. Tune shows you how.
http://goparallel.sourceforge.net

[-- Attachment #5: Type: text/plain, Size: 155 bytes --]

_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [LTP] clone03/06 randomly crashing
  2012-11-30 14:37 [LTP] clone03/06 randomly crashing Jan Stancek
@ 2012-12-03  9:37 ` Jan Stancek
  2012-12-03 13:03   ` Carmelo AMOROSO
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Stancek @ 2012-12-03  9:37 UTC (permalink / raw)
  To: ltp-list; +Cc: Jeffrey Burke



----- Original Message -----
> From: "Jan Stancek" <jstancek@redhat.com>
> To: ltp-list@lists.sourceforge.net
> Cc: "Jeffrey Burke" <jburke@redhat.com>
> Sent: Friday, 30 November, 2012 3:37:03 PM
> Subject: [LTP] clone03/06 randomly crashing
> 
> Hi,
> 
> I'm occasionally getting core files from clone03/clone06 testcases.
> The testcase itself gives PASS, it is the child which is randomly
> crashing.
> It seems to occur more on single cpu systems.
> 
> For example:
> Core was generated by `clone03'.
> Program terminated with signal 11, Segmentation fault.
> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
> ttype=2,
>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>     at tst_res.c:412
> 412	{
> (gdb) bt
> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
> ttype=2,
>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>     at tst_res.c:412
> #1  0x00000000004031be in tst_res (ttype=2, fname=<value optimized
> out>, arg_fmt=<value optimized out>) at tst_res.c:316
> #2  0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0
> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
> #3  0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 <cleanup>,
> arg_fmt=<value optimized out>) at tst_res.c:698
> #4  0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
> #5  <signal handler called>
> #6  0x00000037940db650 in __write_nocancel () at
> ../sysdeps/unix/syscall-template.S:82
> #7  0x000000000040169e in child_fn () at clone03.c:208
> #8  0x00000037940e890d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> 
> Dump of assembler code for function tst_print:
>    0x0000000000402bd0 <+0>:	mov    %rbx,-0x30(%rsp)
>    0x0000000000402bd5 <+5>:	mov    %rbp,-0x28(%rsp)
>    0x0000000000402bda <+10>:	mov    %edx,%ebx
>    0x0000000000402bdc <+12>:	mov    %r12,-0x20(%rsp)
>    0x0000000000402be1 <+17>:	mov    %r13,-0x18(%rsp)
>    0x0000000000402be6 <+22>:	mov    %rdi,%r12
>    0x0000000000402be9 <+25>:	mov    %r14,-0x10(%rsp)
>    0x0000000000402bee <+30>:	mov    %r15,-0x8(%rsp)
>    0x0000000000402bf3 <+35>:	sub    $0x2858,%rsp
>    0x0000000000402bfa <+42>:	mov    %esi,%r14d
> => 0x0000000000402bfd <+45>:	mov    %rcx,0x18(%rsp)
> 
> (gdb) p $rsp
> $1 = (void *) 0x14c3800
> (gdb) x/1x $rsp
> 0x14c3800:	Cannot access memory at address 0x14c3800
> 
> It looks like it receives SIGTERM and while handling SIGTERM it hits
> SIGSEGV.
> I don't know what is source of that SIGTERM. I was looking into the
> second part
> and looks like the stack for child is not large enough.
> 
> I modified clone03.c (see attached clone03_poison.patch) to get some
> extra
> empty buffer before the child's stack, which was set to pattern 0xDE.
> 
> Before:
>                           |-------------------------------|
>                      child_stack
>                                  child_stack+CHILD_STACK_SIZE
> After:
>     |---------------------|-------------------------------|
> poision_start        child_stack
>             child_stack+CHILD_STACK_SIZE
> 
> Now if I start clone03 and kill it I can randomly reproduce the
> SIGSEGV (attached clone03_kill.sh).
> The backtrace usually looks like:
> ... (random place)
> #5  0x000000000040324e in tst_res (ttype=2, fname=<value optimized
> out>, arg_fmt=<value optimized out>) at tst_res.c:316
> #6  0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420
> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
> #7  0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 <cleanup>,
> arg_fmt=<value optimized out>) at tst_res.c:698
> #8  0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
> #9  <signal handler called>
> #10 0x0000003327cdb650 in __write_nocancel () at
> ../sysdeps/unix/syscall-template.S:82
> #11 0x000000000040172e in child_fn () at clone03.c:212
> #12 0x0000003327ce890d in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
> 
> (gdb) p poison_start
> $1 = (void *) 0xa02010
> (gdb) p child_stack
> $2 = (void *) 0xa03010
> 
> (gdb) x/16x poison_start
> 0xa02010:	0xdededede	0xdededede	0xdededede	0xdededede
> 0xa02020:	0xdededede	0xdededede	0xdededede	0xdededede
> 0xa02030:	0xdededede	0xdededede	0xdededede	0xdededede
> 0xa02040:	0xdededede	0xdededede	0xdededede	0xdededede
> ...
> (gdb)
> 0xa02490:	0xdededede	0xdededede	0xdededede	0xdededede
> 0xa024a0:	0x00000018	0x00000030	0x00a02800	0x00000000
> 0xa024b0:	0x00a02740	0x00000000	0xdededede	0xdededede
> 0xa024c0:	0xdededede	0xdededede	0x27409296	0x00000033
> 
> The above shows that 0xDE pattern has been overwritten.
> 
> Extending child stack helps with the second part: SIGSEGV
> #define CHILD_STACK_SIZE 16384*4
> but I have no idea, where is that first SIGTERM coming from. Any
> ideas?

It appears to be ltp-pan, which sees the child as orphan.
When I added "-d 511", I've got some additional output:

<<<execution_status>>>
initiation_status="ok"
duration=0 termination_type=exited termination_id=0 corefile=no
cutime=0 cstime=0
<<<test_end>>>
pids still running: 
orphans still running: -26125 
clone03     1  TBROK  :  unexpected signal 15 received (pid = 26126).
clone03     2  TBROK  :  Remaining cases broken

pan was signaled with sig 2...
  propagating sig 2 to orphaned pgrp -26125
orphans still running: 

I'll send a patch, that adds wait() to parent.

Regards,
Jan

------------------------------------------------------------------------------
Keep yourself connected to Go Parallel: 
BUILD Helping you discover the best ways to construct your parallel projects.
http://goparallel.sourceforge.net
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LTP] clone03/06 randomly crashing
  2012-12-03  9:37 ` Jan Stancek
@ 2012-12-03 13:03   ` Carmelo AMOROSO
  2012-12-03 15:00     ` Wanlong Gao
  2012-12-06  9:43     ` chrubis
  0 siblings, 2 replies; 6+ messages in thread
From: Carmelo AMOROSO @ 2012-12-03 13:03 UTC (permalink / raw)
  To: ltp-list

On 03/12/2012 10.37, Jan Stancek wrote:
> 
> 
> ----- Original Message -----
>> From: "Jan Stancek" <jstancek@redhat.com>
>> To: ltp-list@lists.sourceforge.net
>> Cc: "Jeffrey Burke" <jburke@redhat.com>
>> Sent: Friday, 30 November, 2012 3:37:03 PM
>> Subject: [LTP] clone03/06 randomly crashing
>>
>> Hi,
>>
>> I'm occasionally getting core files from clone03/clone06 testcases.
>> The testcase itself gives PASS, it is the child which is randomly
>> crashing.
>> It seems to occur more on single cpu systems.
>>
>> For example:
>> Core was generated by `clone03'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>> ttype=2,
>>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>     at tst_res.c:412
>> 412	{
>> (gdb) bt
>> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>> ttype=2,
>>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>     at tst_res.c:412
>> #1  0x00000000004031be in tst_res (ttype=2, fname=<value optimized
>> out>, arg_fmt=<value optimized out>) at tst_res.c:316
>> #2  0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0
>> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
>> #3  0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 <cleanup>,
>> arg_fmt=<value optimized out>) at tst_res.c:698
>> #4  0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
>> #5  <signal handler called>
>> #6  0x00000037940db650 in __write_nocancel () at
>> ../sysdeps/unix/syscall-template.S:82
>> #7  0x000000000040169e in child_fn () at clone03.c:208
>> #8  0x00000037940e890d in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>
>> Dump of assembler code for function tst_print:
>>    0x0000000000402bd0 <+0>:	mov    %rbx,-0x30(%rsp)
>>    0x0000000000402bd5 <+5>:	mov    %rbp,-0x28(%rsp)
>>    0x0000000000402bda <+10>:	mov    %edx,%ebx
>>    0x0000000000402bdc <+12>:	mov    %r12,-0x20(%rsp)
>>    0x0000000000402be1 <+17>:	mov    %r13,-0x18(%rsp)
>>    0x0000000000402be6 <+22>:	mov    %rdi,%r12
>>    0x0000000000402be9 <+25>:	mov    %r14,-0x10(%rsp)
>>    0x0000000000402bee <+30>:	mov    %r15,-0x8(%rsp)
>>    0x0000000000402bf3 <+35>:	sub    $0x2858,%rsp
>>    0x0000000000402bfa <+42>:	mov    %esi,%r14d
>> => 0x0000000000402bfd <+45>:	mov    %rcx,0x18(%rsp)
>>
>> (gdb) p $rsp
>> $1 = (void *) 0x14c3800
>> (gdb) x/1x $rsp
>> 0x14c3800:	Cannot access memory at address 0x14c3800
>>
>> It looks like it receives SIGTERM and while handling SIGTERM it hits
>> SIGSEGV.
>> I don't know what is source of that SIGTERM. I was looking into the
>> second part
>> and looks like the stack for child is not large enough.
>>
>> I modified clone03.c (see attached clone03_poison.patch) to get some
>> extra
>> empty buffer before the child's stack, which was set to pattern 0xDE.
>>
>> Before:
>>                           |-------------------------------|
>>                      child_stack
>>                                  child_stack+CHILD_STACK_SIZE
>> After:
>>     |---------------------|-------------------------------|
>> poision_start        child_stack
>>             child_stack+CHILD_STACK_SIZE
>>
>> Now if I start clone03 and kill it I can randomly reproduce the
>> SIGSEGV (attached clone03_kill.sh).
>> The backtrace usually looks like:
>> ... (random place)
>> #5  0x000000000040324e in tst_res (ttype=2, fname=<value optimized
>> out>, arg_fmt=<value optimized out>) at tst_res.c:316
>> #6  0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420
>> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
>> #7  0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 <cleanup>,
>> arg_fmt=<value optimized out>) at tst_res.c:698
>> #8  0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
>> #9  <signal handler called>
>> #10 0x0000003327cdb650 in __write_nocancel () at
>> ../sysdeps/unix/syscall-template.S:82
>> #11 0x000000000040172e in child_fn () at clone03.c:212
>> #12 0x0000003327ce890d in clone () at
>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>
>> (gdb) p poison_start
>> $1 = (void *) 0xa02010
>> (gdb) p child_stack
>> $2 = (void *) 0xa03010
>>
>> (gdb) x/16x poison_start
>> 0xa02010:	0xdededede	0xdededede	0xdededede	0xdededede
>> 0xa02020:	0xdededede	0xdededede	0xdededede	0xdededede
>> 0xa02030:	0xdededede	0xdededede	0xdededede	0xdededede
>> 0xa02040:	0xdededede	0xdededede	0xdededede	0xdededede
>> ...
>> (gdb)
>> 0xa02490:	0xdededede	0xdededede	0xdededede	0xdededede
>> 0xa024a0:	0x00000018	0x00000030	0x00a02800	0x00000000
>> 0xa024b0:	0x00a02740	0x00000000	0xdededede	0xdededede
>> 0xa024c0:	0xdededede	0xdededede	0x27409296	0x00000033
>>
>> The above shows that 0xDE pattern has been overwritten.
>>
>> Extending child stack helps with the second part: SIGSEGV
>> #define CHILD_STACK_SIZE 16384*4
>> but I have no idea, where is that first SIGTERM coming from. Any
>> ideas?
> 
> It appears to be ltp-pan, which sees the child as orphan.
> When I added "-d 511", I've got some additional output:
> 
> <<<execution_status>>>
> initiation_status="ok"
> duration=0 termination_type=exited termination_id=0 corefile=no
> cutime=0 cstime=0
> <<<test_end>>>
> pids still running: 
> orphans still running: -26125 
> clone03     1  TBROK  :  unexpected signal 15 received (pid = 26126).
> clone03     2  TBROK  :  Remaining cases broken
> 
> pan was signaled with sig 2...
>   propagating sig 2 to orphaned pgrp -26125
> orphans still running: 
> 
> I'll send a patch, that adds wait() to parent.
>
> Regards,
> Jan

Hi Jan,
I think you're right. We have hit similar problems with setrlimit01, and
few other tests.

Unfortunately we did not upstream these patches as we are still working
with an older LTP.

I'll try to rebase it and share some other pending patches we are using
in our project.

Regards,
Carmelo

> 
> ------------------------------------------------------------------------------
> Keep yourself connected to Go Parallel: 
> BUILD Helping you discover the best ways to construct your parallel projects.
> http://goparallel.sourceforge.net
> _______________________________________________
> Ltp-list mailing list
> Ltp-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ltp-list
> 
> 


------------------------------------------------------------------------------
Keep yourself connected to Go Parallel: 
BUILD Helping you discover the best ways to construct your parallel projects.
http://goparallel.sourceforge.net
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LTP] clone03/06 randomly crashing
  2012-12-03 13:03   ` Carmelo AMOROSO
@ 2012-12-03 15:00     ` Wanlong Gao
  2012-12-06  9:43     ` chrubis
  1 sibling, 0 replies; 6+ messages in thread
From: Wanlong Gao @ 2012-12-03 15:00 UTC (permalink / raw)
  To: Carmelo AMOROSO; +Cc: ltp-list

On 12/03/2012 09:03 PM, Carmelo AMOROSO wrote:
> On 03/12/2012 10.37, Jan Stancek wrote:
>>
>>
>> ----- Original Message -----
>>> From: "Jan Stancek" <jstancek@redhat.com>
>>> To: ltp-list@lists.sourceforge.net
>>> Cc: "Jeffrey Burke" <jburke@redhat.com>
>>> Sent: Friday, 30 November, 2012 3:37:03 PM
>>> Subject: [LTP] clone03/06 randomly crashing
>>>
>>> Hi,
>>>
>>> I'm occasionally getting core files from clone03/clone06 testcases.
>>> The testcase itself gives PASS, it is the child which is randomly
>>> crashing.
>>> It seems to occur more on single cpu systems.
>>>
>>> For example:
>>> Core was generated by `clone03'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>>> ttype=2,
>>>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>>     at tst_res.c:412
>>> 412	{
>>> (gdb) bt
>>> #0  0x0000000000402bfd in tst_print (tcid=0x403d0e "clone03", tnum=1,
>>> ttype=2,
>>>     tmesg=0x14c6070 "unexpected signal 15 received (pid = 17427).")
>>>     at tst_res.c:412
>>> #1  0x00000000004031be in tst_res (ttype=2, fname=<value optimized
>>> out>, arg_fmt=<value optimized out>) at tst_res.c:316
>>> #2  0x0000000000403761 in tst_brk (ttype=2, fname=0x0, func=0x4013d0
>>> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
>>> #3  0x0000000000403960 in tst_brkm (ttype=2, func=0x4013d0 <cleanup>,
>>> arg_fmt=<value optimized out>) at tst_res.c:698
>>> #4  0x0000000000403b45 in def_handler (sig=15) at tst_sig.c:248
>>> #5  <signal handler called>
>>> #6  0x00000037940db650 in __write_nocancel () at
>>> ../sysdeps/unix/syscall-template.S:82
>>> #7  0x000000000040169e in child_fn () at clone03.c:208
>>> #8  0x00000037940e890d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>
>>> Dump of assembler code for function tst_print:
>>>    0x0000000000402bd0 <+0>:	mov    %rbx,-0x30(%rsp)
>>>    0x0000000000402bd5 <+5>:	mov    %rbp,-0x28(%rsp)
>>>    0x0000000000402bda <+10>:	mov    %edx,%ebx
>>>    0x0000000000402bdc <+12>:	mov    %r12,-0x20(%rsp)
>>>    0x0000000000402be1 <+17>:	mov    %r13,-0x18(%rsp)
>>>    0x0000000000402be6 <+22>:	mov    %rdi,%r12
>>>    0x0000000000402be9 <+25>:	mov    %r14,-0x10(%rsp)
>>>    0x0000000000402bee <+30>:	mov    %r15,-0x8(%rsp)
>>>    0x0000000000402bf3 <+35>:	sub    $0x2858,%rsp
>>>    0x0000000000402bfa <+42>:	mov    %esi,%r14d
>>> => 0x0000000000402bfd <+45>:	mov    %rcx,0x18(%rsp)
>>>
>>> (gdb) p $rsp
>>> $1 = (void *) 0x14c3800
>>> (gdb) x/1x $rsp
>>> 0x14c3800:	Cannot access memory at address 0x14c3800
>>>
>>> It looks like it receives SIGTERM and while handling SIGTERM it hits
>>> SIGSEGV.
>>> I don't know what is source of that SIGTERM. I was looking into the
>>> second part
>>> and looks like the stack for child is not large enough.
>>>
>>> I modified clone03.c (see attached clone03_poison.patch) to get some
>>> extra
>>> empty buffer before the child's stack, which was set to pattern 0xDE.
>>>
>>> Before:
>>>                           |-------------------------------|
>>>                      child_stack
>>>                                  child_stack+CHILD_STACK_SIZE
>>> After:
>>>     |---------------------|-------------------------------|
>>> poision_start        child_stack
>>>             child_stack+CHILD_STACK_SIZE
>>>
>>> Now if I start clone03 and kill it I can randomly reproduce the
>>> SIGSEGV (attached clone03_kill.sh).
>>> The backtrace usually looks like:
>>> ... (random place)
>>> #5  0x000000000040324e in tst_res (ttype=2, fname=<value optimized
>>> out>, arg_fmt=<value optimized out>) at tst_res.c:316
>>> #6  0x00000000004037f1 in tst_brk (ttype=2, fname=0x0, func=0x401420
>>> <cleanup>, arg_fmt=<value optimized out>) at tst_res.c:640
>>> #7  0x00000000004039f0 in tst_brkm (ttype=2, func=0x401420 <cleanup>,
>>> arg_fmt=<value optimized out>) at tst_res.c:698
>>> #8  0x0000000000403bd5 in def_handler (sig=13) at tst_sig.c:248
>>> #9  <signal handler called>
>>> #10 0x0000003327cdb650 in __write_nocancel () at
>>> ../sysdeps/unix/syscall-template.S:82
>>> #11 0x000000000040172e in child_fn () at clone03.c:212
>>> #12 0x0000003327ce890d in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:115
>>>
>>> (gdb) p poison_start
>>> $1 = (void *) 0xa02010
>>> (gdb) p child_stack
>>> $2 = (void *) 0xa03010
>>>
>>> (gdb) x/16x poison_start
>>> 0xa02010:	0xdededede	0xdededede	0xdededede	0xdededede
>>> 0xa02020:	0xdededede	0xdededede	0xdededede	0xdededede
>>> 0xa02030:	0xdededede	0xdededede	0xdededede	0xdededede
>>> 0xa02040:	0xdededede	0xdededede	0xdededede	0xdededede
>>> ...
>>> (gdb)
>>> 0xa02490:	0xdededede	0xdededede	0xdededede	0xdededede
>>> 0xa024a0:	0x00000018	0x00000030	0x00a02800	0x00000000
>>> 0xa024b0:	0x00a02740	0x00000000	0xdededede	0xdededede
>>> 0xa024c0:	0xdededede	0xdededede	0x27409296	0x00000033
>>>
>>> The above shows that 0xDE pattern has been overwritten.
>>>
>>> Extending child stack helps with the second part: SIGSEGV
>>> #define CHILD_STACK_SIZE 16384*4
>>> but I have no idea, where is that first SIGTERM coming from. Any
>>> ideas?
>>
>> It appears to be ltp-pan, which sees the child as orphan.
>> When I added "-d 511", I've got some additional output:
>>
>> <<<execution_status>>>
>> initiation_status="ok"
>> duration=0 termination_type=exited termination_id=0 corefile=no
>> cutime=0 cstime=0
>> <<<test_end>>>
>> pids still running: 
>> orphans still running: -26125 
>> clone03     1  TBROK  :  unexpected signal 15 received (pid = 26126).
>> clone03     2  TBROK  :  Remaining cases broken
>>
>> pan was signaled with sig 2...
>>   propagating sig 2 to orphaned pgrp -26125
>> orphans still running: 
>>
>> I'll send a patch, that adds wait() to parent.
>>
>> Regards,
>> Jan
> 
> Hi Jan,
> I think you're right. We have hit similar problems with setrlimit01, and
> few other tests.
> 
> Unfortunately we did not upstream these patches as we are still working
> with an older LTP.
> 
> I'll try to rebase it and share some other pending patches we are using
> in our project.

Sounds great, thank you very much.

Regards,
Wanlong Gao

> 
> Regards,
> Carmelo
> 
>>
>> ------------------------------------------------------------------------------
>> Keep yourself connected to Go Parallel: 
>> BUILD Helping you discover the best ways to construct your parallel projects.
>> http://goparallel.sourceforge.net
>> _______________________________________________
>> Ltp-list mailing list
>> Ltp-list@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/ltp-list
>>
>>
> 
> 
> ------------------------------------------------------------------------------
> Keep yourself connected to Go Parallel: 
> BUILD Helping you discover the best ways to construct your parallel projects.
> http://goparallel.sourceforge.net
> _______________________________________________
> Ltp-list mailing list
> Ltp-list@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ltp-list
> 


------------------------------------------------------------------------------
Keep yourself connected to Go Parallel: 
BUILD Helping you discover the best ways to construct your parallel projects.
http://goparallel.sourceforge.net
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LTP] clone03/06 randomly crashing
  2012-12-03 13:03   ` Carmelo AMOROSO
  2012-12-03 15:00     ` Wanlong Gao
@ 2012-12-06  9:43     ` chrubis
  1 sibling, 0 replies; 6+ messages in thread
From: chrubis @ 2012-12-06  9:43 UTC (permalink / raw)
  To: Carmelo AMOROSO; +Cc: ltp-list

Hi!
> > It appears to be ltp-pan, which sees the child as orphan.
> > When I added "-d 511", I've got some additional output:
> > 
> > <<<execution_status>>>
> > initiation_status="ok"
> > duration=0 termination_type=exited termination_id=0 corefile=no
> > cutime=0 cstime=0
> > <<<test_end>>>
> > pids still running: 
> > orphans still running: -26125 
> > clone03     1  TBROK  :  unexpected signal 15 received (pid = 26126).
> > clone03     2  TBROK  :  Remaining cases broken
> > 
> > pan was signaled with sig 2...
> >   propagating sig 2 to orphaned pgrp -26125
> > orphans still running: 
> > 
> > I'll send a patch, that adds wait() to parent.
> >
> > Regards,
> > Jan
> 
> Hi Jan,
> I think you're right. We have hit similar problems with setrlimit01, and
> few other tests.
> 
> Unfortunately we did not upstream these patches as we are still working
> with an older LTP.

Is there any reason for that? I try not to break anything and stay as
backward compatible as possible. I know that for maintenance people
usually use the version that was used for release testing and stick to
this through the product life cycle.

Anyway if there are some changes or bugs that prevents you from using up
to date LTP please let us know.

> I'll try to rebase it and share some other pending patches we are using
> in our project.

I'm looking forward to see the patches.

-- 
Cyril Hrubis
chrubis@suse.cz

------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [LTP] clone03/06 randomly crashing
       [not found] <50C06D55.2020903@mips.com>
@ 2012-12-06 10:47 ` Jan Stancek
  0 siblings, 0 replies; 6+ messages in thread
From: Jan Stancek @ 2012-12-06 10:47 UTC (permalink / raw)
  To: Pete/Piet Delaney; +Cc: ltp-list, Chris Dearman



----- Original Message -----
> From: "Pete/Piet Delaney" <piet@mips.com>
> To: "Carmelo AMOROSO" <carmelo.amoroso@st.com>
> Cc: ltp-list@lists.sourceforge.net, "Chris Dearman" <chris@mips.com>
> Sent: Thursday, 6 December, 2012 11:03:01 AM
> Subject: Re: [LTP] clone03/06 randomly crashing
> 
> On 12/06/2012 01:43 AM, chrubis@suse.cz wrote:
> > Hi!
> >>> It appears to be ltp-pan, which sees the child as orphan.
> >>> When I added "-d 511", I've got some additional output:
> >>>
> >>> <<<execution_status>>>
> >>> initiation_status="ok"
> >>> duration=0 termination_type=exited termination_id=0 corefile=no
> >>> cutime=0 cstime=0
> >>> <<<test_end>>>
> >>> pids still running:
> >>> orphans still running: -26125
> >>> clone03     1  TBROK  :  unexpected signal 15 received (pid =
> >>> 26126).
> >>> clone03     2  TBROK  :  Remaining cases broken
> >>>
> >>> pan was signaled with sig 2...
> >>>   propagating sig 2 to orphaned pgrp -26125
> >>> orphans still running:
> >>>
> >>> I'll send a patch, that adds wait() to parent.
> Could ya send me a copy of the patch to try.
> I've been experiencing a problem with clone06
> on Android LTP testing this evening. Would likely
> save me some time to see and try the patch.

I've met 2 issues, details are linked in description:
https://github.com/linux-test-project/ltp/commit/db68aaa3bbc7c42f15535cc77e81691940f7b1bc
https://github.com/linux-test-project/ltp/commit/443d16ede4dd9074faa00a98b8f9cdf22e5c168f

Regards,
Jan


------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
_______________________________________________
Ltp-list mailing list
Ltp-list@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ltp-list

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2012-12-06 10:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-30 14:37 [LTP] clone03/06 randomly crashing Jan Stancek
2012-12-03  9:37 ` Jan Stancek
2012-12-03 13:03   ` Carmelo AMOROSO
2012-12-03 15:00     ` Wanlong Gao
2012-12-06  9:43     ` chrubis
     [not found] <50C06D55.2020903@mips.com>
2012-12-06 10:47 ` Jan Stancek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox