* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed
@ 2018-06-27 15:22 Cyril Hrubis
2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Cyril Hrubis @ 2018-06-27 15:22 UTC (permalink / raw)
To: ltp
If there are any leftover children the main test process will likely be
killed while sleeping in wait(). That is because all child processes are
either waited explicitely by the test code or implicitly by the test
library.
We also send SIGKILL to the whole process group, so if one of the
children continues to live for long enough that very likely means that
it ended up stuck in the kernel.
So if there are any processes left with in the process group we created
once the process group leader i.e. main test process has been waited
for we loop for a short while to give the init daemon chance to reap the
process after it has been reparented and if that does not happen for a
few seconds we declare the process to be stuck in the kernel.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
CC: Eric Biggers <ebiggers3@gmail.com>
---
lib/tst_test.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/lib/tst_test.c b/lib/tst_test.c
index 80808854e..329168a24 100644
--- a/lib/tst_test.c
+++ b/lib/tst_test.c
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz>
+ * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz>
*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
@@ -1047,6 +1047,21 @@ static int fork_testrun(void)
alarm(0);
SAFE_SIGNAL(SIGINT, SIG_DFL);
+ unsigned int sleep = 100;
+ unsigned int retries = 0;
+
+ while (kill(-test_pid, 0) == 0) {
+
+ usleep(sleep);
+ sleep*=2;
+
+ if (retries++ <= 14)
+ continue;
+
+ tst_res(TFAIL, "Test process child stuck in the kernel!");
+ tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug.");
+ }
+
if (WIFEXITED(status) && WEXITSTATUS(status))
return WEXITSTATUS(status);
--
2.13.6
^ permalink raw reply related [flat|nested] 8+ messages in thread* [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel 2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis @ 2018-06-27 15:22 ` Cyril Hrubis 2018-06-28 13:05 ` Jan Stancek 2018-06-28 7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel 2018-06-28 9:41 ` Li Wang 2 siblings, 1 reply; 8+ messages in thread From: Cyril Hrubis @ 2018-06-27 15:22 UTC (permalink / raw) To: ltp This commit adds a small helper library to find a process(es) given a process group ID and dump their stacks. Example output: $ ./shmctl05 tst_test.c:1015: INFO: Timeout per run is 0h 00m 10s Test timeouted, sending SIGKILL! tst_test.c:1059: TFAIL: Test process child stuck in the kernel! tst_find_pid.c:90: INFO: Pid 1272 stuck in kernel! Kernel stacktrace follows: [<ffffffffa3c12564>] __switch_to_asm+0x34/0x70 [<ffffffffa3c12570>] __switch_to_asm+0x40/0x70 [<ffffffffa3625761>] __switch_to+0x2c1/0x6e0 [<ffffffffa393e194>] call_rwsem_down_read_failed+0x14/0x30 [<ffffffffa3704802>] acct_collect+0x42/0x1a0 [<ffffffffa367d36a>] do_exit+0x74a/0xaf0 [<ffffffffa3c13d27>] rewind_stack_do_exit+0x17/0x20 [<ffffffffffffffff>] 0xffffffffffffffff tst_test.c:1061: FAIL: Congratulation, likely test hit a kernel bug. TODO: The main test process uses signal handler and alarm to call _exit if the child process that executes the actuall test timeouts. We need to redesign this if we want to dump the stack in that case as well. Signed-off-by: Cyril Hrubis <chrubis@suse.cz> CC: Jan Stancek <jstancek@redhat.com> --- include/tst_dump_stacks.h | 25 +++++++++++ lib/tst_dump_stacks.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++ lib/tst_test.c | 3 +- 3 files changed, 135 insertions(+), 1 deletion(-) create mode 100644 include/tst_dump_stacks.h create mode 100644 lib/tst_dump_stacks.c diff --git a/include/tst_dump_stacks.h b/include/tst_dump_stacks.h new file mode 100644 index 000000000..643cc58a8 --- /dev/null +++ b/include/tst_dump_stacks.h @@ -0,0 +1,25 @@ +/* + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz> + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#ifndef TST_DUMP_STACKS__ +#define TST_DUMP_STACKS__ + +void tst_dump_stacks_by_pgid(pid_t pgid); + +void tst_dump_stack_by_pid(pid_t pid); + +#endif /* TST_DUMP_STACKS__ */ diff --git a/lib/tst_dump_stacks.c b/lib/tst_dump_stacks.c new file mode 100644 index 000000000..aa97c6820 --- /dev/null +++ b/lib/tst_dump_stacks.c @@ -0,0 +1,108 @@ +/* + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz> + * + * This program is free software: you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation, either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#include <ctype.h> +#include <stdio.h> + +#define TST_NO_DEFAULT_MAIN 1 +#include "tst_test.h" + +static void *process_search_init(void) +{ + DIR *dir = SAFE_OPENDIR("/proc/"); + + return dir; +} + +static int is_number(const char *str) +{ + do { + if (!isdigit(*str)) + return 0; + } while (*(++str)); + + return 1; +} + +static int process_search_pgid_next(void *pid_search, pid_t pgid) +{ + struct dirent *ent; + DIR *dir = pid_search; + char path[1024]; + int ppgid, pid; + FILE *f; + + while ((ent = readdir(dir))) { + if (ent->d_type != DT_DIR) + continue; + if (!is_number(ent->d_name)) + continue; + + snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name); + + f = fopen(path, "r"); + if (!f) + continue; + + if (fscanf(f, "%i %*s %*c %*i %i", &pid, &ppgid) != 2) { + tst_res(TWARN, "fscanf(%s) failed!", ent->d_name); + fclose(f); + continue; + } + + fclose(f); + + if (ppgid == pgid) + break; + } + + if (ent) + return pid; + + closedir(dir); + return -1; +} + +void tst_dump_stack_by_pid(pid_t pid) +{ + int fd, len; + char buf[512]; + char path[1024]; + + tst_res(TINFO, "Pid %i stuck in kernel!", pid); + + fprintf(stderr, "Kernel stacktrace follows:\n"); + fflush(stderr); + + snprintf(path, sizeof(path), "/proc/%i/stack", pid); + + fd = SAFE_OPEN(path, O_RDONLY); + + while ((len = SAFE_READ(0, fd, buf, sizeof(buf))) > 0) + SAFE_WRITE(1, 2, buf, len); + + SAFE_CLOSE(fd); +} + +void tst_dump_stacks_by_pgid(pid_t pgid) +{ + void *ps = process_search_init(); + int pid; + + while ((pid = process_search_pgid_next(ps, pgid)) != -1) + tst_dump_stack_by_pid(pid); +} diff --git a/lib/tst_test.c b/lib/tst_test.c index 329168a24..d9476c02c 100644 --- a/lib/tst_test.c +++ b/lib/tst_test.c @@ -1058,7 +1058,8 @@ static int fork_testrun(void) if (retries++ <= 14) continue; - tst_res(TFAIL, "Test process child stuck in the kernel!"); + tst_res(TFAIL, "Test process child(ren) stuck in the kernel!"); + tst_dump_stacks_by_pgid(test_pid); tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug."); } -- 2.13.6 ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel 2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis @ 2018-06-28 13:05 ` Jan Stancek 0 siblings, 0 replies; 8+ messages in thread From: Jan Stancek @ 2018-06-28 13:05 UTC (permalink / raw) To: ltp ----- Original Message ----- > This commit adds a small helper library to find a process(es) given a > process group ID and dump their stacks. > > Example output: > > $ ./shmctl05 > tst_test.c:1015: INFO: Timeout per run is 0h 00m 10s > Test timeouted, sending SIGKILL! > tst_test.c:1059: TFAIL: Test process child stuck in the kernel! > tst_find_pid.c:90: INFO: Pid 1272 stuck in kernel! > Kernel stacktrace follows: > [<ffffffffa3c12564>] __switch_to_asm+0x34/0x70 > [<ffffffffa3c12570>] __switch_to_asm+0x40/0x70 > [<ffffffffa3625761>] __switch_to+0x2c1/0x6e0 > [<ffffffffa393e194>] call_rwsem_down_read_failed+0x14/0x30 > [<ffffffffa3704802>] acct_collect+0x42/0x1a0 > [<ffffffffa367d36a>] do_exit+0x74a/0xaf0 > [<ffffffffa3c13d27>] rewind_stack_do_exit+0x17/0x20 > [<ffffffffffffffff>] 0xffffffffffffffff > tst_test.c:1061: FAIL: Congratulation, likely test hit a kernel bug. > > TODO: The main test process uses signal handler and alarm to call _exit if > the > child process that executes the actuall test timeouts. We need to > redesign > this if we want to dump the stack in that case as well. Hi, What if we dropped _exit() from signal handler, and left all killing to code added in 1/2 of this series? Signal handler will only note that we hit timeout: static void alarm_handler(int sig LTP_ATTRIBUTE_UNUSED) { WRITE_MSG("Test timed out!\n"); ++timeout_hit; } and fork_testrun() will be periodically checking for it: do { usleep(10000); ret = SAFE_WAITPID(test_pid, &status, WNOHANG); } while (ret == 0 || timeout_hit == 0); // try to kill process group here > > Signed-off-by: Cyril Hrubis <chrubis@suse.cz> > CC: Jan Stancek <jstancek@redhat.com> > --- > include/tst_dump_stacks.h | 25 +++++++++++ > lib/tst_dump_stacks.c | 108 > ++++++++++++++++++++++++++++++++++++++++++++++ > lib/tst_test.c | 3 +- > 3 files changed, 135 insertions(+), 1 deletion(-) > create mode 100644 include/tst_dump_stacks.h > create mode 100644 lib/tst_dump_stacks.c > > diff --git a/include/tst_dump_stacks.h b/include/tst_dump_stacks.h > new file mode 100644 > index 000000000..643cc58a8 > --- /dev/null > +++ b/include/tst_dump_stacks.h > @@ -0,0 +1,25 @@ > +/* > + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz> > + * > + * This program is free software: you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation, either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see <http://www.gnu.org/licenses/>. > + */ > + > +#ifndef TST_DUMP_STACKS__ > +#define TST_DUMP_STACKS__ > + > +void tst_dump_stacks_by_pgid(pid_t pgid); > + > +void tst_dump_stack_by_pid(pid_t pid); > + > +#endif /* TST_DUMP_STACKS__ */ > diff --git a/lib/tst_dump_stacks.c b/lib/tst_dump_stacks.c > new file mode 100644 > index 000000000..aa97c6820 > --- /dev/null > +++ b/lib/tst_dump_stacks.c > @@ -0,0 +1,108 @@ > +/* > + * Copyright (c) 2018 Cyril Hrubis <chrubis@suse.cz> > + * > + * This program is free software: you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation, either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * You should have received a copy of the GNU General Public License > + * along with this program. If not, see <http://www.gnu.org/licenses/>. > + */ > + > +#include <ctype.h> > +#include <stdio.h> > + > +#define TST_NO_DEFAULT_MAIN 1 > +#include "tst_test.h" > + > +static void *process_search_init(void) > +{ > + DIR *dir = SAFE_OPENDIR("/proc/"); > + > + return dir; > +} > + > +static int is_number(const char *str) > +{ > + do { > + if (!isdigit(*str)) > + return 0; > + } while (*(++str)); > + > + return 1; > +} > + > +static int process_search_pgid_next(void *pid_search, pid_t pgid) > +{ > + struct dirent *ent; > + DIR *dir = pid_search; > + char path[1024]; > + int ppgid, pid; > + FILE *f; > + > + while ((ent = readdir(dir))) { > + if (ent->d_type != DT_DIR) > + continue; > + if (!is_number(ent->d_name)) > + continue; > + > + snprintf(path, sizeof(path), "/proc/%s/stat", ent->d_name); > + > + f = fopen(path, "r"); > + if (!f) > + continue; > + > + if (fscanf(f, "%i %*s %*c %*i %i", &pid, &ppgid) != 2) { > + tst_res(TWARN, "fscanf(%s) failed!", ent->d_name); > + fclose(f); > + continue; > + } > + > + fclose(f); > + > + if (ppgid == pgid) > + break; > + } > + > + if (ent) > + return pid; > + > + closedir(dir); > + return -1; > +} > + > +void tst_dump_stack_by_pid(pid_t pid) > +{ > + int fd, len; > + char buf[512]; > + char path[1024]; > + > + tst_res(TINFO, "Pid %i stuck in kernel!", pid); > + > + fprintf(stderr, "Kernel stacktrace follows:\n"); > + fflush(stderr); > + > + snprintf(path, sizeof(path), "/proc/%i/stack", pid); > + > + fd = SAFE_OPEN(path, O_RDONLY); > + > + while ((len = SAFE_READ(0, fd, buf, sizeof(buf))) > 0) > + SAFE_WRITE(1, 2, buf, len); > + > + SAFE_CLOSE(fd); > +} > + > +void tst_dump_stacks_by_pgid(pid_t pgid) > +{ > + void *ps = process_search_init(); > + int pid; > + > + while ((pid = process_search_pgid_next(ps, pgid)) != -1) > + tst_dump_stack_by_pid(pid); > +} > diff --git a/lib/tst_test.c b/lib/tst_test.c > index 329168a24..d9476c02c 100644 > --- a/lib/tst_test.c > +++ b/lib/tst_test.c > @@ -1058,7 +1058,8 @@ static int fork_testrun(void) > if (retries++ <= 14) > continue; > > - tst_res(TFAIL, "Test process child stuck in the kernel!"); > + tst_res(TFAIL, "Test process child(ren) stuck in the kernel!"); > + tst_dump_stacks_by_pgid(test_pid); > tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug."); > } Looks good to me. Regards, Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed 2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis 2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis @ 2018-06-28 7:10 ` Petr Vorel 2018-06-28 9:41 ` Li Wang 2 siblings, 0 replies; 8+ messages in thread From: Petr Vorel @ 2018-06-28 7:10 UTC (permalink / raw) To: ltp Hi Cyril, ... > + while (kill(-test_pid, 0) == 0) { > + > + usleep(sleep); > + sleep*=2; > + > + if (retries++ <= 14) ^ Maybe some constant for it? Kind regards, Petr ^ permalink raw reply [flat|nested] 8+ messages in thread
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed 2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis 2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis 2018-06-28 7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel @ 2018-06-28 9:41 ` Li Wang 2018-06-28 10:08 ` Li Wang 2018-06-28 10:20 ` Cyril Hrubis 2 siblings, 2 replies; 8+ messages in thread From: Li Wang @ 2018-06-28 9:41 UTC (permalink / raw) To: ltp Hi Cyril, On Wed, Jun 27, 2018 at 11:22 PM, Cyril Hrubis <chrubis@suse.cz> wrote: > If there are any leftover children the main test process will likely be > killed while sleeping in wait(). That is because all child processes are > either waited explicitely by the test code or implicitly by the test > library. > > We also send SIGKILL to the whole process group, so if one of the > children continues to live for long enough that very likely means that > it ended up stuck in the kernel. > > So if there are any processes left with in the process group we created > once the process group leader i.e. main test process has been waited > for we loop for a short while to give the init daemon chance to reap the > process after it has been reparented and if that does not happen for a > few seconds we declare the process to be stuck in the kernel. > > Signed-off-by: Cyril Hrubis <chrubis@suse.cz> > CC: Eric Biggers <ebiggers3@gmail.com> > --- > lib/tst_test.c | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/lib/tst_test.c b/lib/tst_test.c > index 80808854e..329168a24 100644 > --- a/lib/tst_test.c > +++ b/lib/tst_test.c > @@ -1,5 +1,5 @@ > /* > - * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz> > + * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz> > * > * This program is free software: you can redistribute it and/or modify > * it under the terms of the GNU General Public License as published by > @@ -1047,6 +1047,21 @@ static int fork_testrun(void) > alarm(0); > SAFE_SIGNAL(SIGINT, SIG_DFL); > > + unsigned int sleep = 100; > + unsigned int retries = 0; > + > + while (kill(-test_pid, 0) == 0) { I'm a little worried about here, image that, if a process_A(test_pid) exist to make function kill(-test_pid, 0) return 0 at first time, then we go into this while loop, but during the sleeping time process_A exit and system reuse the test_pid to another process_B, we will still keep looping and very probably make mistake to report TFAIL(with stack of process_B dump to ltp user in PATCH 2/2). > + > + usleep(sleep); > + sleep*=2; > + > + if (retries++ <= 14) > + continue; > + > + tst_res(TFAIL, "Test process child stuck in the kernel!"); > + tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug."); > + } > + > if (WIFEXITED(status) && WEXITSTATUS(status)) > return WEXITSTATUS(status); > > -- > 2.13.6 > > > -- > Mailing list info: https://lists.linux.it/listinfo/ltp -- Regards, Li Wang ^ permalink raw reply [flat|nested] 8+ messages in thread
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed 2018-06-28 9:41 ` Li Wang @ 2018-06-28 10:08 ` Li Wang 2018-06-28 10:23 ` Cyril Hrubis 2018-06-28 10:20 ` Cyril Hrubis 1 sibling, 1 reply; 8+ messages in thread From: Li Wang @ 2018-06-28 10:08 UTC (permalink / raw) To: ltp On Thu, Jun 28, 2018 at 5:41 PM, Li Wang <liwang@redhat.com> wrote: > Hi Cyril, > > On Wed, Jun 27, 2018 at 11:22 PM, Cyril Hrubis <chrubis@suse.cz> wrote: >> If there are any leftover children the main test process will likely be >> killed while sleeping in wait(). That is because all child processes are >> either waited explicitely by the test code or implicitly by the test >> library. >> >> We also send SIGKILL to the whole process group, so if one of the >> children continues to live for long enough that very likely means that >> it ended up stuck in the kernel. >> >> So if there are any processes left with in the process group we created >> once the process group leader i.e. main test process has been waited >> for we loop for a short while to give the init daemon chance to reap the >> process after it has been reparented and if that does not happen for a >> few seconds we declare the process to be stuck in the kernel. >> >> Signed-off-by: Cyril Hrubis <chrubis@suse.cz> >> CC: Eric Biggers <ebiggers3@gmail.com> >> --- >> lib/tst_test.c | 17 ++++++++++++++++- >> 1 file changed, 16 insertions(+), 1 deletion(-) >> >> diff --git a/lib/tst_test.c b/lib/tst_test.c >> index 80808854e..329168a24 100644 >> --- a/lib/tst_test.c >> +++ b/lib/tst_test.c >> @@ -1,5 +1,5 @@ >> /* >> - * Copyright (c) 2015-2016 Cyril Hrubis <chrubis@suse.cz> >> + * Copyright (c) 2015-2018 Cyril Hrubis <chrubis@suse.cz> >> * >> * This program is free software: you can redistribute it and/or modify >> * it under the terms of the GNU General Public License as published by >> @@ -1047,6 +1047,21 @@ static int fork_testrun(void) >> alarm(0); >> SAFE_SIGNAL(SIGINT, SIG_DFL); >> >> + unsigned int sleep = 100; >> + unsigned int retries = 0; >> + >> + while (kill(-test_pid, 0) == 0) { > > I'm a little worried about here, image that, if a process_A(test_pid) > exist to make function kill(-test_pid, 0) return 0 at first time, then > we go into this while loop, but during the sleeping time process_A > exit and system reuse the test_pid to another process_B, we will still > keep looping and very probably make mistake to report TFAIL(with stack > of process_B dump to ltp user in PATCH 2/2). Maybe we could verify the content of '/proc/test_pid/cmdline' in this loop to make sure test_pid is still using by the process we wanted? > >> + >> + usleep(sleep); >> + sleep*=2; >> + >> + if (retries++ <= 14) >> + continue; >> + >> + tst_res(TFAIL, "Test process child stuck in the kernel!"); >> + tst_brk(TFAIL, "Congratulation, likely test hit a kernel bug."); >> + } >> + >> if (WIFEXITED(status) && WEXITSTATUS(status)) >> return WEXITSTATUS(status); >> >> -- >> 2.13.6 >> >> >> -- >> Mailing list info: https://lists.linux.it/listinfo/ltp > > > > -- > Regards, > Li Wang -- Regards, Li Wang ^ permalink raw reply [flat|nested] 8+ messages in thread
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed 2018-06-28 10:08 ` Li Wang @ 2018-06-28 10:23 ` Cyril Hrubis 0 siblings, 0 replies; 8+ messages in thread From: Cyril Hrubis @ 2018-06-28 10:23 UTC (permalink / raw) To: ltp Hi! > > I'm a little worried about here, image that, if a process_A(test_pid) > > exist to make function kill(-test_pid, 0) return 0 at first time, then > > we go into this while loop, but during the sleeping time process_A > > exit and system reuse the test_pid to another process_B, we will still > > keep looping and very probably make mistake to report TFAIL(with stack > > of process_B dump to ltp user in PATCH 2/2). > > Maybe we could verify the content of '/proc/test_pid/cmdline' in this loop > to make sure test_pid is still using by the process we wanted? That unfortunatelly does not work, half of the /proc/$pid/* files block on reading when this happens as there is a deadlock in the kernel and the processes that try to read these files end up deadlocked too. -- Cyril Hrubis chrubis@suse.cz ^ permalink raw reply [flat|nested] 8+ messages in thread
* [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed 2018-06-28 9:41 ` Li Wang 2018-06-28 10:08 ` Li Wang @ 2018-06-28 10:20 ` Cyril Hrubis 1 sibling, 0 replies; 8+ messages in thread From: Cyril Hrubis @ 2018-06-28 10:20 UTC (permalink / raw) To: ltp Hi! > > + unsigned int sleep = 100; > > + unsigned int retries = 0; > > + > > + while (kill(-test_pid, 0) == 0) { > > I'm a little worried about here, image that, if a process_A(test_pid) > exist to make function kill(-test_pid, 0) return 0 at first time, then > we go into this while loop, but during the sleeping time process_A > exit and system reuse the test_pid to another process_B, we will still > keep looping and very probably make mistake to report TFAIL(with stack > of process_B dump to ltp user in PATCH 2/2). That is known limitation of UNIX. In practice it's very unlikely that the pid would be reused in very short timeframe unless there is a fork bomb running on the system or the system is out of pids, which both means greater trouble. Just try to run 'watch /proc/self/stat' and look how fast is the first number increasing. On an idle system it's increased by a single digit number every two seconds and even if you run a parallel compilation in background it takes a long time until we start to reuse recenlty used pids. I guess that we can remove the part that doubles the sleep and increase the number of retries accordingly, that way we would be much more likely to hit even very short interval when the pid was not allocated. We can also include various sanity checks, we may examine the process whoose process group matches the test_pid to some degree. We can for instance check if the process has been reparented to init i.e. parent pid == 1 which happens when the parent is killed. However I would like to avoid anything too complicated since at a point we get to this situation the kernel has been likely corrupted so all bets are off, the system is in inconsistent state and the best action to take is to try to inform the tester that something went wrong. -- Cyril Hrubis chrubis@suse.cz ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2018-06-28 13:05 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-06-27 15:22 [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Cyril Hrubis 2018-06-27 15:22 ` [LTP] [PATCH 2/2] [WORK-IN-PROGRESS] lib/tst_test: Dump stack for test processes stuck in kernel Cyril Hrubis 2018-06-28 13:05 ` Jan Stancek 2018-06-28 7:10 ` [LTP] [PATCH 1/2 v2] tst_test: Fail the test subprocess cannot be killed Petr Vorel 2018-06-28 9:41 ` Li Wang 2018-06-28 10:08 ` Li Wang 2018-06-28 10:23 ` Cyril Hrubis 2018-06-28 10:20 ` Cyril Hrubis
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox