From: "Eric W. Biederman"
Date: Tue, 21 Jun 2022 17:47:30 +0000
Subject: Re: [PATCH v4 12/12] sched,signal,ptrace: Rework TASK_TRACED, TASK_STOPPED state
Message-Id: <87bkulgb7x.fsf@email.froward.int.ebiederm.org>
References: <87a6bv6dl6.fsf_-_@email.froward.int.ebiederm.org>
 <20220505182645.497868-12-ebiederm@xmission.com>
 <877d5ajesi.fsf@email.froward.int.ebiederm.org>
In-Reply-To: (Alexander Gordeev's message of "Tue, 21 Jun 2022 17:15:47 +0200")
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: Alexander Gordeev
Cc: linux-kernel@vger.kernel.org, rjw@rjwysocki.net, Oleg Nesterov,
 mingo@kernel.org, vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
 rostedt@goodmis.org, mgorman@suse.de, bigeasy@linutronix.de,
 Will Deacon, tj@kernel.org, linux-pm@vger.kernel.org, Peter Zijlstra,
 Richard Weinberger, Anton Ivanov, Johannes Berg,
 linux-um@lists.infradead.org, Chris Zankel, Max Filippov,
 linux-xtensa@linux-xtensa.org, Kees Cook, Jann Horn,
 linux-ia64@vger.kernel.org

Alexander Gordeev writes:

> On Tue, Jun 21, 2022 at 09:02:05AM -0500, Eric W. Biederman wrote:
>> Alexander Gordeev writes:
>>
>> > On Thu, May 05, 2022 at 01:26:45PM -0500, Eric W. Biederman wrote:
>> >> From: Peter Zijlstra
>> >>
>> >> Currently ptrace_stop() / do_signal_stop() rely on the special states
>> >> TASK_TRACED and TASK_STOPPED resp. to keep unique state. That is, this
>> >> state exists only in task->__state and nowhere else.
>> >>
>> >> There's two spots of bother with this:
>> >>
>> >>  - PREEMPT_RT has task->saved_state which complicates matters,
>> >>    meaning task_is_{traced,stopped}() needs to check an additional
>> >>    variable.
>> >>
>> >>  - An alternative freezer implementation that itself relies on a
>> >>    special TASK state would lose TASK_TRACED/TASK_STOPPED and will
>> >>    result in misbehaviour.
>> >>
>> >> As such, add additional state to task->jobctl to track this state
>> >> outside of task->__state.
>> >>
>> >> NOTE: this doesn't actually fix anything yet, just adds extra state.
>> >>
>> >> --EWB
>> >>   * didn't add an unnecessary newline in signal.h
>> >>   * Update t->jobctl in signal_wake_up and ptrace_signal_wake_up
>> >>     instead of in signal_wake_up_state. This prevents the clearing
>> >>     of TASK_STOPPED and TASK_TRACED from getting lost.
>> >>   * Added warnings if JOBCTL_STOPPED or JOBCTL_TRACED are not cleared
>> >
>> > Hi Eric, Peter,
>> >
>> > On s390 this patch triggers a warning at kernel/ptrace.c:272 when
>> > the kill_child testcase from the strace tool is run repeatedly (the
>> > source is attached for reference):
>> >
>> >     while :; do
>> >         strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child
>> >     done
>> >
>> > It normally takes a few minutes to trigger the warning in -rc3, but FWIW
>> > it hits almost immediately for the ptrace_stop-cleanup-for-v5.19 tag of
>> > git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.
>> >
>> > Commit 7b0fe1367ef2 ("ptrace: Document that wait_task_inactive can't
>> > fail") suggests this WARN_ON_ONCE() is not really expected, yet we
>> > observe a child in __TASK_TRACED state. Could you please comment here?
>> >
>>
>> For clarity, the warning is that the child is not in __TASK_TRACED state.
>>
>> The code is waiting for the child to stop in the scheduler in the
>> __TASK_TRACED state so that it can safely read and change the
>> process's state. Some of that state is not even saved until the
>> process is scheduled out, so we have to wait until the process
>> is stopped in the scheduler.
>
> So I assume (checked actually) the return 0 below from kernel/sched/core.c:
> wait_task_inactive() is where it bails out:
>
>         while (task_running(rq, p)) {
>                 if (match_state && unlikely(READ_ONCE(p->__state) != match_state))
>                         return 0;
>                 cpu_relax();
>         }
>
> Yet, the child task is always found in __TASK_TRACED state (as seen
> in crash dumps):
>
>>  101447   11342  13  ce3a8100  RU   0.0   10040   4412  strace
>   101450  101447   0  bb04b200  TR   0.0    2272   1136  kill_child
>   108261  101447   2  d0b10100  TR   0.0    2272    532  kill_child
> crash> task bb04b200 __state
> PID: 101450  TASK: bb04b200  CPU: 0   COMMAND: "kill_child"
>   __state = 8,
>
> crash> task d0b10100 __state
> PID: 108261  TASK: d0b10100  CPU: 2   COMMAND: "kill_child"
>   __state = 8,

That is weird.

>> At least on s390 it looks like there is a race between SIGKILL and
>> ptrace_check_attach. That isn't good.
>>
>> Reading the code below, something is missing: I don't see anything
>> making ptrace calls, and ptrace_check_attach (which contains the
>> warning) only happens in the ptrace syscall.
>
> That is what I believe strace does when calling that code:
>
>     strace -f -qq -e signal=none -e trace=sched_yield,/kill ./kill_child

Thank you. That was my braino.

I will have to see if it reproduces for me on x86 (I don't have an s390).
Perhaps if I can reproduce it, I can guess what is going wrong.

So far it appears the WARN_ON_ONCE has nothing to warn about, yet it is
warning.

Eric