From: ebiederm@xmission.com (Eric W. Biederman)
To: Guenter Roeck <linux@roeck-us.net>
Cc: Vovo Yang <vovoy@google.com>, Ingo Molnar <mingo@kernel.org>,
linux-kernel@vger.kernel.org
Subject: Re: Threads stuck in zap_pid_ns_processes()
Date: Fri, 12 May 2017 12:33:01 -0500
Message-ID: <874lwqyo8i.fsf@xmission.com>
In-Reply-To: <20170512165214.GA12960@roeck-us.net> (Guenter Roeck's message of "Fri, 12 May 2017 09:52:14 -0700")

Guenter Roeck <linux@roeck-us.net> writes:

> Hi Eric,
>
> On Fri, May 12, 2017 at 08:26:27AM -0500, Eric W. Biederman wrote:
>> Vovo Yang <vovoy@google.com> writes:
>>
>> > On Fri, May 12, 2017 at 7:19 AM, Eric W. Biederman
>> > <ebiederm@xmission.com> wrote:
>> >> Guenter Roeck <linux@roeck-us.net> writes:
>> >>
>> >>> What I know so far is
>> >>> - We see this condition on a regular basis in the field. Regular is
>> >>> relative, of course - let's say maybe 1 in a million Chromebooks
>> >>> per day reports a crash because of it. That is not that many,
>> >>> but it adds up.
>> >>> - We are able to reproduce the problem with a performance benchmark
>> >>> which opens 100 chrome tabs. While that is a lot, it should not
>> >>> result in a kernel hang/crash.
>> >>> - Vovo provided the test code last night. I don't know if this is
>> >>> exactly what is observed in the benchmark, or how it relates to the
>> >>> benchmark in the first place, but it is the first time we are actually
>> >>> able to reliably create a condition where the problem is seen.
>> >>
>> >> Thank you. It will be interesting to hear what is happening in the
>> >> chrome performance benchmark that triggers this.
>> >>
>> > What's happening in the benchmark:
>> > 1. A chrome renderer process was created with CLONE_NEWPID
>> > 2. The process crashed
>> > 3. Chrome breakpad service calls ptrace(PTRACE_ATTACH, ..) to attach to every
>> > thread of the crashed process to dump info
>> > 4. When breakpad detaches from the crashed process, the crashed process is
>> > stuck in zap_pid_ns_processes()
>>
>> Very interesting, thank you.
>>
>> So the question is specifically which interaction is causing this.
>>
>> In the test case provided it was a sibling task in the pid namespace
>> dying and not being reaped. Which may be what is happening with
>> breakpad. So far I have yet to see a kernel bug but I won't rule one out.
>>
>
> I am trying to understand what you are looking for. I would have thought
> that both the test application as well as the Chrome functionality
> described above show that there are situations where zap_pid_ns_processes()
> can get stuck and cause hung task timeouts in conjunction with the use of
> ptrace().
>
> Your last sentence seems to suggest that you believe that the kernel might
> do what it is expected to do. Assuming this is the case, what else would
> you like to see ? A test application which matches exactly the Chrome use
> case ? We can try to provide that, but I don't entirely understand how
> that would change the situation. After all, we already know that it is
> possible to get a thread into this condition, and we already have one
> means to reproduce it.
>
> Replacing TASK_UNINTERRUPTIBLE with TASK_INTERRUPTIBLE works for both the
> test application and the Chrome benchmark. The thread is still stuck in
> zap_pid_ns_processes(), but it is now in S (sleep) state instead of D,
> and no longer results in a hung task timeout. It remains in that state
> until the parent process terminates. I am not entirely happy with it
> since the processes are still stuck and may pile up over time, but at
> least it solves the immediate problem for us.
>
> Question now is what to do with that solution. We can of course apply
> it locally to Chrome OS, but I would rather have it upstream - especially
> since we have to assume that any users of Chrome on Linux, or more
> generically anyone using ptrace in conjunction with CLONE_NEWPID, may
> experience the same problem. Right now I have no idea how to get there,
> though. Can you provide some guidance ?

Apologies for not being clear. I intend to send a pull request with the
TASK_UNINTERRUPTIBLE to TASK_INTERRUPTIBLE change to Linus in the
next week or so, with a Cc stable and an appropriate Fixes tag, so the
fix can be backported.

I already have a more comprehensive change queued that I will probably
merge for 4.13, but it just changes what kind of zombies you see. It
won't remove the ``stuck'' zombies.
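
As an aside, for checking from userspace which state such a task ends up
in (D before the change, S after it, or Z for a zombie), a minimal sketch
follows. It assumes a Linux /proc, and task_state() is a hypothetical
helper name, not anything taken from the kernel or from breakpad:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>	/* getpid(), used only by callers of this helper */

/*
 * Return the one-letter state of a task as reported in
 * /proc/<pid>/stat ('R', 'S', 'D', 'Z', ...), or '\0' on failure.
 * task_state() is an illustrative helper, not an existing API.
 */
static char task_state(pid_t pid)
{
	char path[64];
	char buf[512];
	char *p;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
	f = fopen(path, "r");
	if (!f)
		return '\0';
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return '\0';
	}
	fclose(f);

	/*
	 * The line looks like "pid (comm) state ...". comm may itself
	 * contain ')', so search for the last one in the line.
	 */
	p = strrchr(buf, ')');
	if (!p || p[1] != ' ' || p[2] == '\0')
		return '\0';
	return p[2];
}
```

Calling task_state() on the pid of a task sitting in
zap_pid_ns_processes() should tell the D and S cases apart. For an
individual thread the corresponding file is /proc/<pid>/task/<tid>/stat.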

So what I am looking for now is:

Why are things getting stuck in your benchmark?

- Is it a userspace bug?
  In which case we can figure out what userspace (aka breakpad) needs
  to do to avoid the problem.

- Is it a kernel bug with ptrace?
  There have been a lot of little subtle bugs with ptrace over the
  years, so one more would not surprise me.

So I am just looking to make certain we fix the root issue, not just
the hung task timeout warning.

Eric