Re: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)


From: Pengfei Xu <pengfei.xu@intel.com>
To: Frederic Weisbecker <frederic@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	"Paul E. McKenney" <paulmck@kernel.org>,
	Neeraj Upadhyay <quic_neeraju@quicinc.com>,
	"Christian Brauner" <brauner@kernel.org>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	<linux-kernel@vger.kernel.org>, <heng.su@intel.com>,
	<rcu@vger.kernel.org>
Subject: Re: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)
Date: Wed, 23 Nov 2022 23:45:50 +0800
Message-ID: <Y35ALpl8borkSHjy@xpf.sh.intel.com>
In-Reply-To: <20221123143758.GA1387380@lothringen>

Hi Frederic Weisbecker,

On 2022-11-23 at 15:37:58 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> > Hi Frederic Weisbecker and kernel developers,
> > 
> > Greetings!
> > There is a task hung in "synchronize_rcu" in the v6.1-rc5 kernel.
> > 
> > I bisected the issue on a Raptor platform and on a server (big cores only,
> > no Atom small cores); the bisect results on both platforms show that the
> > first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> > "sched: Provide Kconfig support for default dynamic preempt mode"
> > 
> > [  300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> > [  300.097455]       Not tainted 6.1.0-rc5-094226ad94f4 #1
> > [  300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [  300.097922] task:rcu_tasks_kthre state:D stack:0     pid:11    ppid:2      flags:0x00004000
> > [  300.098230] Call Trace:
> > [  300.098325]  <TASK>
> > [  300.098410]  __schedule+0x2de/0x8f0
> > [  300.098562]  schedule+0x5b/0xe0
> > [  300.098693]  schedule_timeout+0x3f1/0x4b0
> > [  300.098849]  ? __sanitizer_cov_trace_pc+0x25/0x60
> > [  300.099032]  ? queue_delayed_work_on+0x82/0xc0
> > [  300.099206]  wait_for_completion+0x81/0x140
> > [  300.099373]  __synchronize_srcu.part.23+0x83/0xb0
> > [  300.099558]  ? __bpf_trace_rcu_stall_warning+0x20/0x20
> > [  300.099757]  synchronize_srcu+0xd6/0x100
> > [  300.099913]  rcu_tasks_postscan+0x19/0x20
> > [  300.100070]  rcu_tasks_wait_gp+0x108/0x290
> > [  300.100230]  ? _raw_spin_unlock+0x1d/0x40
> > [  300.100389]  rcu_tasks_one_gp+0x27f/0x370
> > [  300.100546]  ? rcu_tasks_postscan+0x20/0x20
> > [  300.100709]  rcu_tasks_kthread+0x37/0x50
> > [  300.100863]  kthread+0x14d/0x190
> > [  300.100998]  ? kthread_complete_and_exit+0x40/0x40
> > [  300.101199]  ret_from_fork+0x1f/0x30
> > [  300.101347]  </TASK>
> 
> Thanks for reporting this. Fortunately I managed to reproduce and debug it.
> It took me a few days to understand the complicated circular dependency
> involved.
> 
> So here is a summary:
> 
> 1) TASK A calls unshare(CLONE_NEWPID), this creates a new PID namespace
>    that every subsequent child of TASK A will belong to. But TASK A doesn't
>    itself belong to that new PID namespace.
> 
> 2) TASK A forks() and creates TASK B (it is a new threadgroup so it is a
>    thread group leader). TASK A stays attached to its PID namespace (let's say PID_NS1)
>    and TASK B is the first task belonging to the new PID namespace created by
>    unshare()  (let's call it PID_NS2).
> 
> 3) Since TASK B is the first task attached to PID_NS2, it becomes the PID_NS2
>    child reaper.
> 
> 4) TASK A fork()s again and creates TASK C, which gets attached to PID_NS2.
>    Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
>    TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.
> 
> 5) TASK B exits and since it is the child reaper for PID_NS2, it has to
>    kill all other tasks attached to PID_NS2, and wait for all of them to die
>    before reaping itself (zap_pid_ns_processes()). Note it seems to make a
>    wrong assumption here, trusting that all tasks in PID_NS2 either
>    get reaped by a parent belonging to the same namespace or by TASK B.
>    And it is confident that since it deactivated the SIGCHLD handler, all
>    the remaining tasks ultimately autoreap. And it waits for that to happen.
>    However TASK C escapes that rule because it will get reaped by its parent
>    TASK A belonging to PID_NS1.
> 
> 6) TASK A calls synchronize_rcu_tasks() which leads to
>    synchronize_srcu(&tasks_rcu_exit_srcu).
> 
> 7) TASK B is waiting for TASK C to get reaped (wrongly assuming it autoreaps).
>    But TASK B is under a tasks_rcu_exit_srcu SRCU critical section
>    (exit_notify() is between exit_tasks_rcu_start() and
>    exit_tasks_rcu_finish()), blocking TASK A.
> 
> 8) TASK C exits and since TASK A is its parent, TASK C waits for TASK A to
>    reap it, but TASK A can't because it waits for TASK B, which waits for TASK C.
> 
> So there is a circular dependency:
> 
> _ TASK A waits for TASK B to get out of tasks_rcu_exit_srcu SRCU critical
> section
> _ TASK B waits for TASK C to get reaped
> _ TASK C waits for TASK A to reap it.
> 
> I have no idea how to solve the situation without violating the pid_namespace
> rules and unshare() semantics (although I wish unshare(CLONE_NEWPID) had a less
> error prone behaviour with allowing creating more than one task belonging to the
> same namespace).
> 
> So probably having an SRCU read side critical section within exit_notify() is
> not a good idea. Is there a solution to work around that for RCU tasks?
> 
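For reference, steps 1) to 4) of the summary above can be sketched from
userspace roughly as follows. This is only an illustration of the namespace
topology, not the syzkaller reproducer, and it does not trigger
synchronize_rcu_tasks(). The names pidns_topology(), struct ns_view and
report() are made up for this sketch. It assumes the caller is
single-threaded and has CAP_SYS_ADMIN; otherwise unshare() fails and the
function returns -1.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* What a child sees from inside the new PID namespace (hypothetical type). */
struct ns_view {
	pid_t pid;	/* getpid() inside PID_NS2 */
	pid_t ppid;	/* getppid() inside PID_NS2, 0 if the parent is outside */
};

/* Child side: send our view to the parent, optionally stay alive. Never returns. */
static void report(int wfd, int stay)
{
	struct ns_view v = { getpid(), getppid() };

	if (write(wfd, &v, sizeof(v)) != (ssize_t)sizeof(v))
		_exit(1);
	if (stay)
		pause();	/* TASK B stays alive until TASK A kills it */
	_exit(0);
}

/* The caller plays TASK A. Returns 0 on success, -1 on failure. */
int pidns_topology(struct ns_view *b, struct ns_view *c)
{
	int fds[2], ok;
	pid_t pid_b, pid_c;

	/* Step 1: only future children of the caller join the new namespace. */
	if (unshare(CLONE_NEWPID) != 0)
		return -1;
	if (pipe(fds) != 0)
		return -1;

	/* Steps 2 and 3: the first child (TASK B) becomes the child reaper. */
	pid_b = fork();
	if (pid_b == 0)
		report(fds[1], 1);
	ok = pid_b > 0 && read(fds[0], b, sizeof(*b)) == (ssize_t)sizeof(*b);

	/* Step 4: TASK C joins PID_NS2, but its parent is outside of it. */
	pid_c = ok ? fork() : -1;
	if (pid_c == 0)
		report(fds[1], 0);
	ok = pid_c > 0 && read(fds[0], c, sizeof(*c)) == (ssize_t)sizeof(*c);

	/* TASK A reaps TASK C itself, then kills and reaps TASK B. */
	if (pid_c > 0)
		waitpid(pid_c, NULL, 0);
	if (pid_b > 0) {
		kill(pid_b, SIGKILL);
		waitpid(pid_b, NULL, 0);
	}
	close(fds[0]);
	close(fds[1]);
	return ok ? 0 : -1;
}
```

On success, TASK B should report PID 1, and both children should report a
parent PID of 0 because TASK A is not visible from inside PID_NS2. Note that
once TASK B is gone, further fork() calls from the caller are expected to
fail, since the namespace it creates children in no longer has a child reaper.
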
  Thanks for the analysis!
  One more piece of information: I tried to revert only this commit on top of
  v6.1-rc5 mainline with a script, but the revert broke the kernel build. Since
  the revert verification step did not pass, I cannot confirm that the bisect
  result is 100% accurate. I am just providing all the information I have.

  This issue is too difficult for me.
  If I find more clues, I will follow up on this email.

  Thanks!
  BR.

> Thanks.
