From: Andrea Parri <andrea.parri@amarulasolutions.com>
To: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: akpm@linux-foundation.org, peterz@infradead.org, oleg@redhat.com,
viro@zeniv.linux.org.uk, mingo@kernel.org,
paulmck@linux.vnet.ibm.com, keescook@chromium.org,
riel@redhat.com, mhocko@suse.com, tglx@linutronix.de,
kirill.shutemov@linux.intel.com, marcos.souza.org@gmail.com,
hoeun.ryu@gmail.com, pasha.tatashin@oracle.com,
gs051095@gmail.com, ebiederm@xmission.com, dhowells@redhat.com,
rppt@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
Alan Stern <stern@rowland.harvard.edu>,
Will Deacon <will.deacon@arm.com>,
Boqun Feng <boqun.feng@gmail.com>
Subject: Re: [PATCH 4/4] exit: Lockless iteration over task list in mm_update_next_owner()
Date: Thu, 26 Apr 2018 14:35:42 +0200
Message-ID: <20180426123542.GA819@andrea>
In-Reply-To: <152474046779.29458.5294808258041953930.stgit@localhost.localdomain>
Hi Kirill,
On Thu, Apr 26, 2018 at 02:01:07PM +0300, Kirill Tkhai wrote:
> The patch finalizes the series and makes mm_update_next_owner()
> to iterate over task list using RCU instead of tasklist_lock.
> This is possible because of rules of inheritance of mm: it may be
> propagated to a child only, while only kernel thread can obtain
> someone else's mm via use_mm().
>
> Also, all new tasks are added to tail of tasks list or threads list.
> The only exception is transfer_pid() in de_thread(), when group
> leader is replaced by another thread. But transfer_pid() is called
> in case of successful exec only, where new mm is allocated, so it
> can't be interesting for mm_update_next_owner().
>
> This patch uses alloc_pid() as a memory barrier, and it's possible
> since it contains two or more spin_lock()/spin_unlock() pairs.
> Single pair does not imply a barrier, while two pairs do imply that.
>
> There are three barriers:
>
> 1)for_each_process(g)            copy_process()
>                                    p->mm = mm
>     smp_rmb();                     smp_wmb() implied by alloc_pid()
>     if (g->flags & PF_KTHREAD)     list_add_tail_rcu(&p->tasks, &init_task.tasks)
>
> 2)for_each_thread(g, c)          copy_process()
>                                    p->mm = mm
>     smp_rmb();                     smp_wmb() implied by alloc_pid()
>     tmp = READ_ONCE(c->mm)         list_add_tail_rcu(&p->thread_node, ...)
>
> 3)for_each_thread(g, c)          copy_process()
>                                    list_add_tail_rcu(&p->thread_node, ...)
>     p->mm != NULL check            do_exit()
>     smp_rmb()                        smp_mb();
>     get next thread in loop          p->mm = NULL
>
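
As an aside, the pairing in cases 1) and 2) above has the usual
message-passing shape. A minimal litmus sketch of it follows, assuming
that alloc_pid() really does provide the smp_wmb() (which is the point
under discussion). The test name and the variables "mm" and "linked" are
made-up stand-ins for p->mm and for the list linkage, not anything taken
from the tree:

```c
C MP+wmb+rmb

(*
 * Hypothetical sketch, to be checked with herd7 against the LKMM.
 *
 * P0 stands for copy_process(): it initializes the mm, then links
 * the task; the smp_wmb() is the barrier assumed from alloc_pid().
 * P1 stands for the lockless iteration: it finds the task on the
 * list, then reads its mm.
 *)

{}

P0(int *mm, int *linked)
{
	WRITE_ONCE(*mm, 1);
	smp_wmb();
	WRITE_ONCE(*linked, 1);
}

P1(int *mm, int *linked)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*linked);
	smp_rmb();
	r1 = READ_ONCE(*mm);
}

exists (1:r0=1 /\ 1:r1=0)
```

The "exists" clause describes the bad outcome: the task is seen on the
list but its mm is still seen uninitialized. With an explicit smp_wmb()
and smp_rmb() as written here, that outcome should be forbidden; whether
alloc_pid() can stand in for the smp_wmb() is a separate question.
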
>
> This patch may be useful for machines with many processes executing.
> I regularly observe mm_update_next_owner() executing on one of the cpus
> in crash dumps (not related to this function) on big machines. Even
> if iteration over task list looks as unlikely situation, its regularity
> grows with the growth of containers/processes numbers.
>
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> ---
> kernel/exit.c | 39 +++++++++++++++++++++++++++++++++++----
> kernel/fork.c | 1 +
> kernel/pid.c | 5 ++++-
> 3 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 40f734ed1193..7ce4cdf96a64 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -406,6 +406,8 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
> void mm_update_next_owner(struct mm_struct *mm)
> {
> struct task_struct *c, *g, *p = current;
> + struct mm_struct *tmp;
> + struct list_head *n;
>
> retry:
> /*
> @@ -440,21 +442,49 @@ void mm_update_next_owner(struct mm_struct *mm)
> if (c->mm == mm)
> goto new_owner;
> }
> + read_unlock(&tasklist_lock);
>
> /*
> * Search through everything else, we should not get here often.
> */
> + rcu_read_lock();
> for_each_process(g) {
> + /*
> + * g->signal, g->mm and g->flags initialization of a just
> + * created task must not reorder with linking the task to
> + * tasks list. Pairs with smp_mb() implied by alloc_pid().
> + */
> + smp_rmb();
> if (g->flags & PF_KTHREAD)
> continue;
> for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto new_owner;
> - if (c->mm)
> + /*
> + * Make visible mm of iterated thread.
> + * Pairs with smp_mb() implied by alloc_pid().
> + */
> + if (c != g)
> + smp_rmb();
> + tmp = READ_ONCE(c->mm);
> + if (tmp == mm)
> + goto new_owner_nolock;
> + if (likely(tmp))
> break;
> + n = READ_ONCE(c->thread_node.next);
> + /*
> + * All mm are NULL, so iterated threads already exited.
> + * Make sure we see their children.
> + * Pairs with smp_mb() in do_exit().
> + */
> + if (n == &g->signal->thread_head)
> + smp_rmb();
> }
> + /*
> + * Children of exited thread group are visible due to the above
> + * smp_rmb(). Threads with mm != NULL can't create a child with
> + * the mm we're looking for. So, no additional smp_rmb() needed.
> + */
> }
> - read_unlock(&tasklist_lock);
> + rcu_read_unlock();
> /*
> * We found no owner yet mm_users > 1: this implies that we are
> * most likely racing with swapoff (try_to_unuse()) or /proc or
> @@ -466,6 +496,7 @@ void mm_update_next_owner(struct mm_struct *mm)
> new_owner:
> rcu_read_lock();
> read_unlock(&tasklist_lock);
> +new_owner_nolock:
> BUG_ON(c == p);
>
> /*
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a5d21c42acfc..2032d4657546 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1805,6 +1805,7 @@ static __latent_entropy struct task_struct *copy_process(
> goto bad_fork_cleanup_io;
>
> if (pid != &init_struct_pid) {
> + /* Successfuly returned, this function imply smp_mb() */
> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 157fe4b19971..cb96473aa058 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -155,7 +155,10 @@ void free_pid(struct pid *pid)
>
> call_rcu(&pid->rcu, delayed_put_pid);
> }
> -
> +/*
> + * This function contains at least two sequential spin_lock()/spin_unlock(),
> + * and together they imply full memory barrier.

Mmh, it's possible that I am misunderstanding this statement, but it does
not seem quite correct to me: a counter-example is provided by the test at
"tools/memory-model/litmus-tests/SB+mbonceonces.litmus", after replacing
either of the smp_mb() with the sequence:

	spin_lock(s); spin_unlock(s); spin_lock(s); spin_unlock(s);

BTW, your commit message suggests that your case would work with "imply
an smp_wmb()".  This implication should hold "w.r.t. current
implementations".  We (LKMM people) discussed changes to the LKMM to make
it hold in the LKMM as well, but such changes are still on our TODO list
as of today...

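For illustration, here is a sketch of the modified store-buffering test
described above, with the smp_mb() in P0 replaced by two lock/unlock
pairs. The test name and the lock variable "s" are labels chosen for this
sketch; only the overall shape follows SB+mbonceonces.litmus:

```c
C SB+locks+mb

(*
 * Hypothetical sketch, to be checked with herd7 against the LKMM.
 *
 * If two back-to-back lock/unlock pairs implied a full barrier,
 * the "exists" clause below could never be satisfied, exactly as
 * in the original test with smp_mb() in both processes.
 *)

{}

P0(int *x, int *y, spinlock_t *s)
{
	int r0;

	WRITE_ONCE(*x, 1);
	spin_lock(s);
	spin_unlock(s);
	spin_lock(s);
	spin_unlock(s);
	r0 = READ_ONCE(*y);
}

P1(int *x, int *y)
{
	int r0;

	WRITE_ONCE(*y, 1);
	smp_mb();
	r0 = READ_ONCE(*x);
}

exists (0:r0=0 /\ 1:r0=0)
```

The claim above is that the model does not forbid this outcome, i.e. that
the LKMM does not treat the lock sequence as a full barrier; running the
sketch through herd7 would be the way to confirm that.
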
Andrea
> + */
> struct pid *alloc_pid(struct pid_namespace *ns)
> {
> struct pid *pid;
>