From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 18 Sep 2015 10:51:56 +0200
From: Peter Zijlstra
To: Oleg Nesterov
Cc: Dmitry Vyukov, ebiederm@xmission.com, Al Viro, Andrew Morton,
	Ingo Molnar, Paul McKenney, mhocko@suse.cz, LKML,
	ktsan@googlegroups.com, Kostya Serebryany, Andrey Konovalov,
	Alexander Potapenko, Hans Boehm
Subject: Re: [PATCH] kernel: fix data race in put_pid
Message-ID: <20150918085156.GS3816@twins.programming.kicks-ass.net>
References: <1442496268-47803-1-git-send-email-dvyukov@google.com>
	<20150917160837.GA26050@redhat.com>
	<20150917174456.GA30178@redhat.com>
	<20150917180919.GA32116@redhat.com>
In-Reply-To: <20150917180919.GA32116@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 17, 2015 at 08:09:19PM +0200, Oleg Nesterov wrote:
> On 09/17, Dmitry Vyukov wrote:
> >
> > I can update the patch description, but let me explain it here first.
>
> Yes thanks.
> > Here is the essence of what happens:
>
> Aha, so you really meant that 2 put_pid's can race with each other,
>
> > // thread 1
> > 1: pid->foo = 1; // foo is the first word of pid object
> > // then it does put_pid
> > 2: atomic_dec_and_test(&pid->count) // decrements count to 1 and
> >    returns false so the function returns
> >
> > // thread 2
> > // executes put_pid
> > 3: atomic_load(&pid->count); // returns 1, so proceed to kmem_cache_free
> > // then kmem_cache_free does:
> > 4: *(void**)pid = head->freelist;
> > 5: head->freelist = (void*)pid;
> >
> > This can be executed as:
> >
> > 4: *(void**)pid = head->freelist;
> > 1: pid->foo = 1; // foo is the first word of pid object
> > 2: atomic_dec_and_test(&pid->count) // decrements count to 1 and
> >    returns false so the function returns
> > 3: atomic_load(&pid->count); // returns 1, so proceed to kmem_cache_free
> > 5: head->freelist = (void*)pid;
>
> Unless I am totally confused, everything is simpler. We can forget
> about the hoisting, freelist, etc.
>
> Thread 2 can see the result of atomic_dec_and_test(), but not the
> result of "pid->foo = 1". In this case it can free the object, which
> can be re-allocated _before_ STORE(pid->foo) completes. Of course,
> this would be really bad.
>
> I need to recheck, but afaics this is not possible. This optimization
> is fine, but probably needs a comment. For sure, this code doesn't
> make any sense to me. We rely on delayed_put_pid() called by RCU.
> And note that nobody can write to this pid after it is removed from
> the rcu-protected list.
>
> So I think this is false alarm, but I'll try to recheck tomorrow, it
> is too late for me today.
As an alternative patch, could we not do:

void put_pid(struct pid *pid)
{
	struct pid_namespace *ns;

	if (!pid)
		return;

	ns = pid->numbers[pid->level].ns;
	if ((atomic_read(&pid->count) == 1) ||
	     atomic_dec_and_test(&pid->count)) {
+		smp_read_barrier_depends(); /* ctrl-dep */
		kmem_cache_free(ns->pid_cachep, pid);
		put_pid_ns(ns);
	}
}

That would upgrade the atomic_read() path to a full READ_ONCE_CTRL(),
and thereby prevent any of the kmem_cache_free() stores from leaking
out. And it's free, except on Alpha. Whereas the atomic_read_acquire()
would generate a full memory barrier on a whole bunch of archs.