From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753997AbcGTRP6 (ORCPT <rfc822;w@1wt.eu>);
	Wed, 20 Jul 2016 13:15:58 -0400
Received: from mx1.redhat.com ([209.132.183.28]:51358 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752583AbcGTRPv (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 20 Jul 2016 13:15:51 -0400
Date: Wed, 20 Jul 2016 19:16:03 +0200
From: Oleg Nesterov <oleg@redhat.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>, mingo@kernel.org,
        linux-kernel@vger.kernel.org, tj@kernel.org, john.stultz@linaro.org,
        dimitrysh@google.com, romlem@google.com, ccross@google.com,
        tkjos@google.com
Subject: Re: [PATCH] rcu_sync: simplify the state machine, introduce
	__rcu_sync_enter()
Message-ID: <20160720171602.GA5059@redhat.com>
References: <20160714184351.GA18388@redhat.com> <20160714192018.GM30154@twins.programming.kicks-ass.net> <20160715132709.GA27644@redhat.com> <20160715133928.GH7094@linux.vnet.ibm.com> <20160715134523.GB28589@redhat.com> <20160715153840.GI7094@linux.vnet.ibm.com> <20160715164938.GA4294@redhat.com> <20160715180140.GM7094@linux.vnet.ibm.com> <20160716171007.GA30000@redhat.com> <20160719205018.GB7094@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160719205018.GB7094@linux.vnet.ibm.com>
User-Agent: Mutt/1.5.18 (2008-05-17)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.31]); Wed, 20 Jul 2016 17:15:50 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Paul, I had to switch to internal bugzillas after the first email, and now
I feel I can't read... I'll try to answer one question right now, tomorrow
I'll reread your email, probably I need to answer something else...

On 07/19, Paul E. McKenney wrote:
>
> On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote:
>
> > And, there is another possible transition, GP_ENTER -> GP_IDLE, because
> > not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any
> > state (except obviously they should be balanced), and they do not block.
...
> If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state,
> could you please tell me your use case?  Or am I confused?

See below,

> And I think __rcu_sync_enter() can have more users. Let's look at
> freeze_super(). It calls percpu_down_write() 3 times, and waits for 3 GP's
> sequentally.
>
> > Now we can add 3 __rcu_sync_enter's at the start and 3 rcu_sync_exit's at
> > the end (actually we can do better, just to simplify). And again, note
> > that rcu_sync_exit() will work correctly even if we (say) return -EBUSY,
> > so rcu_sync_wait and/or percpu_down_write() was not called in between,
> > and in this case we won't block waiting for GP.
> >
>
> I am not going to claim to understand freeze_super(), but it does seem
> to have a fair amount of waiting.
>
> But yes, you could put rcu_sync_enter()
                         ^^^^^^^^^^^^^^
__rcu_sync_enter, see below,

> and rcu_sync_exit() before and
> after a series of write-side enter/exit pairs in order to force things
> to stay in writer mode, if that is what you are suggesting.

No, no, this is not what I am trying to suggest.

The problem is that freeze_super() takes 3 semaphores for writing in row,
this means that it needs to wait for 3 GP's sequentally, and it does this
with sb->s_umount held. This is just ugly.

OK, lets suppose it simply does

	freeze_super(sb)
	{
		down_write(&sb->s_umount);
		if (NEED_TO_FREEZE) {
			percpu_down_write(SEM1);
			percpu_down_write(SEM2);
			percpu_down_write(SEM3);
		}
		up_write(&sb->s_umount);
	}

and every percpu_down_write() waits for GP.

Now, suppose we add the additional enter/exit's:

	freeze_super(sb)
	{
		// this doesn't block
		__rcu_sync_enter(SEM3);
		__rcu_sync_enter(SEM2);
		__rcu_sync_enter(SEM1);

		down_write(&sb->s_umount);
		if (NEED_TO_FREEZE) {
			percpu_down_write(SEM1);
			percpu_down_write(SEM2);
			percpu_down_write(SEM3);
		}
		up_write(&sb->s_umount);

		rcu_sync_exit(SEM1);
		rcu_sync_exit(SEM2);
		rcu_sync_exit(SEM3);
		
	}

Again, actually we can do better, just to simplify.

Now. the fisrt percpu_down_write(SEM1) can block waiting for GP or not,
this depends on how many time it spends in down_write().

But the 2nd and the 3rd percpu_down_write() most likely won't block, so
in the likely case freeze_super() will need a single GP pass.

And note that NEED_TO_FREEZE can be false, in this case rcu_sync_exit()
will be called in GP_ENTER state.


To some degree, this is like get_state_synchronize_rcu/cond_synchronize_rcu.
But obviously percpu_down_write() can not use these helpers, and in this
particular case __rcu_sync_enter() is better because it forces the start
of GP pass.

-----------------------------------------------------------------------
As for cgroups, we want to switch cgroup_threadgroup_rwsem into the
slow mode, at least for now.

We could add the additional hooks/hacks into rcu/sync.c but why? We can
do this without any changes outside of cgroup.c right now, just add
rcu_sync_enter() into cgroup_init().

But we do not want to add a pointless synchronize_sched() at boot time,
__rcu_sync_enter() looks much better.

------------------------------------------------------------------------
And even __cgroup_procs_write() could use __rcu_sync_enter(). But lets
ignore this for now.

Oleg.