From: Oleg Nesterov
To: Peter Zijlstra
Cc: paulmck@linux.vnet.ibm.com, tj@kernel.org, mingo@redhat.com,
    linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
    torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 5/5] percpu-rwsem: Optimize readers and reduce global impact
Date: Fri, 29 May 2015 21:45:34 +0200
Message-ID: <20150529194534.GA31860@redhat.com>
In-Reply-To: <20150526120215.042527659@infradead.org>
References: <20150526114356.609107918@infradead.org> <20150526120215.042527659@infradead.org>

Sorry for the delay; I finally found the time to read this series...

The code matches our previous discussion and I believe it is correct.

Reviewed-by: Oleg Nesterov

Just one nit below.

On 05/26, Peter Zijlstra wrote:
>
>  struct percpu_rw_semaphore {
> -	unsigned int __percpu	*fast_read_ctr;
> -	atomic_t		write_ctr;
> +	unsigned int __percpu	*refcount;
> +	int			state;

...

> +enum { readers_slow, readers_block };

Now that we rely on rss_sync() and thus we do not have "readers_fast",
I think that "bool reader_should_block" would look better.

> +void percpu_down_write(struct percpu_rw_semaphore *sem)
> {

... so it does

	rcu_sync_enter(&sem->rss);
	state = BLOCK;
	mb();
	wait_event(sem->writer, readers_active_check(sem));

and this looks correct. The only nontrivial thing we need to ensure is
that per_cpu_sum(*sem->refcount) == 0 can't be a false positive. A false
negative is fine.
And this means that if we see the result of this_cpu_dec() we must not
miss the result of the previous this_cpu_inc() on the same or _another_
CPU.

And this is true because if the reader does dec() on another CPU, it
does up_read(), and this is only possible if down_read() didn't see
state == BLOCK. But if it didn't see state == BLOCK, then the writer
must see the result of the previous down_read()->inc().

IOW, we just rely on STORE-MB-LOAD; it is just that the writer does the
LOAD multiple times, in per_cpu_sum():

	DOWN_WRITE:			DOWN_READ on CPU X:

	state = BLOCK;			refcount[X]++;
	mb();				mb();
	sum = 0;			if (state != BLOCK)
	sum += refcount[0];			return; /* success */
	sum += refcount[1];		refcount[X]--;
	...
	sum += refcount[NR_CPUS];

If the reader wins and takes the lock, then its addition to refcount[X]
must be accounted by the writer. The writer can obviously miss a dec()
from the reader, but we rely on wake_up/wait_event in that case.

Oleg.