From: Oleg Nesterov
To: Peter Zijlstra
Cc: paulmck@linux.vnet.ibm.com, tj@kernel.org, mingo@redhat.com,
    linux-kernel@vger.kernel.org, der.herr@hofr.at, dave@stgolabs.net,
    torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 5/5] percpu-rwsem: Optimize readers and reduce global impact
Date: Fri, 29 May 2015 21:45:34 +0200
Message-ID: <20150529194534.GA31860@redhat.com>
In-Reply-To: <20150526120215.042527659@infradead.org>
References: <20150526114356.609107918@infradead.org> <20150526120215.042527659@infradead.org>

Sorry for the delay; I finally found the time to read this series...

The code matches our previous discussion and I believe it is correct.

Reviewed-by: Oleg Nesterov

Just one nit below.

On 05/26, Peter Zijlstra wrote:
>
>  struct percpu_rw_semaphore {
> -	unsigned int __percpu	*fast_read_ctr;
> -	atomic_t		write_ctr;
> +	unsigned int __percpu	*refcount;
> +	int			state;

...

> +enum { readers_slow, readers_block };

Now that we rely on rss_sync() and thus we do not have "readers_fast",
I think that "bool reader_should_block" would look better.

> +void percpu_down_write(struct percpu_rw_semaphore *sem)
> {

... so it does

	rcu_sync_enter(&sem->rss);
	state = BLOCK;
	mb();
	wait_event(sem->writer, readers_active_check(sem));

and this looks correct. The only nontrivial thing we need to ensure is
that per_cpu_sum(*sem->refcount) == 0 can't be a false positive. A false
negative is fine.
And this means that if we see the result of this_cpu_dec() we must not
miss the result of the previous this_cpu_inc() on the same or _another_
CPU.

And this is true because if the reader does dec() on another CPU, it
does up_read(), and this is only possible if down_read() didn't see
state == BLOCK. But if it didn't see state == BLOCK, then the writer
must see the result of the previous down_read()->inc().

IOW, we just rely on STORE-MB-LOAD; it is just that the writer does the
LOAD multiple times, in per_cpu_sum():

	DOWN_WRITE:			DOWN_READ on CPU X:

	state = BLOCK;			refcount[X]++;
	mb();				mb();
	sum = 0;			if (state != BLOCK)
	sum += refcount[0];			return; /* success */
	sum += refcount[1];		refcount[X]--;
	...
	sum += refcount[NR_CPUS];

If the reader wins and takes the lock, then its addition to refcount[X]
must be accounted by the writer. The writer can obviously miss a dec()
from the reader, but we rely on wake_up/wait_event in that case.

Oleg.