From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751694Ab1GUFJu (ORCPT );
	Thu, 21 Jul 2011 01:09:50 -0400
Received: from e6.ny.us.ibm.com ([32.97.182.146]:33741 "EHLO e6.ny.us.ibm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751069Ab1GUFJt (ORCPT );
	Thu, 21 Jul 2011 01:09:49 -0400
Date: Wed, 20 Jul 2011 22:09:27 -0700
From: "Paul E. McKenney"
To: Linus Torvalds
Cc: linux-kernel@vger.kernel.org, mingo@elte.hu, laijs@cn.fujitsu.com,
	dipankar@in.ibm.com, akpm@linux-foundation.org,
	mathieu.desnoyers@polymtl.ca, josh@joshtriplett.org, niv@us.ibm.com,
	tglx@linutronix.de, peterz@infradead.org, rostedt@goodmis.org,
	Valdis.Kletnieks@vt.edu, dhowells@redhat.com, eric.dumazet@gmail.com,
	darren@dvhart.com, patches@linaro.org, greearb@candelatech.com,
	edt@aei.ca
Subject: Re: [PATCH tip/core/urgent 3/7] rcu: Streamline code produced by __rcu_read_unlock()
Message-ID: <20110721050927.GV2313@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com
References: <20110720182512.GA22946@linux.vnet.ibm.com>
 <1311186383-24819-3-git-send-email-paulmck@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 20, 2011 at 03:44:55PM -0700, Linus Torvalds wrote:
> On Wed, Jul 20, 2011 at 11:26 AM, Paul E. McKenney
> wrote:
> > Given some common flag combinations, particularly -Os, gcc will inline
> > rcu_read_unlock_special() despite its being in an unlikely() clause.
> > Use noinline to prohibit this misoptimization.
>
> Btw, I suspect that we should at least look at what it would mean if
> we make the rcu_read_lock_nesting and the preempt counters both be
> per-cpu variables instead of making them per-thread/process counters.
>
> Then, when we switch threads, we'd just save/restore them from the
> process register save area.
>
> There's a lot of critical code sequences (spin-lock/unlock, rcu
> read-lock/unlock) that currently fetches the thread/process pointer
> only to then offset it and increment the count. I get the strong
> feeling that code generation could be improved and we could avoid one
> level of indirection by just making it a per-thread counter.
>
> For example, instead of __rcu_read_lock: looking like this (and being
> an external function, partly because of header file dependencies on
> the data structures involved):
>
>	push   %rbp
>	mov    %rsp,%rbp
>	mov    %gs:0xb580,%rax
>	incl   0x100(%rax)
>	leaveq
>	retq
>
> it should inline to just something like
>
>	incl   %gs:0x100
>
> instead. Same for the preempt counter.
>
> Of course, it would need to involve making sure that we pick a good
> cacheline etc that is already always dirty. But other than that, is
> there any real downside?

We would need a form of per-CPU variable access that generated efficient
code, but that didn't complain about being used when preemption was
enabled.  __this_cpu_add_4() might do the trick, but I haven't dug
fully through it yet.

							Thanx, Paul