From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:58174 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726758AbeJBKmF (ORCPT ); Tue, 2 Oct 2018 06:42:05 -0400 Received: from pps.filterd (m0098404.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w923xTSq066863 for ; Tue, 2 Oct 2018 00:00:56 -0400 Received: from e17.ny.us.ibm.com (e17.ny.us.ibm.com [129.33.205.207]) by mx0a-001b2d01.pphosted.com with ESMTP id 2mv0xsrhah-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Tue, 02 Oct 2018 00:00:56 -0400 Received: from localhost by e17.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 2 Oct 2018 00:00:54 -0400 Date: Mon, 1 Oct 2018 21:00:51 -0700 From: "Paul E. McKenney" Subject: Re: Question of usage of per_thread() Reply-To: paulmck@linux.ibm.com References: <71206929-11cd-73bc-560e-f982c7e86cf8@gmail.com> <20180916150849.GW652@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Message-Id: <20181002040051.GS4222@linux.ibm.com> Sender: perfbook-owner@vger.kernel.org List-ID: To: Akira Yokosawa Cc: perfbook@vger.kernel.org On Tue, Oct 02, 2018 at 07:48:15AM +0900, Akira Yokosawa wrote: > On 2018/09/17 0:08, Paul E. McKenney wrote: > > On Sun, Sep 16, 2018 at 05:37:23PM +0900, Akira Yokosawa wrote: > >> Hi Paul, > >> > >> Every time I review code under CodeSamples/, > >> I find myself confused where to use READ_ONCE/WRITE_ONCEs. > >> > >> I'm looking at Listing 5.3 of current master. > >> > >> There are two cases which lack READ_ONCE/WRITE_ONCE to access potentially > >> shared variables, namely on line 5 (__get_thread_var(counter)++;) and > >> on line 14 (sum += per_thread(counter, t);). > >> > >> Line 5 looks like a good candidate to be optimized out when inlined. > >> But the performance result indicates "gcc -O3" keeps it inside the loop. > >> > >> Is this because the definition of __get_thread_var() contains > >> a call to smp_thread_id() and complicated enough not to be optimized > >> out? > >> > >> As for line 14, as per_thread() was derived from per_cpu() of kernel > >> API, I looked for call sites of per_cpu() in the kernel source tree. > >> > >> There are very few cases where READ_ONCE/WRITE_ONCE is used along > >> with per_cpu(). There are two READ_ONCEs with per_cpu() in > >> kernel/rcu/srcutree.c, whose author is none other than you. > >> Are those READ_ONCEs necessary? > >> > >> I don't grasp the actual definition of per_cpu() macro. > >> Definition of per_thread() macro under CodeSamples/api-pthreads/ > >> does not look so complicated, but contains array indexing, > >> which might be good enough to prevent optimization in the loop. > >> > >> I'm not sure, but my gut feeling is that READ_ONCE/WRITE_ONCE > >> is necessary to access an unannotated variable. If we need > >> volatility for sure, we could modify the definition of annotating > >> macros/functions. > >> > >> Can you enlighten me? > > > > Yes, I do need to expand on this, both in perfbook and in order to inspect > > my usage in Linux-kernel RCU. Please see below for my current outline, > > and any and all feedback would be deeply appreciated! > > > > So, I applied your outline to accesses in Listing 5.3. > See inline comments below for my observation. > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > > > Plain accesses, that is, without either READ_ONCE() or WRITE_ONCE(): > > o Only accessed under lock. > > o Only accessed by owning CPU/thread. > > o Only modified by owning CPU/thread under lock, and otherwise > > accessed only either while holding lock or by owning CPU/thread. > > > > WRITE_ONCE() for modifications, READ_ONCE() for non-owning CPUs/threads > > not holding lock: > > o Only modified while holding lock by owning CPU/thread, > > could be read from anywhere by anyone. > > > > WRITE_ONCE() for modifications, READ_ONCE() for CPUs/threads not holding > > lock: > > o Only modified while holding lock, could be read anywhere by > > anyone. > > > > WRITE_ONCE() for modifications, READ_ONCE() for non-owning > > CPUs/threads: > > o Only modified by owning CPU/threads, could be read anywhere > > by anyone. > > Listing 5.3's per-thread variable "counter" looks covered here. > So the write part of the increment on line 5 needs WRITE_ONCE(). > Read access on line 14 also needs READ_ONCE(). > > Read part of the increment doesn't need READ_ONCE(), I guess. > > Line 5 would look like: > > WRITE_ONCE(__get_thread_var(counter), __get_thread_var(counter) + 1); > > . Considering the cost of __get_thread_var(), there should be a room > to help compilers to optimize by using a local pointer as follows: > > static __inline__ void inc_count(void) > { > long *p_counter = &__get_thread_var(counter); > > WRITE_ONCE(*p_counter, *p_counter + 1); > } > > . Does this look reasonable to you? Or is it better to also use > READ_ONCE()? Yes, this is fine. You don't need READ_ONCE() because threads only modify the counters that they own, and this case is a thread reading its own counter -- thus concurrent update cannot happen. > But if __get_thread_var() does imply volatility, we can keep the > current way of increment: > > __get_thread_var(counter)++; Unfortunately, it does not, at least not the way that it is currently implemented. > Thoughts? Your taking the pointer and using WRITE_ONCE() looks good! Thanx, Paul > Thanks, Akira > > > > > Otherwise, READ_ONCE() and WRITE_ONCE() everywhere. > > > > But note that READ_ONCE() and WRITE_ONCE() provide absolutely no > > ordering guarantees unless: > > > > o There is only one variable being accesses, and all of those > > accesses are of the same size and alignment. > > > > o There is only one CPU/thread doing the accesses to all > > of the variables during the timeframe of interest. > > > > o There are other operations involved, for example, smp_mb(). > > >