Date: Wed, 18 Nov 2015 14:39:02 -0800
From: "Paul E. McKenney"
Reply-To: paulmck@linux.vnet.ibm.com
To: linux-kernel@vger.kernel.org
Cc: fweisbec@gmail.com, luto@amacapital.net, peterz@infradead.org,
	riel@redhat.com, torvalds@linux-foundation.org
Subject: Belated notes from LKS context-tracking session
Message-ID: <20151118223902.GA27326@linux.vnet.ibm.com>

The issue for this session was that NO_HZ_FULL slows down kernel/user
transitions too much, and this needs to be fixed.  In addition, the
current context-tracking code is considered to be more opaque than
necessary, so it needs to be adjusted appropriately, whether that be
commented, rewritten, or whatever.

One approach suggested earlier was to shift code around in order to
minimize the number of interrupt enable/disable pairs.  This did result
in some improvement, but only a few percent worth.

Earlier investigation by Rik van Riel (not present) indicated that
timestamp computations were contributing the bulk of the overhead.
Rik has been working on a remote-sampling approach to eliminate this
overhead on worker CPUs.
Peter Zijlstra expressed concern that this would be problematic on
systems with large numbers of CPUs and small numbers of housekeeping
CPUs.  This sparked considerable discussion about the overhead of the
x86 rdtsc instruction.  It was also suggested that the jiffies counter
would work just fine for many use cases.  This counter would remain in
cache for one jiffy, so that workloads with extreme system-call rates
would enjoy very-low-overhead timestamps.  After some post-meeting
discussion, there seemed to be significant support for this approach,
at least within the small post-meeting group.

Discussion then turned to the possibility of merging RCU's
nohz_full-CPU tracking with the context-tracking code.  I noted that
RCU needs to detect not just whether a given CPU is currently in
nohz_full mode, but also whether that CPU was ever in that mode over
a given time interval.  This is currently accomplished using a counter
that is incremented on entry to and exit from nohz_full mode.  If the
counter has an even value, then the CPU is in nohz_full mode.  If the
counter changes over some period of time, then the CPU had to have
been in nohz_full mode at some point during that period of time.

[ Just for the record, I incorrectly accused Linus of having written
  the counter comparison.  It is instead the counter comparisons for
  rcu_barrier() and synchronize_rcu_expedited() that I can blame on
  Linus. ]

This was followed by a call for lower-overhead maintenance of the above
counter, by weakening the associated memory barriers to something like
the lwsync instruction on PowerPC, which of course maps to the
instruction-free barrier() macro on x86 and other total-store-order
(TSO) systems.

This was the subject of some spirited post-meeting discussions.  I was
able to demonstrate [*] that at least one of the four associated memory
barriers needs to remain a full smp_mb(), but was unable to come up
with a similar on-the-spot demonstration for the other three.
Some detailed documentation of RCU's memory-ordering requirements
therefore appears to be needed sooner rather than later, given the
severe penalties for weakening RCU-related barriers too much.

							Thanx, Paul

------------------------------------------------------------------------

[*] Memory-barrier demonstration

There are four of these memory barriers:

1.	Before counter increment when entering nohz_full mode (or idle).
2.	After counter increment when entering nohz_full mode (or idle).
3.	Before counter increment when exiting nohz_full mode (or idle).
4.	After counter increment when exiting nohz_full mode (or idle).

Currently, #1 and #2 are combined into one atomic increment, and #3 and
#4 are combined into another atomic increment.

It is easy to show that #4 must remain a full barrier, because if it
were not, the following sequence of events could occur:

a.	CPU 0 loads the counter, preparing to exit nohz_full mode.

b.	CPU 0 loads a pointer to an RCU-protected data item.
	(Even x86 allows this to be reordered with (e) below.)

c.	CPU 1 removes that data item.

d.	CPU 1 does a grace period, but sees that CPU 0's counter
	still indicates that it is in nohz_full mode.

e.	CPU 0 increments and stores the counter.  (This was reordered
	with (b) above, which is allowed on x86.)

f.	CPU 1 completes its grace period, and frees the data item.

g.	CPU 0 continues accessing the RCU-protected data item.  Boom!!!

I will document RCU's memory-ordering constraints to see if #1-#3 can
be weakened.
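
For readers who want something concrete, the even/odd counter and the
four barrier positions discussed above can be sketched in userspace C11.
This is only an illustration under stated assumptions, not the kernel's
code: the struct and every tracking_*() name are hypothetical, C11
atomic_thread_fence(memory_order_seq_cst) stands in for smp_mb() without
being an exact equivalent, and all four barriers are kept at full
strength rather than weakened.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; the kernel's real structure differs. */
struct cpu_tracking {
	atomic_uint counter;	/* even: nohz_full/idle, odd: in kernel */
};

/* Assumption: the CPU starts out running in the kernel, hence odd. */
static inline void tracking_init(struct cpu_tracking *t)
{
	atomic_init(&t->counter, 1);
}

/* Entry to nohz_full (or idle): counter goes from odd to even. */
static inline void tracking_enter_nohz(struct cpu_tracking *t)
{
	unsigned int old;

	atomic_thread_fence(memory_order_seq_cst);	/* barrier #1 */
	old = atomic_fetch_add_explicit(&t->counter, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier #2 */
	assert(old & 1);	/* must have been in the kernel */
	(void)old;
}

/* Exit from nohz_full (or idle): counter goes from even to odd. */
static inline void tracking_exit_nohz(struct cpu_tracking *t)
{
	unsigned int old;

	atomic_thread_fence(memory_order_seq_cst);	/* barrier #3 */
	old = atomic_fetch_add_explicit(&t->counter, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* barrier #4 */
	assert(!(old & 1));	/* must have been in nohz_full/idle */
	(void)old;
}

/* Grace-period side: sample the remote CPU's counter. */
static inline unsigned int tracking_snapshot(struct cpu_tracking *t)
{
	return atomic_load_explicit(&t->counter, memory_order_seq_cst);
}

/* An even value means the CPU is in nohz_full mode (or idle). */
static inline bool tracking_in_nohz(unsigned int snap)
{
	return (snap & 1) == 0;
}

/*
 * True if the CPU is in nohz_full mode now, or if the counter has
 * changed since the snapshot, in which case the CPU must have been
 * in that mode at some point in between.
 */
static inline bool tracking_was_in_nohz(struct cpu_tracking *t,
					unsigned int snap)
{
	unsigned int cur = tracking_snapshot(t);

	return tracking_in_nohz(cur) || cur != snap;
}
```

In this sketch the grace-period side would call tracking_snapshot() at
the start of a grace period and tracking_was_in_nohz() later on.  The
fence marked "barrier #4" is the one that keeps the pointer load in
step (b) from being reordered ahead of the counter update in step (e).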