Date: Wed, 18 Nov 2015 14:39:02 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: fweisbec@gmail.com, luto@amacapital.net, peterz@infradead.org,
        riel@redhat.com, torvalds@linux-foundation.org
Subject: Belated notes from LKS context-tracking session
Message-ID: <20151118223902.GA27326@linux.vnet.ibm.com>
Reply-To: paulmck@linux.vnet.ibm.com

The issue for this session was that NO_HZ_FULL slows down kernel/user
transitions too much, and this needs to be fixed.  In addition, the
current context-tracking code is considered to be more opaque than
necessary, so it needs to be adjusted appropriately, whether that be
commented, rewritten, or whatever.

One approach suggested earlier was to shift code around in order to
minimize the number of interrupt enable/disable pairs.  This did result
in some improvement, but only a few percent worth.

Earlier investigation by Rik van Riel (not present) indicated that
timestamp computations were contributing the bulk of the overhead.
Rik has been working on a remote-sampling approach to eliminate this
overhead on worker CPUs.  Peter Zijlstra expressed concern that this
would be problematic on systems with large numbers of CPUs and small
numbers of housekeeping CPUs.

This sparked considerable discussion about the overhead of the x86 rdtsc
instruction.  It was also suggested that the jiffies counter would work
just fine for many use cases.  This counter would remain in cache for
one jiffy, so that workloads with extreme system-call rates would enjoy
very-low-overhead timestamps.  After some post-meeting discussion, there
seemed to be significant support for this approach, at least within the
small post-meeting group.
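
A minimal user-space sketch of the jiffies-based idea, just to make the
cost argument concrete.  Everything here is hypothetical: fake_jiffies
stands in for the kernel's jiffies counter, and struct cpu_acct and the
acct_*() helpers are illustrative names, not existing kernel interfaces.
Each transition costs one relaxed load of a shared counter plus a little
arithmetic, and a transition pair completing within a single tick charges
zero ticks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's jiffies counter. */
static _Atomic uint64_t fake_jiffies;

/* Called once per tick by whichever CPU does the timekeeping. */
static void tick(void)
{
	atomic_fetch_add_explicit(&fake_jiffies, 1, memory_order_relaxed);
}

/* Hypothetical per-CPU accounting state, in units of ticks. */
struct cpu_acct {
	uint64_t last_stamp;	/* tick count at the last transition */
	uint64_t user_ticks;
	uint64_t kernel_ticks;
};

/* Return the ticks elapsed since the last transition and restamp. */
static uint64_t acct_elapsed(struct cpu_acct *a)
{
	uint64_t now = atomic_load_explicit(&fake_jiffies,
					    memory_order_relaxed);
	uint64_t delta = now - a->last_stamp;

	assert(now >= a->last_stamp);	/* the tick counter never goes backwards */
	a->last_stamp = now;
	return delta;
}

/* Start accounting; this sketch assumes the CPU begins in user mode. */
static void acct_init(struct cpu_acct *a)
{
	a->user_ticks = 0;
	a->kernel_ticks = 0;
	a->last_stamp = atomic_load_explicit(&fake_jiffies,
					     memory_order_relaxed);
}

/* User->kernel transition: charge the elapsed ticks to user time. */
static void acct_user_exit(struct cpu_acct *a)
{
	a->user_ticks += acct_elapsed(a);
}

/* Kernel->user transition: charge the elapsed ticks to kernel time. */
static void acct_user_enter(struct cpu_acct *a)
{
	a->kernel_ticks += acct_elapsed(a);
}
```

The tradeoff is granularity: anything shorter than one tick is invisible
to this kind of accounting, which is presumably why it was suggested only
for "many use cases" rather than all of them.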

Discussion then turned to the possibility of merging RCU's nohz_full-CPU
tracking with the context-tracking code.  I noted that RCU needs to
detect not just whether a given CPU is currently in nohz_full mode, but
also whether that CPU was ever in that mode over a given time interval.
This is currently accomplished using a counter that is incremented
on entry to and exit from nohz_full mode.  If the counter has an even
value, then the CPU is in nohz_full mode.  If the counter changes over
some period of time, then the CPU had to have been in nohz_full mode at
some point during that period of time.
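
The counter scheme above can be sketched in user-space C11 as follows.
This is only an illustration under stated assumptions: struct cpu_ctx and
all of the function names are made up for this example and are not the
kernel's actual data structures, and the default sequentially consistent
atomic_fetch_add() stands in for the kernel's fully ordered atomic
increment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-CPU state; the name and layout are illustrative. */
struct cpu_ctx {
	atomic_uint counter;	/* even: in nohz_full mode, odd: not */
};

/* Start out odd, that is, not in nohz_full mode. */
static void ctx_init(struct cpu_ctx *c)
{
	atomic_init(&c->counter, 1);
}

/* Entering nohz_full mode takes the counter from odd to even. */
static void nohz_full_enter(struct cpu_ctx *c)
{
	unsigned int old = atomic_fetch_add(&c->counter, 1);

	assert(old & 1);	/* must not already be in nohz_full mode */
}

/* Exiting nohz_full mode takes the counter from even to odd. */
static void nohz_full_exit(struct cpu_ctx *c)
{
	unsigned int old = atomic_fetch_add(&c->counter, 1);

	assert(!(old & 1));	/* must currently be in nohz_full mode */
}

/* Sample the counter, as done at the start of an interval. */
static unsigned int ctx_snapshot(struct cpu_ctx *c)
{
	return atomic_load(&c->counter);
}

/* Does this snapshot show the CPU in nohz_full mode? */
static bool snap_in_nohz_full(unsigned int snap)
{
	return (snap & 1) == 0;
}

/*
 * Was the CPU in nohz_full mode at any point since snap was taken?
 * Either the snapshot itself was even, or the counter has changed,
 * in which case the CPU entered or exited that mode in the meantime.
 */
static bool was_in_nohz_full_since(struct cpu_ctx *c, unsigned int snap)
{
	return snap_in_nohz_full(snap) || ctx_snapshot(c) != snap;
}
```

Note that a brief entry and exit between the two samples leaves the
counter odd both times, but the changed value still reveals the visit.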

[ Just for the record, I incorrectly accused Linus of having written
the counter comparison.  It is instead the counter comparisons for
rcu_barrier() and synchronize_rcu_expedited() that I can blame on Linus. ]

This was followed by a call for lower-overhead maintenance of the above
counter, by weakening the associated memory barriers to something
like the lwsync instruction on PowerPC, which of course maps to the
instruction-free barrier() macro on x86 and other total-store-order
(TSO) systems.  This was the subject of some spirited post-meeting
discussions.  I was able to demonstrate [*] that at least one of
the four associated memory barriers needs to remain a full smp_mb(),
but was unable to come up with a similar on-the-spot demonstration for
the other three.  Some detailed documentation of RCU's memory-ordering
requirements therefore appears to be needed sooner rather than later,
given the severe penalties for weakening RCU-related barriers too much.

							Thanx, Paul

------------------------------------------------------------------------

[*] Memory-barrier demonstration

There are four of these memory barriers:

1.	Before counter increment when entering nohz_full mode (or idle).
2.	After counter increment when entering nohz_full mode (or idle).
3.	Before counter increment when exiting nohz_full mode (or idle).
4.	After counter increment when exiting nohz_full mode (or idle).

Currently, #1 and #2 are combined into one atomic increment, and #3 and
#4 are combined into another atomic increment.  It is easy to show that
#4 must remain a full barrier, because if it were not, the following
sequence of events could occur:

a.	CPU 0 loads the counter, preparing to exit nohz_full mode.

b.	CPU 0 loads a pointer to an RCU-protected data item.  (Even x86
	allows this to be reordered with (e) below.)

c.	CPU 1 removes that data item.

d.	CPU 1 does a grace period, but sees that CPU 0's counter still
	indicates that it is in nohz_full mode.

e.	CPU 0 increments and stores the counter.  (This was reordered
	with (b) above, which is allowed on x86.)

f.	CPU 1 completes its grace period, and frees the data item.

g.	CPU 0 continues accessing the RCU-protected data item.  Boom!!!
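
The ordering that prevents the above sequence can be sketched with C11
atomics in user space.  This is a model under stated assumptions, not
kernel code: ctx_counter, shared_item, and the function names are all
hypothetical, and the explicit atomic_thread_fence(memory_order_seq_cst)
calls stand in for smp_mb().  The point is that CPU 0's counter store and
its later pointer load form one side of a store-buffering pattern, with
CPU 1's removal and counter load forming the other side, so a full
barrier is needed on both sides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct item {
	int value;
};

/* Hypothetical state: even counter means CPU 0 is in nohz_full mode. */
static atomic_uint ctx_counter;
static _Atomic(struct item *) shared_item;

/* Make an item visible to readers. */
static void publish_item(struct item *p)
{
	atomic_store_explicit(&shared_item, p, memory_order_release);
}

/* CPU 0: exit nohz_full mode, with barriers #3 and #4 made explicit. */
static void nohz_full_exit(void)
{
	unsigned int old;

	atomic_thread_fence(memory_order_seq_cst);	/* barrier #3 */
	old = atomic_fetch_add_explicit(&ctx_counter, 1,
					memory_order_relaxed);	/* (a), (e) */
	assert(!(old & 1));	/* must have been in nohz_full mode */
	atomic_thread_fence(memory_order_seq_cst);	/* barrier #4: full */
}

/* CPU 0: the load in (b) can no longer be reordered before (e). */
static int reader_after_exit(void)
{
	struct item *p;

	nohz_full_exit();
	p = atomic_load_explicit(&shared_item, memory_order_acquire); /* (b) */
	return p ? p->value : -1;	/* (g) */
}

/*
 * CPU 1: remove the item (c), then check CPU 0's counter (d).
 * Returns true if CPU 0 may be treated as being in nohz_full mode.
 */
static bool remove_item_and_check_quiescent(void)
{
	atomic_store_explicit(&shared_item, NULL, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&ctx_counter,
				     memory_order_relaxed) & 1) == 0;
}
```

With both fences in place, either CPU 1 sees the odd counter and waits
for CPU 0, or CPU 0 sees the removal and never picks up the pointer.
Weakening barrier #4 to a compiler-only barrier would permit both CPUs
to miss each other's store, which is exactly the sequence (a)-(g).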

I will document RCU's memory-ordering constraints to see if #1-#3 can
be weakened.