From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
	tj@kernel.org, mingo@redhat.com, linux-kernel@vger.kernel.org,
	der.herr@hofr.at, dave@stgolabs.net, riel@redhat.com,
	viro@ZenIV.linux.org.uk, torvalds@linux-foundation.org
Subject: Re: [RFC][PATCH 12/13] stop_machine: Remove lglock
Date: Fri, 26 Jun 2015 09:14:28 -0700
Message-ID: <20150626161415.GY3717@linux.vnet.ibm.com>
In-Reply-To: <20150626123207.GZ19282@twins.programming.kicks-ass.net>

On Fri, Jun 26, 2015 at 02:32:07PM +0200, Peter Zijlstra wrote:
> On Thu, Jun 25, 2015 at 07:51:46AM -0700, Paul E. McKenney wrote:
> > > So please humour me and explain how all this is far more complicated ;-)
> > 
> > Yeah, I do need to get RCU design/implementation documentation put together.
> > 
> > In the meantime, RCU's normal grace-period machinery is designed to be
> > quite loosely coupled.  The idea is that almost all actions occur locally,
> > reducing contention and cache thrashing.  But an expedited grace period
> > needs tight coupling in order to be able to complete quickly.  Making
> > something that switches between loose and tight coupling in short order
> > is not at all simple.
> 
> But expedited just means faster, we never promised that
> sync_rcu_expedited is the absolute fastest primitive ever.

Which is good, because given that it is doing something to each and
every CPU, it most assuredly won't in any way resemble the absolute
fastest primitive ever.  ;-)

> So I really should go read the RCU code I suppose, but I don't get
> what's wrong with starting a forced quiescent state, then doing the
> stop_work spray, where each work will run the regular RCU tick thing to
> push it forwards.
> 
> From my feeble memories, what I remember is that the last cpu to
> complete a GP on a leaf node will push the completion up to the next
> level, until at last we've reached the root of your tree and we can
> complete the GP globally.

That is true, the task that notices the last required quiescent state
will push up the tree and notice that the grace period has ended.
If that task is not the grace-period kthread, it will then awaken
the grace-period kthread.

> To me it just makes more sense to have a single RCU state machine. With
> expedited we'll push it as fast as we can, but no faster.

Suppose that someone invokes synchronize_sched_expedited(), but there
is no normal grace period in flight.  Then each CPU will note its own
quiescent state, but when it later might have tried to push it up the
tree, it will see that there is no grace period in effect, and will
therefore not bother.
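
To make the above concrete, here is a toy user-space model of the
push-up-the-tree step.  It is only a sketch under simplifying assumptions
(two levels, no locking, no kthread wakeup), and every name in it
(toy_tree, toy_node, toy_gp_start, toy_report_qs) is made up for
illustration rather than taken from the kernel.  The early return on
!gp_in_progress models the case where no grace period is in flight and
the CPU therefore does not bother pushing anything up.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_LEAVES     2	/* hypothetical: two leaf nodes */
#define TOY_CPUS_PER_LEAF 2	/* hypothetical: two CPUs per leaf */

struct toy_node {
	unsigned long qsmask;	/* bits that still owe a quiescent state */
};

struct toy_tree {
	struct toy_node root;
	struct toy_node leaf[TOY_NR_LEAVES];
	bool gp_in_progress;
};

/* Start a grace period: every CPU and every leaf owes a quiescent state. */
static void toy_gp_start(struct toy_tree *t)
{
	for (int i = 0; i < TOY_NR_LEAVES; i++)
		t->leaf[i].qsmask = (1UL << TOY_CPUS_PER_LEAF) - 1;
	t->root.qsmask = (1UL << TOY_NR_LEAVES) - 1;
	t->gp_in_progress = true;
}

/*
 * Report a quiescent state for @cpu.  Returns true only if this report
 * was the last one required, so that the caller is the one who notices
 * that the grace period has ended.
 */
static bool toy_report_qs(struct toy_tree *t, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_LEAVES * TOY_CPUS_PER_LEAF);

	int l = cpu / TOY_CPUS_PER_LEAF;
	unsigned long bit = 1UL << (cpu % TOY_CPUS_PER_LEAF);

	if (!t->gp_in_progress)
		return false;	/* no grace period: nothing to push up */
	if (!(t->leaf[l].qsmask & bit))
		return false;	/* this CPU has already reported */
	t->leaf[l].qsmask &= ~bit;
	if (t->leaf[l].qsmask)
		return false;	/* other CPUs on this leaf still pending */
	t->root.qsmask &= ~(1UL << l);	/* last on the leaf: push up */
	if (t->root.qsmask)
		return false;	/* other leaves still pending */
	t->gp_in_progress = false;	/* root is clear: grace period over */
	return true;
}
```

In this model, calling toy_report_qs() before toy_gp_start() always
returns false, which is the "will therefore not bother" case.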

OK, we could have synchronize_sched_expedited() tell the grace-period
kthread to start a grace period if one was not already in progress.
But that still isn't good enough, because the grace-period kthread will
take some time to initialize the new grace period, and if we hammer all
the CPUs before the initialization is complete, the resulting quiescent
states cannot be counted against the new grace period.  (The reason for
this is that there is some delay between the actual quiescent state
and the time that it is reported, so we have to be very careful not
to incorrectly report a quiescent state from an earlier grace period
against the current grace period.)

OK, the grace-period kthread could tell synchronize_sched_expedited()
when it has finished initializing the grace period, though this is
starting to get a bit on the Rube Goldberg side.  But this -still- is
not good enough, because even though the grace-period kthread has fully
initialized the new grace period, the individual CPUs are unaware of it.
And they will therefore continue to ignore any quiescent state that they
encounter, because they cannot prove that it actually happened after
the start of the current grace period.
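
The "cannot prove that it actually happened after the start" rule can be
sketched as a comparison of grace-period numbers.  Again, this is a toy
model under my own assumptions, not the kernel's implementation: the
names (toy_rcu_state, toy_start_gp, toy_cpu_notice_gp, toy_cpu_qs) are
invented, and the single global counter with per-CPU snapshots is a
deliberate simplification.

```c
#include <stdbool.h>

#define TOY_NR_CPUS 4	/* hypothetical CPU count */

struct toy_rcu_state {
	unsigned long gpnum;			/* current grace-period number */
	unsigned long cpu_gpnum[TOY_NR_CPUS];	/* last GP each CPU has noticed */
	bool qs_counted[TOY_NR_CPUS];		/* QS accepted for the current GP */
};

/* Grace-period side: begin a new grace period. */
static void toy_start_gp(struct toy_rcu_state *s)
{
	s->gpnum++;
	for (int i = 0; i < TOY_NR_CPUS; i++)
		s->qs_counted[i] = false;
}

/* CPU side: the CPU becomes aware of the current grace period. */
static void toy_cpu_notice_gp(struct toy_rcu_state *s, int cpu)
{
	s->cpu_gpnum[cpu] = s->gpnum;
}

/*
 * CPU side: a quiescent state occurred.  It is counted only if the CPU
 * already knew about the current grace period; otherwise the quiescent
 * state might belong to an earlier grace period and must be ignored.
 */
static bool toy_cpu_qs(struct toy_rcu_state *s, int cpu)
{
	if (s->cpu_gpnum[cpu] != s->gpnum)
		return false;
	s->qs_counted[cpu] = true;
	return true;
}
```

Hammering a CPU between toy_start_gp() and its toy_cpu_notice_gp()
produces a quiescent state that toy_cpu_qs() rejects, which is the
situation described above.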

OK, we could have some sort of indication when all CPUs become aware
of the new grace period by having them atomically manipulate a global
counter.  Presumably we have some flag indicating when this is and is
not needed so that we avoid the killer memory contention in the common
case where it is not needed.  But this -still- isn't good enough, because
idle CPUs never will become aware of the new grace period -- by design,
as they are supposed to be able to sleep through an arbitrary number of
grace periods.

OK, so we could have some sort of indication when all non-idle CPUs
become aware of the new grace period.  But there could be races where
an idle CPU suddenly becomes non-idle just after it was reported that
all the non-idle CPUs were aware of the grace period.  This would result
in a hang, because this newly non-idle CPU might not have noticed
the new grace period at the time that synchronize_sched_expedited()
hammers it, which would mean that this newly non-idle CPU would refuse
to report the resulting quiescent state.
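
This is a classic check-then-act race, and a deterministic toy model
shows its shape.  The sketch below is an illustration only: the names
(toy_cpu, toy_all_nonidle_aware, toy_count_refusals) are hypothetical,
and real idle tracking is far more involved than two booleans.

```c
#include <stdbool.h>

struct toy_cpu {
	bool idle;	/* CPU is idle and may sleep through grace periods */
	bool aware;	/* CPU has noticed the current grace period */
};

/* The "check": has every non-idle CPU noticed the new grace period? */
static bool toy_all_nonidle_aware(const struct toy_cpu *cpus, int n)
{
	for (int i = 0; i < n; i++)
		if (!cpus[i].idle && !cpus[i].aware)
			return false;
	return true;
}

/*
 * The "act": count the CPUs that, if hammered right now, would refuse
 * to report the resulting quiescent state because they are running
 * but have not yet noticed the grace period.
 */
static int toy_count_refusals(const struct toy_cpu *cpus, int n)
{
	int refused = 0;

	for (int i = 0; i < n; i++)
		if (!cpus[i].idle && !cpus[i].aware)
			refused++;
	return refused;
}
```

If a CPU flips from idle to non-idle after toy_all_nonidle_aware() has
returned true but before the hammering, toy_count_refusals() becomes
nonzero, and in the scheme discussed above the expedited grace period
would then wait forever for that CPU.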

OK, so the grace-period kthread could track and report the set of CPUs
that had ever been idle since synchronize_sched_expedited() contacted it.
But holy overhead Batman!!!

And that is just one of the possible interactions with the grace-period
kthread.  It might be in the middle of setting up a new grace period.
It might be in the middle of cleaning up after the last grace period.
It might be waiting for a grace period to complete, and the last quiescent
state was just reported, but hasn't propagated all the way up yet.  All
of these would need to be handled correctly, and a number of them would
be as messy as the above scenario.  Some might be even more messy.

I feel like there is a much easier way, but cannot yet articulate it.
I came across a couple of complications and a blind alley with it thus
far, but it still looks promising.  I expect to be able to generate
actual code for it within a few days, but right now it is just weird
abstract shapes in my head.  (Sorry, if I knew how to describe them,
I could just write the code!  When I do write the code, it will probably
seem obvious and trivial, that being the usual outcome...)

							Thanx, Paul


