All of lore.kernel.org
 help / color / mirror / Atom feed
From: Krister Johansen <kjlx@templeofstupid.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
	jiangshanlai@gmail.com, dipankar@in.ibm.com,
	akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
	josh@joshtriplett.org, tglx@linutronix.de, peterz@infradead.org,
	rostedt@goodmis.org, dhowells@redhat.com, edumazet@google.com,
	fweisbec@gmail.com, oleg@redhat.com, bobby.prani@gmail.com,
	stable@vger.kernel.org, gregkh@linuxfoundation.org
Subject: Re: [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup
Date: Thu, 8 Jun 2017 13:11:48 -0700	[thread overview]
Message-ID: <20170608201148.GA2553@templeofstupid.com> (raw)
In-Reply-To: <1495749601-21574-45-git-send-email-paulmck@linux.vnet.ibm.com>

Hi Paul,

On Thu, May 25, 2017 at 02:59:18PM -0700, Paul E. McKenney wrote:
> Wait/wakeup operations do not guarantee ordering on their own.  Instead,
> either locking or memory barriers are required.  This commit therefore
> adds memory barriers to wake_nocb_leader() and nocb_leader_wait().
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
>  kernel/rcu/tree_plugin.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 0b1042545116..573fbe9640a0 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1810,6 +1810,7 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
>  	if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
>  		/* Prior smp_mb__after_atomic() orders against prior enqueue. */
>  		WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
> +		smp_mb(); /* ->nocb_leader_sleep before swake_up(). */
>  		swake_up(&rdp_leader->nocb_wq);
>  	}
>  }
> @@ -2064,6 +2065,7 @@ static void nocb_leader_wait(struct rcu_data *my_rdp)
>  	 * nocb_gp_head, where they await a grace period.
>  	 */
>  	gotcbs = false;
> +	smp_mb(); /* wakeup before ->nocb_head reads. */
>  	for (rdp = my_rdp; rdp; rdp = rdp->nocb_next_follower) {
>  		rdp->nocb_gp_head = READ_ONCE(rdp->nocb_head);
>  		if (!rdp->nocb_gp_head)

May I impose upon you to CC this patch to stable, and tag it as fixing
abedf8e241?  I ran into this on a production 4.9 branch.  When I
debugged it, I discovered that it went all the way back to 4.6.  The
tl;dr is that at least for some environments, the missed wakeup
manifests itself as a series of hung-task warnings to console and if I'm
unlucky it can also generate a hang that can block interactive logins
via ssh.

In case it's useful, I'm attching the RCA that I sent out before I
circulated a similar patch for internal review.  You beat me to the
punch in terms of getting a patch out a public mailing list.

--- >8 cut here 8< ---

Production boxes go through fits where they generate spurts of hung task
warnings.  Sometimes, the slowdown is enough to ensnare a process
holding the rtnl_lock, or prevent systemd from letting ssh logins on to
the box.

Below is a splat from a relatively benign ocurrence:

   INFO: task uc-spawn:34443 blocked for more than 120 seconds.
         Not tainted 4.9.4-1.el7.x86_64 #1
   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
   uc-spawn        D    0 34443      1 0x00000082
    ffff883fec88e4c0 0000000000000000 ffff883ff1b4ac00 ffff883ffddd9540
    ffff883fee999600 ffffc9001c907c30 ffffffff8175e3f6 0000000000000202
    ffff883fee999600 7fffffffffffffff ffff883fee999600 7fffffffffffffff
   Call Trace:
    [<ffffffff8175e3f6>] ? __schedule+0x1a6/0x650
    [<ffffffff8175e8d6>] schedule+0x36/0x80
    [<ffffffff81761aec>] schedule_timeout+0x21c/0x3a0
    [<ffffffff8175f3b2>] wait_for_completion+0xf2/0x130
    [<ffffffff810b6710>] ? wake_up_q+0x80/0x80
    [<ffffffff810eec27>] __wait_rcu_gp+0xd7/0x100
    [<ffffffff810f264e>] synchronize_sched+0x5e/0x80
    [<ffffffff810f1da0>] ? __call_rcu+0x320/0x320
    [<ffffffff810eeb30>] ? trace_raw_output_rcu_utilization+0x60/0x60
    [<ffffffff8125673b>] kern_unmount+0x2b/0x40
    [<ffffffff812d6f85>] mq_put_mnt+0x15/0x20
    [<ffffffff812d76de>] put_ipc_ns+0x3e/0xa0
    [<ffffffff810ab768>] free_nsproxy+0x38/0x90
    [<ffffffff810ab8c9>] switch_task_namespaces+0x59/0x60
    [<ffffffff810ab8e0>] exit_task_namespaces+0x10/0x20
    [<ffffffff8108d787>] do_exit+0x2b7/0xac0
    [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
    [<ffffffff8108e00f>] do_group_exit+0x3f/0xb0
    [<ffffffff8108e094>] SyS_exit_group+0x14/0x20
    [<ffffffff81003a47>] do_syscall_64+0x67/0x180
    [<ffffffff8176306b>] entry_SYSCALL64_slow_path+0x25/0x25

In order to debug this further, a kdump was obtained.

crash> ps 34443
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
  34443      1  33  ffff883fee999600  UN   0.0       0      0  uc-spawn

crash> bt 34443
PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
 #0 [ffffc9001c907bb8] __schedule at ffffffff8175e3f6
 #1 [ffffc9001c907c38] schedule at ffffffff8175e8d6
 #2 [ffffc9001c907c50] schedule_timeout at ffffffff81761aec
 #3 [ffffc9001c907cf0] wait_for_completion at ffffffff8175f3b2
 #4 [ffffc9001c907d50] __wait_rcu_gp at ffffffff810eec27
 #5 [ffffc9001c907da0] synchronize_sched at ffffffff810f264e
 #6 [ffffc9001c907df0] kern_unmount at ffffffff8125673b
 #7 [ffffc9001c907e08] mq_put_mnt at ffffffff812d6f85
 #8 [ffffc9001c907e18] put_ipc_ns at ffffffff812d76de
 #9 [ffffc9001c907e30] free_nsproxy at ffffffff810ab768
#10 [ffffc9001c907e48] switch_task_namespaces at ffffffff810ab8c9
#11 [ffffc9001c907e70] exit_task_namespaces at ffffffff810ab8e0
#12 [ffffc9001c907e80] do_exit at ffffffff8108d787
#13 [ffffc9001c907ee8] do_group_exit at ffffffff8108e00f
#14 [ffffc9001c907f18] sys_exit_group at ffffffff8108e094
#15 [ffffc9001c907f28] do_syscall_64 at ffffffff81003a47
    RIP: 00007f6b6e1f4209  RSP: 00007ffc488028b8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: 0000000000000001  RCX: 00007f6b6e1f4209
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000001
    RBP: 00007f6b6e4ea840   R8: 000000000000003c   R9: 00000000000000e7
    R10: fffffffffffffe90  R11: 0000000000000246  R12: 00007f6b6e4ea840
    R13: 00007f6b6e4efe80  R14: 00007ffc48802918  R15: 00007ffc48802998
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

This particular process is blocked in its exit path, waiting for the
synchronize_sched() call it invoked to complete.  This function queues a
call_rcu_sched() invocation with a callback that triggers a wakeup of
the invoking process.  In order for this process to become unblocked,
a grace period must elapse, and a the callback must be called.

Typically, bugs like this imply that some processor failed to yield, or
some thread is holding rcu_read_lock_sched(), and hence preventing a
grace period from occurring.  However, after dilligent inspection I can
find no evidence of such a thing occurring.  After dumping all stacks,
no threads are in call frames that hold this type of read-side critical
section.  All cpus appear to be mostly idle and responding to interrupt
events.  The rcu threads themselves are not blocked waiting for
completion of a grace period either.

A quick look at rcu_sched_state shows nothing obviously asmiss:

crash> p rcu_sched_state
rcu_sched_state = $1 = {
  node = {{
      lock = {
        raw_lock = {
          val = {
            counter = 0
          }
        }
      },
      gpnum = 22469944,
      completed = 22469944,
      qsmask = 0,
      qsmaskinit = 3,
      qsmaskinitnext = 0,
      expmask = 0,
      expmaskinit = 3,
      expmaskinitnext = 0,
      grpmask = 0,
      grplo = 0,
      grphi = 71,
      grpnum = 0 '\000',
      level = 0 '\000',
      wait_blkd_tasks = false,
      parent = 0x0,
      blkd_tasks = {
        next = 0xffffffff81c70428 <rcu_sched_state+104>,
        prev = 0xffffffff81c70428 <rcu_sched_state+104>
      },
      gp_tasks = 0x0,
      exp_tasks = 0x0,
      boost_tasks = 0x0,
      <...>

Note that according to this, the most recent gp completed was numbered
22469944.

Now here's the rcu_data for CPU 33, where the process got stuck.  It's
at least 15 grace periods behind.

crash> struct rcu_data 0xffff883ffddda380
struct rcu_data {
  completed = 22469929,
  gpnum = 22469930,
  rcu_qs_ctr_snap = 18203094,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = true,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 8589934592,
  ticks_this_gp = 0,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {0, 0, 0, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10626703,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffddcffd0,
  dynticks_snap = 28482762,
  dynticks_fqs = 19371,
  offline_fqs = 0,
  cond_resched_completed = 22421984,
  n_rcu_pending = 2265627,
  n_rp_core_needs_qs = 31314,
  n_rp_report_qs = 595411,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 757,
  n_rp_gp_completed = 367886,
  n_rp_gp_started = 38620,
  n_rp_nocb_defer_wakeup = 126,
  n_rp_need_nothing = 1262827,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = {
    counter = 21444
  },
  exp_workdone1 = {
    counter = 0
  },
  exp_workdone2 = {
    counter = 0
  },
  exp_workdone3 = {
    counter = 0
  },
  nocb_head = 0xffff881f8444a138,
  nocb_tail = 0xffff881f9f8ee8d0,
  nocb_q_count = {
    counter = 500
  },
  nocb_q_count_lazy = {
    counter = 18
  },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffddda508,
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193e3e80,
      prev = 0xffffc900193e3e80
    }
  },
  nocb_kthread = 0xffff883ff0861600,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0xffff883fe18c71a0,
  nocb_gp_tail = 0xffffc9001c7a3da0,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffde1a380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1002642,
  cpu = 33,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

Notice also that the nocb_q_count is 500.  With a bit of looking we can
see that the blocked process has its callback queued in both the
nocb_gp_head list and the nocb_head list.  I'll spare readers all of the
output, but the correct linkage was located by walking the list entries
and then grepping for the correct pointer.

crash> struct completion 0xffffc9001c907dc0
struct completion {
  done = 0,
  wait = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    task_list = {
      next = 0xffffc9001c907d10,
      prev = 0xffffc9001c907d10
    }
  }
}

crash> struct rcu_synchronize 0xffffc9001c907db0
struct rcu_synchronize {
  head = {
    next = 0xffff883fe4d18398,
    func = 0xffffffff810eeb30 <wakeme_after_rcu>
  },
  completion = {
    done = 0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffffc9001c907d10,
        prev = 0xffffc9001c907d10
      }
    }
  }
}

This is the stack for CPU 33's nocb kthread.  It's sleeping.

crash> bt 0xffff883ff0861600
PID: 279    TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
 #0 [ffffc900193e3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193e3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193e3e48] rcu_nocb_kthread at ffffffff810f2a67
 #3 [ffffc900193e3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193e3f50] ret_from_fork at ffffffff81763215

Since CPU 33 is a follower in this configuration, it needs to be awoken
by a leader.  In this case CPU 32 is the leader assigned to wake CPU
33's nocb_kthread.  It's also sleeping.

crash> bt 0xffff883ff0805800
PID: 271    TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"
 #0 [ffffc900193a3db0] __schedule at ffffffff8175e3f6
 #1 [ffffc900193a3e30] schedule at ffffffff8175e8d6
 #2 [ffffc900193a3e48] rcu_nocb_kthread at ffffffff810f2ad5
 #3 [ffffc900193a3ec8] kthread at ffffffff810a9e19
 #4 [ffffc900193a3f50] ret_from_fork at ffffffff81763215

Here's a red flag: The leader is asleep despite nocb_leader_sleep being
false. The code in nocb_leader_wait() has an swait_event_interruptible()
that's supposed to only let this thread wait when nocb_leader_sleep is
true.

crash> struct rcu_data.nocb_leader_sleep 0xffff883ffdd9a380
  nocb_leader_sleep = false

The leader's wq contains active entries, so something is sleeping here:

crash> struct rcu_data.nocb_wq 0xffff883ffdd9a380
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  }

crash> struct -o rcu_data.nocb_wq 0xffff883ffdd9a380
struct rcu_data {
  [ffff883ffdd9a518] struct swait_queue_head nocb_wq;
}

crash> struct list_head 0xffffc900193a3e80
struct list_head {
  next = 0xffff883ffdd9a520,
  prev = 0xffff883ffdd9a520
}

This is the waitq of the callback parked by the process that was
stranded on cpu 33.  It's also correctly configured.

crash> struct wait_queue_t 0xffffc9001c907cf8
struct wait_queue_t {
  flags = 1,
  private = 0xffff883fee999600,
  func = 0xffffffff810b6710 <default_wake_function>,
  task_list = {
    next = 0xffffc9001c907dd0,
    prev = 0xffffc9001c907dd0
  }
}
crash> ps -m 34443
[ 0 00:11:07.318] [UN]  PID: 34443  TASK: ffff883fee999600  CPU: 33  COMMAND: "uc-spawn"
crash> ps -m 0xffff883ff0861600
[ 0 00:11:44.271] [IN]  PID: 279    TASK: ffff883ff0861600  CPU: 39  COMMAND: "rcuos/33"
crash> ps -m ffff883ff0805800
[ 0 00:11:44.208] [IN]  PID: 271    TASK: ffff883ff0805800  CPU: 38  COMMAND: "rcuos/32"

Looking at the PS data, we can see that our processes have been stranded
for more than 11 minutes.  The rcuos threads for 37 seconds longer than
pid 34443.

The reason this appears to be a transient hang is due to the code in
__call_rcu_nocb_enqueue().  We seem to be missing the initial wakeup,
and once that happens, we'll only send another wakeup if nocb_q_count is
greater than qlen_last_fqs_check + qhimark.  In our case, we need to
accumulate 10,000 callbacks before we get another bite at this.  Since
these cpus are relatively idle, we've only accrued 500 on the follower
and ~2000 on the leader.  Leader's full state if below:

crash> struct rcu_data 0xffff883ffdd9a380
struct rcu_data {
  completed = 22469935,
  gpnum = 22469935,
  rcu_qs_ctr_snap = 17067139,
  cpu_no_qs = {
    b = {
      norm = 0 '\000',
      exp = 0 '\000'
    },
    s = 0
  },
  core_needs_qs = false,
  beenonline = true,
  gpwrap = false,
  mynode = 0xffffffff81c70600 <rcu_sched_state+576>,
  grpmask = 4294967296,
  ticks_this_gp = 1,
  nxtlist = 0x0,
  nxttail = {0x0, 0x0, 0x0, 0x0},
  nxtcompleted = {18446744073709551526, 18446744073709551526, 18446744073709551526, 0},
  qlen_lazy = 0,
  qlen = 0,
  qlen_last_fqs_check = 0,
  n_cbs_invoked = 0,
  n_nocbs_invoked = 10730040,
  n_cbs_orphaned = 0,
  n_cbs_adopted = 0,
  n_force_qs_snap = 0,
  blimit = 10,
  dynticks = 0xffff883ffdd8ffd0,
  dynticks_snap = 29638122,
  dynticks_fqs = 19616,
  offline_fqs = 0,
  cond_resched_completed = 22467956,
  n_rcu_pending = 2314079,
  n_rp_core_needs_qs = 30976,
  n_rp_report_qs = 654184,
  n_rp_cb_ready = 0,
  n_rp_cpu_needs_gp = 813,
  n_rp_gp_completed = 369917,
  n_rp_gp_started = 37049,
  n_rp_nocb_defer_wakeup = 135,
  n_rp_need_nothing = 1251981,
  barrier_head = {
    next = 0x0,
    func = 0xffffffff810f0090 <rcu_barrier_callback>
  },
  exp_workdone0 = {
    counter = 20577
  },
  exp_workdone1 = {
    counter = 0
  },
  exp_workdone2 = {
    counter = 0
  },
  exp_workdone3 = {
    counter = 0
  },
  nocb_head = 0xffff883ff0f6f700,
  nocb_tail = 0xffff881f9f8ed2d0,
  nocb_q_count = {
    counter = 2029
  },
  nocb_q_count_lazy = {
    counter = 71
  },
  nocb_follower_head = 0x0,
  nocb_follower_tail = 0xffff883ffdd9a508,
  nocb_wq = {
    lock = {
      raw_lock = {
        val = {
          counter = 0
        }
      }
    },
    task_list = {
      next = 0xffffc900193a3e80,
      prev = 0xffffc900193a3e80
    }
  },
  nocb_kthread = 0xffff883ff0805800,
  nocb_defer_wakeup = 0,
  nocb_gp_head = 0x0,
  nocb_gp_tail = 0xffff883feb6c4d00,
  nocb_leader_sleep = false,
  nocb_next_follower = 0xffff883ffddda380,
  nocb_leader = 0xffff883ffdd9a380,
  softirq_snap = 1061978,
  cpu = 32,
  rsp = 0xffffffff81c703c0 <rcu_sched_state>
}

The nocb path can miss a wakeup sent to the master. If the system is
fairly idle it can take a substantial amount of time in order for
qhimark to be reached so that we kick the leader again.  It might be
possible to tune qhimark, but this is intellectually unsatisfying,
especially since that tunable has use outside of the nocb path.

Instead, it looks like in the conversion to simple wait queues the
sematics of swake_up differ slightly from wake_up and the rcu_nocb code
failed to make a necessary adjustment.

Here's the wakeup path:

static void wake_nocb_leader(struct rcu_data *rdp, bool force)
{
        struct rcu_data *rdp_leader = rdp->nocb_leader;

        if (!READ_ONCE(rdp_leader->nocb_kthread))
                return;
        if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
                /* Prior smp_mb__after_atomic() orders against prior enqueue. */
                WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
                swake_up(&rdp_leader->nocb_wq);
        }
}

And here's swake_up iteslf:

void swake_up(struct swait_queue_head *q)
{
        unsigned long flags;

        if (!swait_active(q))
                return;

        raw_spin_lock_irqsave(&q->lock, flags);
        swake_up_locked(q);
        raw_spin_unlock_irqrestore(&q->lock, flags);
}

It's important to note that swait_active() isn't called with any locks
held so we're in a situation where the write to nocb_leader_sleep and,
potentially, a store to q->task_list are in flight on different CPUs at
the same time.  A comment in wake_nocb_leader notes that the
smp_mb__after_atomic() gives us ordering against the prior enqueue, but
that's a red herring.  In order to wake the leader, the waker's store to
nocb_leader_sleep must be visibile on the CPU where the awaking process
can observe it.

The documentation in memory-barriers.txt cautions us in this way:

   [!] Note that the memory barriers implied by the sleeper and the
   waker do _not_ order multiple stores before the wake-up with respect
   to loads of those stored values after the sleeper has called
   set_current_state().  For instance, if the sleeper does:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated)
                break;
        __set_current_state(TASK_RUNNING);
        do_something(my_data);

   and the waker does:

        my_data = value;
        event_indicated = 1;
        wake_up(&event_wait_queue);

   there's no guarantee that the change to event_indicated will be
   perceived by the sleeper as coming after the change to my_data.  In
   such a circumstance, the code on both sides must interpolate its own
   memory barriers between the separate data accesses.  Thus the above
   sleeper ought to do:

        set_current_state(TASK_INTERRUPTIBLE);
        if (event_indicated) {
                smp_rmb();
                do_something(my_data);
        }

   and the waker should do:

        my_data = value;
        smp_wmb();
        event_indicated = 1;
        wake_up(&event_wait_queue);

  [https://www.kernel.org/doc/Documentation/memory-barriers.txt]

The RT-linux tree had a patch in its swait_head_has_waiters() function
to ensure that wakeups were not missed:

https://www.spinics.net/lists/linux-rt-users/msg10340.html

That barrier was dropped from the patch as part of getting it ready for
mainline:

http://www.spinics.net/lists/linux-rt-users/msg13691.html

The patch in the prior message seemed to have a comment, though it's
possible it was subsequently dropped.  Also note that waitqueue_active
in the current code base contains the following comment:

/**
 * waitqueue_active -- locklessly test for waiters on the queue
 * @q: the waitqueue to test for waiters
 *
 * returns true if the wait list is not empty
 *
 * NOTE: this function is lockless and requires care, incorrect usage _will_
 * lead to sporadic and non-obvious failure.
 *
 * Use either while holding wait_queue_head_t::lock or when used for wakeups
 * with an extra smp_mb() like:
 *
 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq))         if (@cond)
 *        wake_up(wq);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq, &wait);
 *
 * Because without the explicit smp_mb() it's possible for the
 * waitqueue_active() load to get hoisted over the @cond store such that we'll
 * observe an empty wait list while the waiter might not observe @cond.
 *
 * Also note that this 'optimization' trades a spin_lock() for an smp_mb(),
 * which (when the lock is uncontended) are of roughly equal cost.
 */
[include/linux/wait.h]

That's exactly what we're seeing go wrong here.  The waiter didn't see
the condtion become true as it was going to sleep, and the waker didn't
see the waiter as being asleep.

Based upon all of this, it's pretty clear that we need a memory barrier
added to wake_nocb_leader().  The good news is that most other users of
swake_up(), are already using other atomic operations that cause updates
to their values to become visibile.

The fix should be something as simple as this:

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 56583e7..1322d0b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1765,8 +1765,8 @@ static void wake_nocb_leader(struct rcu_data *rdp, bool force)
        if (!READ_ONCE(rdp_leader->nocb_kthread))
                return;
        if (READ_ONCE(rdp_leader->nocb_leader_sleep) || force) {
-               /* Prior smp_mb__after_atomic() orders against prior enqueue. */
-               WRITE_ONCE(rdp_leader->nocb_leader_sleep, false);
+               /* mb ensures nocb_leader_sleep is observed prior to wakeup */
+               smp_store_mb(rdp_leader->nocb_leader_sleep, false);
                swake_up(&rdp_leader->nocb_wq);
        }
 }

--- >8 cut here 8< ---

Thanks,

-K

  reply	other threads:[~2017-06-08 20:11 UTC|newest]

Thread overview: 104+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-05-25 21:59 [PATCH tip/core/rcu 0/88] Commits for 4.13 Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 01/88] rcutorture: Add lockdep to one of the SRCU scenarios Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 02/88] rcutorture: Add three-level tree test for Tree SRCU Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 03/88] rcutorture: Fix bug in reporting Kconfig mis-settings Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 04/88] rcutorture: Add a scenario for Tiny SRCU Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 05/88] rcutorture: Add a scenario for Classic SRCU Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 06/88] rcu: Prevent rcu_barrier() from starting needless grace periods Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 07/88] rcutorture: Correctly handle CONFIG_RCU_TORTURE_TEST_* options Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 08/88] rcutorture: Update test scenarios based on new Kconfig dependencies Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 09/88] srcu: Eliminate possibility of destructive counter overflow Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 10/88] rcu: Complain if blocking in preemptible RCU read-side critical section Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 11/88] rcuperf: Defer expedited/normal check to end of test Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 12/88] rcuperf: Remove conflicting Kconfig options Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 13/88] rcu: Remove obsolete reference to synchronize_kernel() Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 14/88] rcuperf: Add ability to performance-test call_rcu() and friends Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 15/88] rcuperf: Add a Kconfig-fragment file for Classic SRCU Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 16/88] rcu: Make sync_rcu_preempt_exp_done() return bool Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 17/88] checkpatch: Remove checks for expedited grace periods Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 18/88] rcuperf: Add test for dynamically initialized srcu_struct Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 19/88] doc/atomic_ops: Clarify smp_mb__{before,after}_atomic() Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 20/88] atomics: Add header comment so spin_unlock_wait() Paul E. McKenney
2017-06-10 15:02   ` Andrea Parri
2017-05-25 21:58 ` [PATCH tip/core/rcu 21/88] docs: Fix typo in Documentation/memory-barriers.txt Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 22/88] rcuperf: Add the ability to test tiny RCU flavors Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 23/88] srcu: Make Classic and Tree SRCU announce themselves at bootup Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 24/88] rcutorture: Reduce CPUs dedicated to testing Classic SRCU Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 25/88] srcu: Shrink Tiny SRCU a bit more Paul E. McKenney
2017-05-25 21:58 ` [PATCH tip/core/rcu 26/88] rcuperf: Set more user-friendly defaults Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 27/88] srcu-cbmc: Use /usr/bin/awk instead of /bin/awk Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 28/88] rcuperf: Add writer_holdoff boot parameter Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 29/88] rcutorture: Add "git diff" output to testid.txt file Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 30/88] srcu: Document auto-expediting requirement Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 31/88] doc: Take tail recursion into account in RCU requirements Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 32/88] rcu: Add preemptibility checks in rcu_sched_qs() and rcu_bh_qs() Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 33/88] rcu: Print out rcupdate.c non-default boot-time settings Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 34/88] rcu: Update rcu_bootup_announce_oddness() Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 35/88] srcu: Make exp_holdoff module parameter be static Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 36/88] srcu: Print non-default exp_holdoff values at boot time Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 37/88] rcu: Add lockdep_assert_held() teeth to tree.c Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 38/88] rcu: Add lockdep_assert_held() teeth to tree_plugin.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 39/88] srcu: Make SRCU be once again optional Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 40/88] srcu: Shrink Tiny SRCU a bit Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 41/88] srcu: Add DEBUG_OBJECTS_RCU_HEAD functionality Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 42/88] rcu: Make synchronize_rcu_mult() check for duplicates Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 43/88] sched: Rely on synchronize_rcu_mult() de-duplication Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 44/88] rcu: Use RCU_NOCB_WAKE rather than RCU_NOGP_WAKE Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 45/88] rcu: Add memory barriers for NOCB leader wakeup Paul E. McKenney
2017-06-08 20:11   ` Krister Johansen [this message]
2017-06-08 20:55     ` Paul E. McKenney
2017-06-08 21:28       ` Krister Johansen
2017-06-08 23:47         ` Paul E. McKenney
2017-06-09  3:24           ` Krister Johansen
2017-05-25 21:59 ` [PATCH tip/core/rcu 46/88] torture: Add --kconfig argument to kvm.sh Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 47/88] rcu: Flag need for rcu_node_tree.h and rcu_segcblist.h visibility Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 48/88] rcutorture: Don't wait for kernel when all builds fail Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 49/88] rcu: Move docbook comments out of rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 50/88] rcu: Move rcu_expedited and rcu_normal externs from rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 51/88] rcu: Move expediting-related access/control out of rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 52/88] rcu: Move torture-related definitions from rcupdate.h to rcu.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 53/88] rcu: Remove UINT_CMP_GE() and UINT_CMP_LT() Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 54/88] rcu: Move rcupdate.h to new empty-function style Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 55/88] rcu: Eliminate the unused __rcu_is_watching() function Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 56/88] rcu: Move the RCU_SCHEDULER_ definitions from rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 57/88] rcu: Remove linux/debugobjects.h " Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 58/88] rcu: Improve __call_rcu() debug-objects error message Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 59/88] rcu: Move rcu_is_nocb_cpu() from rcupdate.h to rcu.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 60/88] rcu: Move rcu_ftrace_dump() " Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 61/88] rcu: move rcupdate.h to the new true/false-function style Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 62/88] rcu: Move torture-related functions out of rcutiny.h and rcutree.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 63/88] rcu: Move rcu_request_urgent_qs_task() " Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 64/88] rcu: Move rcutiny.h to new empty-function style Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 65/88] rcu: move rcutiny.h to the new true/false-function style Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 66/88] srcu: Prevent sdp->srcu_gp_seq_needed counter wrap Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 67/88] srcu: Shrink srcu.h by moving docbook and private function Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 68/88] srcu: Apply trivial callback lists to shrink Tiny SRCU Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 69/88] lockdep: Use consistent printing primitives Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 70/88] rcu: Refactor #includes from include/linux/rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 71/88] rcu: Convert rnp->lock wrappers to macros for SRCU use Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 72/88] rcu: Move rnp->lock wrappers " Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 73/88] srcu: Use rnp->lock wrappers to replace explicit memory barriers Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 74/88] rcu: Remove *_SLOW_* Kconfig options Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 75/88] rcu: Remove the RCU_KTHREAD_PRIO Kconfig option Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 76/88] rcu: Remove nohz_full full-system-idle state machine Paul E. McKenney
2017-05-25 23:03   ` Linus Torvalds
2017-05-25 23:42     ` Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 77/88] rcu: Remove #ifdef moving rcu_end_inkernel_boot from rcupdate.h Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 78/88] rcu: Remove typecheck() from RCU locking wrapper functions Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 79/88] rcu: Remove the now-obsolete PROVE_RCU_REPEATEDLY Kconfig option Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 80/88] rcu: Remove SPARSE_RCU_POINTER " Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 81/88] srcu: Fix rcutorture-statistics typo Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 82/88] srcu: Remove Classic SRCU Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 83/88] rcu: Remove debugfs tracing Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 84/88] rcu: Eliminate NOCBs CPU-state Kconfig options Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 85/88] rcu: Move RCU non-debug Kconfig options to kernel/rcu Paul E. McKenney
2017-05-25 21:59 ` [PATCH tip/core/rcu 86/88] rcu: Move RCU debug " Paul E. McKenney
2017-05-25 22:00 ` [PATCH tip/core/rcu 87/88] rcu: Remove event tracing from Tiny RCU Paul E. McKenney
2017-05-25 22:00 ` [PATCH tip/core/rcu 88/88] rcu: Remove RCU CPU stall warnings " Paul E. McKenney
2017-06-09 13:52 ` [PATCH tip/core/rcu 0/88] Commits for 4.13 Steven Rostedt
2017-06-09 16:24   ` Paul E. McKenney
2017-06-09 16:39     ` Steven Rostedt
2017-06-09 17:20       ` Paul E. McKenney
2017-06-09 17:33         ` Steven Rostedt
2017-06-09 17:57           ` Paul E. McKenney
2017-06-09 18:01             ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170608201148.GA2553@templeofstupid.com \
    --to=kjlx@templeofstupid.com \
    --cc=akpm@linux-foundation.org \
    --cc=bobby.prani@gmail.com \
    --cc=dhowells@redhat.com \
    --cc=dipankar@in.ibm.com \
    --cc=edumazet@google.com \
    --cc=fweisbec@gmail.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=jiangshanlai@gmail.com \
    --cc=josh@joshtriplett.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@kernel.org \
    --cc=oleg@redhat.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.