From: Oleg Nesterov <oleg@redhat.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
Ingo Molnar <mingo@kernel.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Linux-MM <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>,
Thomas Gleixner <tglx@linutronix.de>,
Steven Rostedt <rostedt@goodmis.org>,
Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()
Date: Mon, 30 Sep 2013 14:42:14 +0200 [thread overview]
Message-ID: <20130930124214.GA19560@redhat.com> (raw)
In-Reply-To: <20130929200114.GC19582@linux.vnet.ibm.com>
On 09/29, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> >
> > struct xxx_struct {
> > atomic_t counter;
> > };
> >
> > static inline bool xxx_is_idle(struct xxx_struct *xxx)
> > {
> > return atomic_read(&xxx->counter) == 0;
> > }
> >
> > static inline void xxx_enter(struct xxx_struct *xxx)
> > {
> > atomic_inc(&xxx->counter);
> > synchronize_sched();
> > }
> >
> > static inline void xxx_enter(struct xxx_struct *xxx)
> > {
> > synchronize_sched();
> > atomic_dec(&xxx->counter);
> > }
>
> But there is nothing for synchronize_sched() to wait for in the above.
> Presumably the caller of xxx_is_idle() is required to disable preemption
> or be under rcu_read_lock_sched()?
Yes, yes, sure, xxx_is_idle() should be called under preempt_disable().
(or rcu_read_lock() if xxx_enter() uses synchronize_rcu()).
> So you are trying to make something that abstracts the RCU-protected
> state-change pattern? Or perhaps more accurately, the RCU-protected
> state-change-and-back pattern?
Yes, exactly.
> > struct xxx_struct {
> > int gp_state;
> >
> > int gp_count;
> > wait_queue_head_t gp_waitq;
> >
> > int cb_state;
> > struct rcu_head cb_head;
>
> spinlock_t xxx_lock; /* ? */
See
#define xxx_lock gp_waitq.lock
in .c below, but we can add another spinlock.
> This spinlock might not make the big-system guys happy, but it appears to
> be needed below.
Only the writers use this spinlock, and they should synchronize with each
other anyway. I don't think this can really penalize, say, percpu_down_write
or cpu_hotplug_begin.
> > // .c -----------------------------------------------------------------------
> >
> > enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> >
> > enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >
> > #define xxx_lock gp_waitq.lock
> >
> > void xxx_enter(struct xxx_struct *xxx)
> > {
> > bool need_wait, need_sync;
> >
> > spin_lock_irq(&xxx->xxx_lock);
> > need_wait = xxx->gp_count++;
> > need_sync = xxx->gp_state == GP_IDLE;
>
> Suppose ->gp_state is GP_PASSED. It could transition to GP_IDLE at any
> time, right?
As you already pointed below - no.
Once we incremented ->nr_writers, nobody can set GP_IDLE. And if the
caller is the "first" writer (need_sync == T) nobody else can change
->gp_state, so xxx_enter() sets GP_PASSED lockless.
> > if (need_sync)
> > xxx->gp_state = GP_PENDING;
> > spin_unlock_irq(&xxx->xxx_lock);
> >
> > BUG_ON(need_wait && need_sync);
> >
> > } if (need_sync) {
> > synchronize_sched();
> > xxx->gp_state = GP_PASSED;
> > wake_up_all(&xxx->gp_waitq);
> > } else if (need_wait) {
> > wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>
> Suppose the wakeup is delayed until after the state has been updated
> back to GP_IDLE? Ah, presumably the non-zero ->gp_count prevents this.
Yes, exactly.
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else {
> > xxx->cb_state = CB_IDLE;
> > xxx->gp_state = GP_IDLE;
> > }
>
> It took me a bit to work out the above. It looks like the intent is
> to have the last xxx_exit() put the state back to GP_IDLE, which appears
> to be the state in which readers can use a fastpath.
Yes, and we we offload this work to rcu callback so xxx_exit() doesn't
block.
The only complication is the next writer which does xxx_enter() after
xxx_exit(). If there are no other writers, the next xxx_exit() should do
rcu_cancel(&xxx->cb_head);
call_rcu_sched(&xxx->cb_head, cb_rcu_func);
to "extend" the gp, but since we do not have rcu_cancel() it simply sets
CB_REPLAY to instruct cb_rcu_func() to reschedule itself.
> This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
> there must be an xxx_exit() in our future.
Yes, but ->cb_state doesn't really matter if ->gp_count != 0 in xxx_exit()
or cb_rcu_func() (except it can't be CB_IDLE in cb_rcu_func).
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > spin_lock_irq(&xxx->xxx_lock);
> > if (!--xxx->gp_count) {
> > if (xxx->cb_state == CB_IDLE) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else if (xxx->cb_state == CB_PENDING) {
> > xxx->cb_state = CB_REPLAY;
> > }
> > }
> > spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> Then we also have something like this?
>
> bool xxx_readers_fastpath_ok(struct xxx_struct *xxx)
> {
> BUG_ON(!rcu_read_lock_sched_held());
> return xxx->gp_state == GP_IDLE;
> }
Yes, this is what xxx_is_idle() does (ignoring BUG_ON). It actually
checks xxx->gp_state == 0, this is just to avoid the unnecessary export
of GP_* enum.
Oleg.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2013-09-30 12:49 UTC|newest]
Thread overview: 182+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-09-10 9:31 [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7 Mel Gorman
2013-09-10 9:31 ` [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream Mel Gorman
2013-09-11 0:58 ` Joonsoo Kim
2013-09-11 3:11 ` Hillf Danton
2013-09-13 8:11 ` Mel Gorman
2013-09-10 9:31 ` [PATCH 02/50] mm: numa: Document automatic NUMA balancing sysctls Mel Gorman
2013-09-10 9:31 ` [PATCH 03/50] sched, numa: Comment fixlets Mel Gorman
2013-09-10 9:31 ` [PATCH 04/50] mm: numa: Do not account for a hinting fault if we raced Mel Gorman
2013-09-10 9:31 ` [PATCH 05/50] mm: Wait for THP migrations to complete during NUMA hinting faults Mel Gorman
2013-09-10 9:31 ` [PATCH 06/50] mm: Prevent parallel splits during THP migration Mel Gorman
2013-09-10 9:31 ` [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update Mel Gorman
2013-09-16 12:36 ` Peter Zijlstra
2013-09-16 13:39 ` Rik van Riel
2013-09-16 14:54 ` Peter Zijlstra
2013-09-16 16:11 ` Mel Gorman
2013-09-16 16:37 ` Peter Zijlstra
2013-09-10 9:31 ` [PATCH 08/50] mm: numa: Sanitize task_numa_fault() callsites Mel Gorman
2013-09-10 9:31 ` [PATCH 09/50] mm: numa: Do not migrate or account for hinting faults on the zero page Mel Gorman
2013-09-10 9:31 ` [PATCH 10/50] sched: numa: Mitigate chance that same task always updates PTEs Mel Gorman
2013-09-10 9:31 ` [PATCH 11/50] sched: numa: Continue PTE scanning even if migrate rate limited Mel Gorman
2013-09-10 9:31 ` [PATCH 12/50] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" Mel Gorman
2013-09-10 9:31 ` [PATCH 13/50] sched: numa: Initialise numa_next_scan properly Mel Gorman
2013-09-10 9:31 ` [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned Mel Gorman
2013-09-16 15:18 ` Peter Zijlstra
2013-09-16 15:40 ` Mel Gorman
2013-09-10 9:31 ` [PATCH 15/50] sched: numa: Correct adjustment of numa_scan_period Mel Gorman
2013-09-10 9:31 ` [PATCH 16/50] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning Mel Gorman
2013-09-10 9:31 ` [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry Mel Gorman
2013-09-16 16:35 ` Peter Zijlstra
2013-09-17 17:00 ` Mel Gorman
2013-09-10 9:31 ` [PATCH 18/50] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded Mel Gorman
2013-09-10 9:31 ` [PATCH 19/50] sched: Track NUMA hinting faults on per-node basis Mel Gorman
2013-09-10 9:32 ` [PATCH 20/50] sched: Select a preferred node with the most numa hinting faults Mel Gorman
2013-09-10 9:32 ` [PATCH 21/50] sched: Update NUMA hinting faults once per scan Mel Gorman
2013-09-10 9:32 ` [PATCH 22/50] sched: Favour moving tasks towards the preferred node Mel Gorman
2013-09-10 9:32 ` [PATCH 23/50] sched: Resist moving tasks towards nodes with fewer hinting faults Mel Gorman
2013-09-10 9:32 ` [PATCH 24/50] sched: Reschedule task on preferred NUMA node once selected Mel Gorman
2013-09-10 9:32 ` [PATCH 25/50] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults Mel Gorman
2013-09-10 9:32 ` [PATCH 26/50] sched: Check current->mm before allocating NUMA faults Mel Gorman
2013-09-10 9:32 ` [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount Mel Gorman
2013-09-12 2:10 ` Hillf Danton
2013-09-13 8:11 ` Mel Gorman
2013-09-10 9:32 ` [PATCH 28/50] sched: Remove check that skips small VMAs Mel Gorman
2013-09-10 9:32 ` [PATCH 29/50] sched: Set preferred NUMA node based on number of private faults Mel Gorman
2013-09-10 9:32 ` [PATCH 30/50] sched: Do not migrate memory immediately after switching node Mel Gorman
2013-09-10 9:32 ` [PATCH 31/50] sched: Avoid overloading CPUs on a preferred NUMA node Mel Gorman
2013-09-10 9:32 ` [PATCH 32/50] sched: Retry migration of tasks to CPU on a preferred node Mel Gorman
2013-09-10 9:32 ` [PATCH 33/50] sched: numa: increment numa_migrate_seq when task runs in correct location Mel Gorman
2013-09-10 9:32 ` [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries Mel Gorman
2013-09-17 2:02 ` 答复: " 张天飞
2013-09-17 8:05 ` ????: " Mel Gorman
2013-09-17 8:22 ` Figo.zhang
2013-09-10 9:32 ` [PATCH 35/50] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults Mel Gorman
2013-09-10 9:32 ` [PATCH 36/50] stop_machine: Introduce stop_two_cpus() Mel Gorman
2013-09-10 9:32 ` [PATCH 37/50] sched: Introduce migrate_swap() Mel Gorman
2013-09-17 14:30 ` [PATCH] hotplug: Optimize {get,put}_online_cpus() Peter Zijlstra
2013-09-17 16:20 ` Mel Gorman
2013-09-17 16:45 ` Peter Zijlstra
2013-09-18 15:49 ` Peter Zijlstra
2013-09-19 14:32 ` Peter Zijlstra
2013-09-21 16:34 ` Oleg Nesterov
2013-09-21 19:13 ` Oleg Nesterov
2013-09-23 9:29 ` Peter Zijlstra
2013-09-23 17:32 ` Oleg Nesterov
2013-09-24 20:24 ` Peter Zijlstra
2013-09-24 21:02 ` Peter Zijlstra
2013-09-25 15:55 ` Oleg Nesterov
2013-09-25 16:59 ` Paul E. McKenney
2013-09-25 17:43 ` Peter Zijlstra
2013-09-25 17:50 ` Oleg Nesterov
2013-09-25 18:40 ` Peter Zijlstra
2013-09-25 21:22 ` Paul E. McKenney
2013-09-26 11:10 ` Peter Zijlstra
[not found] ` <20130926155321.GA4342@redhat.com>
2013-09-26 16:13 ` Peter Zijlstra
2013-09-26 16:14 ` Oleg Nesterov
2013-09-26 16:40 ` Peter Zijlstra
2013-09-26 16:58 ` Oleg Nesterov
2013-09-26 17:50 ` Peter Zijlstra
2013-09-27 18:15 ` Oleg Nesterov
2013-09-27 20:41 ` Peter Zijlstra
2013-09-28 12:48 ` Oleg Nesterov
2013-09-28 14:47 ` Peter Zijlstra
2013-09-28 16:31 ` Oleg Nesterov
2013-09-30 20:11 ` Rafael J. Wysocki
2013-10-01 17:11 ` Srivatsa S. Bhat
2013-10-01 17:36 ` Peter Zijlstra
2013-10-01 17:45 ` Oleg Nesterov
2013-10-01 17:56 ` Peter Zijlstra
2013-10-01 18:07 ` Oleg Nesterov
2013-10-01 19:05 ` Paul E. McKenney
2013-10-02 12:16 ` Oleg Nesterov
2013-10-02 9:08 ` Peter Zijlstra
2013-10-02 12:13 ` Oleg Nesterov
2013-10-02 12:25 ` Peter Zijlstra
2013-10-02 13:31 ` Peter Zijlstra
2013-10-02 14:00 ` Oleg Nesterov
2013-10-02 15:17 ` Peter Zijlstra
2013-10-02 16:31 ` Oleg Nesterov
2013-10-02 17:52 ` Paul E. McKenney
2013-10-01 19:03 ` Srivatsa S. Bhat
2013-10-01 18:14 ` Srivatsa S. Bhat
2013-10-01 18:56 ` Srivatsa S. Bhat
2013-10-02 10:14 ` Srivatsa S. Bhat
2013-09-28 20:46 ` Paul E. McKenney
2013-10-01 3:56 ` Paul E. McKenney
2013-10-01 14:14 ` Oleg Nesterov
2013-10-01 14:45 ` Paul E. McKenney
2013-10-01 14:48 ` Peter Zijlstra
2013-10-01 15:24 ` Paul E. McKenney
2013-10-01 15:34 ` Oleg Nesterov
2013-10-01 15:00 ` Oleg Nesterov
2013-09-29 13:56 ` Oleg Nesterov
2013-10-01 15:38 ` Paul E. McKenney
2013-10-01 15:40 ` Oleg Nesterov
2013-10-01 20:40 ` Paul E. McKenney
2013-09-23 14:50 ` Steven Rostedt
2013-09-23 14:54 ` Peter Zijlstra
2013-09-23 15:13 ` Steven Rostedt
2013-09-23 15:22 ` Peter Zijlstra
2013-09-23 15:59 ` Steven Rostedt
2013-09-23 16:02 ` Peter Zijlstra
2013-09-23 15:50 ` Paul E. McKenney
2013-09-23 16:01 ` Peter Zijlstra
2013-09-23 17:04 ` Paul E. McKenney
2013-09-23 17:30 ` Peter Zijlstra
2013-09-23 17:50 ` Oleg Nesterov
2013-09-24 12:38 ` Peter Zijlstra
2013-09-24 14:42 ` Paul E. McKenney
2013-09-24 16:09 ` Peter Zijlstra
2013-09-24 16:31 ` Oleg Nesterov
2013-09-24 21:09 ` Paul E. McKenney
2013-09-24 16:03 ` Oleg Nesterov
2013-09-24 16:43 ` Steven Rostedt
2013-09-24 17:06 ` Oleg Nesterov
2013-09-24 17:47 ` Paul E. McKenney
2013-09-24 18:00 ` Oleg Nesterov
2013-09-24 20:35 ` Peter Zijlstra
2013-09-25 15:16 ` Oleg Nesterov
2013-09-25 15:35 ` Peter Zijlstra
2013-09-25 16:33 ` Oleg Nesterov
2013-09-24 16:49 ` Paul E. McKenney
2013-09-24 16:54 ` Peter Zijlstra
2013-09-24 17:02 ` Oleg Nesterov
2013-09-24 16:51 ` Peter Zijlstra
2013-09-24 16:39 ` Steven Rostedt
2013-09-29 18:36 ` [RFC] introduce synchronize_sched_{enter,exit}() Oleg Nesterov
2013-09-29 20:01 ` Paul E. McKenney
2013-09-30 12:42 ` Oleg Nesterov [this message]
2013-09-29 21:34 ` Steven Rostedt
2013-09-30 13:03 ` Oleg Nesterov
2013-09-30 12:59 ` Peter Zijlstra
2013-09-30 14:24 ` Peter Zijlstra
2013-09-30 15:06 ` Peter Zijlstra
2013-09-30 16:58 ` Oleg Nesterov
2013-09-30 16:38 ` Oleg Nesterov
2013-10-02 14:41 ` Peter Zijlstra
2013-10-03 7:04 ` Ingo Molnar
2013-10-03 7:43 ` Peter Zijlstra
2013-09-17 14:32 ` [PATCH 37/50] sched: Introduce migrate_swap() Peter Zijlstra
2013-09-10 9:32 ` [PATCH 38/50] sched: numa: Use a system-wide search to find swap/migration candidates Mel Gorman
2013-09-10 9:32 ` [PATCH 39/50] sched: numa: Favor placing a task on the preferred node Mel Gorman
2013-09-10 9:32 ` [PATCH 40/50] mm: numa: Change page last {nid,pid} into {cpu,pid} Mel Gorman
2013-09-10 9:32 ` [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults Mel Gorman
2013-09-12 12:42 ` Hillf Danton
2013-09-12 14:40 ` Mel Gorman
2013-09-12 12:45 ` Hillf Danton
2013-09-10 9:32 ` [PATCH 42/50] sched: numa: Report a NUMA task group ID Mel Gorman
2013-09-10 9:32 ` [PATCH 43/50] mm: numa: Do not group on RO pages Mel Gorman
2013-09-10 9:32 ` [PATCH 44/50] sched: numa: stay on the same node if CLONE_VM Mel Gorman
2013-09-10 9:32 ` [PATCH 45/50] sched: numa: use group fault statistics in numa placement Mel Gorman
2013-09-10 9:32 ` [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement Mel Gorman
2013-09-20 9:55 ` Peter Zijlstra
2013-09-20 12:31 ` Mel Gorman
2013-09-20 12:36 ` Peter Zijlstra
2013-09-20 13:31 ` Mel Gorman
2013-09-10 9:32 ` [PATCH 47/50] sched: numa: add debugging Mel Gorman
2013-09-10 9:32 ` [PATCH 48/50] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships Mel Gorman
2013-09-10 9:32 ` [PATCH 49/50] sched: numa: fix task or group comparison Mel Gorman
2013-09-10 9:32 ` [PATCH 50/50] sched: numa: Avoid migrating tasks that are placed on their preferred node Mel Gorman
2013-09-11 2:03 ` [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7 Rik van Riel
2013-09-14 2:57 ` Bob Liu
2013-09-30 10:30 ` Mel Gorman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130930124214.GA19560@redhat.com \
--to=oleg@redhat.com \
--cc=aarcange@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@suse.de \
--cc=mingo@kernel.org \
--cc=paulmck@linux.vnet.ibm.com \
--cc=peterz@infradead.org \
--cc=riel@redhat.com \
--cc=rostedt@goodmis.org \
--cc=srikar@linux.vnet.ibm.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).