From: Oleg Nesterov <oleg@redhat.com>
To: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
mingo@kernel.org, linux-kernel@vger.kernel.org,
paulmck@linux.vnet.ibm.com, corbet@lwn.net, mhocko@kernel.org,
dhowells@redhat.com, torvalds@linux-foundation.org,
will.deacon@arm.com, Michael Ellerman <mpe@ellerman.id.au>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>,
Paul Mackerras <paulus@samba.org>
Subject: Re: [PATCH 4/4] locking: Introduce smp_cond_acquire()
Date: Thu, 12 Nov 2015 16:00:58 +0100 [thread overview]
Message-ID: <20151112150058.GA30321@redhat.com> (raw)
In-Reply-To: <20151112070915.GC6314@fixme-laptop.cn.ibm.com>
On 11/12, Boqun Feng wrote:
>
> On Wed, Nov 11, 2015 at 08:39:53PM +0100, Oleg Nesterov wrote:
> >
> > object_t *object;
> > spinlock_t lock;
> >
> > void update(void)
> > {
> > object_t *o;
> >
> > spin_lock(&lock);
> > o = READ_ONCE(object);
> > if (o) {
> > BUG_ON(o->dead);
> > do_something(o);
> > }
> > spin_unlock(&lock);
> > }
> >
> > void destroy(void) // can be called only once, can't race with itself
> > {
> > object_t *o;
> >
> > o = object;
> > object = NULL;
> >
> > /*
> > * pairs with lock/ACQUIRE. The next update() must see
> > * object == NULL after spin_lock();
> > */
> > smp_mb();
> >
> > spin_unlock_wait(&lock);
> >
> > /*
> > * pairs with unlock/RELEASE. The previous update() has
> > * already passed BUG_ON(o->dead).
> > *
> > * (Yes, yes, in this particular case it is not needed,
> > * we can rely on the control dependency).
> > */
> > smp_mb();
> >
> > o->dead = true;
> > }
> >
> > I believe the code above is correct and it needs the barriers on both sides.
> >
>
> Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> only guarantees that the memory operations following spin_lock() can't
> be reorder before the *LOAD* part of spin_lock() not the *STORE* part,
> i.e. the case below can happen(assuming the spin_lock() is implemented
> as ll/sc loop)
>
> spin_lock(&lock):
> r1 = *lock; // LL, r1 == 0
> o = READ_ONCE(object); // could be reordered here.
> *lock = 1; // SC
>
> This could happen because of the ACQUIRE semantics of spin_lock(), and
> the current implementation of spin_lock() on PPC allows this happen.
>
> (Cc PPC maintainers for their opinions on this one)
In this case the code above is obviously wrong. And I do not understand
how we can rely on spin_unlock_wait() then.
And afaics do_exit() is buggy too then, see below.
> I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> a control dependency to pair with spin_unlock(), for example, the
> following snippet in do_exit() is OK, except the smp_mb() is redundant,
> unless I'm missing something subtle:
>
> /*
> * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> * when the following two conditions become true.
> * - There is race condition of mmap_sem (It is acquired by
> * exit_mm()), and
> * - SMI occurs before setting TASK_RUNINNG.
> * (or hypervisor of virtual machine switches to other guest)
> * As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> *
> * To avoid it, we have to wait for releasing tsk->pi_lock which
> * is held by try_to_wake_up()
> */
> smp_mb();
> raw_spin_unlock_wait(&tsk->pi_lock);
Perhaps it is me who missed something. But I don't think we can remove
this mb(). And at the same time it can't help on PPC if I understand
your explanation above correctly.
To simplify, lets ignore exit_mm/down_read/etc. The exiting task does
current->state = TASK_UNINTERRUPTIBLE;
// without schedule() in between
current->state = TASK_RUNNING;
smp_mb();
spin_unlock_wait(pi_lock);
current->state = TASK_DEAD;
schedule();
and we need to ensure that if we race with try_to_wake_up(TASK_UNINTERRUPTIBLE)
it can't change TASK_DEAD back to RUNNING.
Without smp_mb() this can be reordered, spin_unlock_wait(pi_locked) can
read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
so in fact we could have
current->state = TASK_UNINTERRUPTIBLE;
spin_unlock_wait(pi_lock);
current->state = TASK_RUNNING;
current->state = TASK_DEAD;
and this can obviously race with ttwu() which can take pi_lock and see
state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
And, if I understand you correctly, this smp_mb() can't help on PPC.
try_to_wake_up() can read task->state before it writes to *pi_lock.
To me this doesn't really differ from the code above,
CPU 1 (do_exit) CPU_2 (ttwu)
spin_lock(pi_lock):
r1 = *pi_lock; // r1 == 0;
p->state = TASK_UNINTERRUPTIBLE;
state = p->state;
p->state = TASK_RUNNING;
mb();
spin_unlock_wait();
*pi_lock = 1;
p->state = TASK_DEAD;
if (state & TASK_UNINTERRUPTIBLE) // true
p->state = RUNNING;
No?
And smp_mb__before_spinlock() looks wrong too then.
Oleg.
next prev parent reply other threads:[~2015-11-12 14:04 UTC|newest]
Thread overview: 78+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-11-02 13:29 [PATCH 0/4] scheduler ordering bits Peter Zijlstra
2015-11-02 13:29 ` [PATCH 1/4] sched: Better document the try_to_wake_up() barriers Peter Zijlstra
2015-12-04 0:09 ` Byungchul Park
2015-12-04 0:58 ` Byungchul Park
2015-11-02 13:29 ` [PATCH 2/4] sched: Document Program-Order guarantees Peter Zijlstra
2015-11-02 20:27 ` Paul Turner
2015-11-02 20:34 ` Peter Zijlstra
2015-11-02 22:09 ` Paul Turner
2015-11-02 22:12 ` Peter Zijlstra
2015-11-20 10:02 ` Peter Zijlstra
2015-11-20 14:08 ` Boqun Feng
2015-11-20 14:18 ` Peter Zijlstra
2015-11-20 14:21 ` Boqun Feng
2015-11-20 19:41 ` Peter Zijlstra
2015-11-02 13:29 ` [PATCH 3/4] sched: Fix a race in try_to_wake_up() vs schedule() Peter Zijlstra
2015-11-02 13:29 ` [PATCH 4/4] locking: Introduce smp_cond_acquire() Peter Zijlstra
2015-11-02 13:57 ` Peter Zijlstra
2015-11-02 17:43 ` Will Deacon
2015-11-03 1:14 ` Paul E. McKenney
2015-11-03 1:25 ` Linus Torvalds
2015-11-02 17:42 ` Will Deacon
2015-11-02 18:08 ` Linus Torvalds
2015-11-02 18:37 ` Will Deacon
2015-11-02 19:17 ` Linus Torvalds
2015-11-02 19:57 ` Will Deacon
2015-11-02 20:23 ` Peter Zijlstra
2015-11-02 21:56 ` Peter Zijlstra
2015-11-03 1:57 ` Paul E. McKenney
2015-11-03 19:40 ` Linus Torvalds
2015-11-04 3:57 ` Paul E. McKenney
2015-11-04 4:43 ` Linus Torvalds
2015-11-04 12:54 ` Paul E. McKenney
2015-11-02 20:36 ` David Howells
2015-11-02 20:40 ` Peter Zijlstra
2015-11-02 21:11 ` Linus Torvalds
2015-11-03 17:59 ` Oleg Nesterov
2015-11-03 18:23 ` Peter Zijlstra
2015-11-11 9:39 ` Boqun Feng
2015-11-11 10:34 ` Boqun Feng
2015-11-11 19:53 ` Oleg Nesterov
2015-11-12 13:50 ` Paul E. McKenney
2015-11-11 12:12 ` Peter Zijlstra
2015-11-11 19:39 ` Oleg Nesterov
2015-11-11 21:23 ` Linus Torvalds
2015-11-12 7:14 ` Boqun Feng
2015-11-12 10:28 ` Peter Zijlstra
2015-11-12 15:00 ` Oleg Nesterov [this message]
2015-11-12 14:40 ` Paul E. McKenney
2015-11-12 14:49 ` Boqun Feng
2015-11-12 15:02 ` Paul E. McKenney
2015-11-12 21:53 ` Will Deacon
2015-11-12 14:50 ` Peter Zijlstra
2015-11-12 15:01 ` Paul E. McKenney
2015-11-12 15:08 ` Peter Zijlstra
2015-11-12 15:20 ` Paul E. McKenney
2015-11-12 21:25 ` Will Deacon
2015-11-12 15:18 ` Boqun Feng
2015-11-12 18:38 ` Oleg Nesterov
2015-11-12 18:02 ` Peter Zijlstra
2015-11-12 19:33 ` Oleg Nesterov
2015-11-12 18:59 ` Paul E. McKenney
2015-11-12 21:33 ` Will Deacon
2015-11-12 23:43 ` Paul E. McKenney
2015-11-16 13:58 ` Will Deacon
2015-11-12 18:21 ` Linus Torvalds
2015-11-12 22:09 ` Will Deacon
2015-11-16 15:56 ` Peter Zijlstra
2015-11-16 16:04 ` Peter Zijlstra
2015-11-16 16:24 ` Will Deacon
2015-11-16 16:44 ` Paul E. McKenney
2015-11-16 16:46 ` Will Deacon
2015-11-16 17:15 ` Paul E. McKenney
2015-11-16 21:58 ` Linus Torvalds
2015-11-17 11:51 ` Will Deacon
2015-11-17 21:01 ` Paul E. McKenney
2015-11-18 11:25 ` Will Deacon
2015-11-19 18:01 ` Will Deacon
2015-11-20 10:09 ` Peter Zijlstra
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20151112150058.GA30321@redhat.com \
--to=oleg@redhat.com \
--cc=benh@kernel.crashing.org \
--cc=boqun.feng@gmail.com \
--cc=corbet@lwn.net \
--cc=dhowells@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mhocko@kernel.org \
--cc=mingo@kernel.org \
--cc=mpe@ellerman.id.au \
--cc=paulmck@linux.vnet.ibm.com \
--cc=paulus@samba.org \
--cc=peterz@infradead.org \
--cc=torvalds@linux-foundation.org \
--cc=will.deacon@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.