From: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
To: Jamie Lokier <jamie@shareable.org>
Cc: bert hubert <ahu@ds9a.nl>, Andrew Morton <akpm@osdl.org>,
	linux-kernel@vger.kernel.org, rusty@rustcorp.com.au,
	mingo@elte.hu
Subject: Re: Futex queue_me/get_user ordering
Date: Tue, 16 Nov 2004 17:30:24 +0900
Message-ID: <4199BAA0.1070608@jp.fujitsu.com>
In-Reply-To: <20041115141247.GC25502@mail.shareable.org>

OMG... Wait, wait... Don't do anything.

I have to apologize deeply to everyone for my mistake.
If my understanding is correct, this bug is "2.4 futex" (RHEL3) *SPECIFIC*!!
I had swallowed the story that the 2.6 futex has the same problem...

So I now realize that the 2.6 futex never behaves like this:
 >>      "returns 0 if the futex was not equal to the expected value, but
 >>       the process was woken by a FUTEX_WAKE call."

An update of the manpage is now unnecessary, I think.

#

First of all, I would appreciate it if you could read my old post:
"Kernel bug in futex_wait, cause application hang with NPTL"
http://www.ussg.iu.edu/hypermail/linux/kernel/0409.0/2044.html

#

Then, let's go on to the main subject.

Jamie Lokier wrote:
 > In fact, waiting does not get the lock for the futex.  It relies on
 > the ordering of (1) adding to the wait queue, (2) checking the current
 > value, and (3) removing from the wait queue if the value doesn't
 > match.  Among other things, this is necessary because checking the
 > current value cannot be done with a spinlock held.
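
To make the ordering described in the quote concrete, here is a minimal
single-threaded sketch in user-space C. It is only a model under stated
assumptions: `struct waiter`, `model_wait`, `model_wake` and the two
`demo_*` helpers are hypothetical names, not kernel functions, and real
locking and sleeping are left out. What it tries to show is that the waiter
is queued before the value is read, so a waker that changes the value and
then scans the queue cannot miss a waiter that already saw the old value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct waiter {
    bool queued;   /* is this waiter currently on the wait queue? */
};

enum wait_result { WAIT_WOULD_BLOCK, WAIT_SLEEPING };

/* (1) queue, (2) read the futex word, (3) unqueue on mismatch. */
static enum wait_result model_wait(struct waiter *w, const int *word,
                                   int expected)
{
    assert(w != NULL && word != NULL);
    w->queued = true;             /* (1) add to the wait queue first */
    if (*word != expected) {      /* (2) check the current value */
        w->queued = false;        /* (3) mismatch: remove ourselves */
        return WAIT_WOULD_BLOCK;
    }
    return WAIT_SLEEPING;         /* still queued; a later wake finds us */
}

/* Wakes the waiter if it is queued; returns true if it woke someone. */
static bool model_wake(struct waiter *w)
{
    assert(w != NULL);
    if (!w->queued)
        return false;
    w->queued = false;
    return true;
}

/* Returns 1 when the waiter sleeps on a matching value and is then woken. */
static int demo_sleep_then_wake(void)
{
    int word = 0;
    struct waiter w = { false };

    if (model_wait(&w, &word, 0) != WAIT_SLEEPING)
        return 0;
    word = 1;                         /* the waker changes the value ... */
    return model_wake(&w) ? 1 : 0;    /* ... then wakes the queued waiter */
}

/* Returns 1 when a stale expected value is rejected and the waiter is
 * no longer queued afterwards. */
static int demo_value_changed(void)
{
    int word = 1;
    struct waiter w = { false };

    return model_wait(&w, &word, 0) == WAIT_WOULD_BLOCK && !w.queued;
}
```

If the check came before the queueing instead, a wake arriving between the
two steps would find nobody queued and the waiter would go to sleep on a
value that had already changed.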

If my understanding is correct, the 2.6 futex does not take any spinlocks,
but a semaphore (mmap_sem):

[kernel/futex.c](from 2.6, RHEL4b2)
  286 static int futex_wake(unsigned long uaddr, int nr_wake)
  287 {
   :
  294         down_read(&current->mm->mmap_sem);
   :
  306                         wake_futex(this);
   :
  314         up_read(&current->mm->mmap_sem);
  315         return ret;
  316 }
   :
  477 static int futex_wait(unsigned long uaddr, int val, unsigned long time)
  478 {
   :
  483         down_read(&current->mm->mmap_sem);
   :
  489         queue_me(&q, -1, NULL);
   :
  500         if (curval != val) {
  501                 ret = -EWOULDBLOCK;
  502                 goto out_unqueue;
  503         }
   :
  509         up_read(&current->mm->mmap_sem);
   :
  528                 time = schedule_timeout(time);
   :
  536         /* If we were woken (and unqueued), we succeeded, whatever. */
  537         if (!unqueue_me(&q))
  538                 return 0;
  539         if (time == 0)
  540                 return -ETIMEDOUT;
  541         /* A spurious wakeup should never happen. */
  542         WARN_ON(!signal_pending(current));
  543         return -EINTR;
  544
  545  out_unqueue:
  546         /* If we were woken (and unqueued), we succeeded, whatever. */
  547         if (!unqueue_me(&q))
  548                 ret = 0;
  549  out_release_sem:
  550         up_read(&current->mm->mmap_sem);
  551         return ret;
  552 }
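
The tail of the futex_wait() listing above (lines 536-548) decides the
return value in a fixed order. As a rough illustration, that decision can be
written as a small pure function. This is a sketch only: `model_wait_result`
and its three boolean parameters are hypothetical, with `woken` standing for
"unqueue_me() returned 0", and `check_return_table` is just a usage example.
Whether the "woken although the value did not match" case can really be
reached depends on the locking discussed in the surrounding text, which this
sketch does not model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the return-value logic in the quoted listing.
 *   value_matched: curval == val at the time of the check
 *   woken:         unqueue_me() returned 0 (a waker already unqueued us)
 *   timed_out:     schedule_timeout() returned 0
 */
static int model_wait_result(bool value_matched, bool woken, bool timed_out)
{
    if (!value_matched)                    /* out_unqueue path */
        return woken ? 0 : -EWOULDBLOCK;
    if (woken)                             /* woken and unqueued: success */
        return 0;
    if (timed_out)
        return -ETIMEDOUT;
    return -EINTR;                         /* otherwise a signal is assumed */
}

/* Usage example: walks through every branch of the function above. */
static void check_return_table(void)
{
    assert(model_wait_result(false, false, false) == -EWOULDBLOCK);
    assert(model_wait_result(false, true,  false) == 0);
    assert(model_wait_result(true,  true,  false) == 0);
    assert(model_wait_result(true,  false, true)  == -ETIMEDOUT);
    assert(model_wait_result(true,  false, false) == -EINTR);
}
```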

This semaphore prevents a waiter that is queued only temporarily, while val
is being checked, from becoming the target of a wakeup.

So my "[simulation]" is wrong on 2.6, since wake_Y is never able to touch
the queue while wait_A is in the queue only to have val checked.

(Provided it is not possible for threads to use the same futex/condvar while
each has a different mmap_sem,) the 2.6 futex is quite good.

#

Next, let's look at the 2.4 futex:

[kernel/futex.c](from 2.4, RHEL3U2)
  154 static inline int futex_wake(unsigned long uaddr, int offset, int num)
  155 {
   :
  160         lock_futex_mm();
   :
  176                         wake_up_all(&this->waiters);
   :
  185         unlock_futex_mm();
   :
  188         return ret;
  189 }
   :
  310 static inline int futex_wait(unsigned long uaddr,
  311                       int offset,
  312                       int val,
  313                       unsigned long time)
  314 {
   :
  323         lock_futex_mm();
   :
  330         __queue_me(&q, page, uaddr, offset, -1, NULL);
   :
  342         if (curval != val) {
  343                 unlock_futex_mm();
  344                 ret = -EWOULDBLOCK;
  345                 goto out;
  346         }
   :
  357                 unlock_futex_mm();
  358                 time = schedule_timeout(time);
   :
  365         if (time == 0) {
  366                 ret = -ETIMEDOUT;
  367                 goto out;
  368         }
  369         if (signal_pending(current))
  370                 ret = -EINTR;
  371 out:
  372         /* Were we woken up anyway? */
  373         if (!unqueue_me(&q))
  374                 ret = 0;
  375         put_page(q.page);
  376
  377         return ret;
   :
  383 }

The 2.4 futex uses spinlocks:

   74 static inline void lock_futex_mm(void)
   75 {
   76         spin_lock(&current->mm->page_table_lock);
   77         spin_lock(&vcache_lock);
   78         spin_lock(&futex_lock);
   79 }
   80
   81 static inline void unlock_futex_mm(void)
   82 {
   83         spin_unlock(&futex_lock);
   84         spin_unlock(&vcache_lock);
   85         spin_unlock(&current->mm->page_table_lock);
   86 }

However, these spinlocks fail to protect temporarily queued waiters from
wakeups, because the spinlocks are released *before* unqueue_me(&q)
(lines 343 & 373). This allows wake_Y to touch the queue while wait_A is
still in it.
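
The interleaving can be replayed deterministically in a small user-space
model. This is a minimal sketch under stated assumptions, not the kernel
code: `struct model`, `wait_a_check`, `wake_y`, `model_unqueue_me`,
`wait_a_finish` and `replay` are hypothetical names, a single flag stands in
for the three spinlocks, and the steps run one after another in one thread
rather than concurrently. The comments refer to lines 343 and 373 of the 2.4
listing above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the 2.4 listing; not kernel code. */
struct model {
    bool locked;    /* stands in for the spinlocks of lock_futex_mm() */
    bool a_queued;  /* is wait_A currently on the futex queue? */
};

/* wait_A, first half: lock, queue, compare, unlock on mismatch (line 343). */
static int wait_a_check(struct model *m, int curval, int val)
{
    m->locked = true;
    m->a_queued = true;
    if (curval != val) {
        m->locked = false;      /* locks dropped while A is still queued */
        return -EWOULDBLOCK;
    }
    m->locked = false;          /* sleeping path, not modelled further */
    return 0;
}

/* wake_Y: wakes (unqueues) A if A is queued; returns the number woken. */
static int wake_y(struct model *m)
{
    int woken = 0;

    assert(!m->locked);         /* Y only gets in because A unlocked */
    m->locked = true;
    if (m->a_queued) {
        m->a_queued = false;
        woken = 1;
    }
    m->locked = false;
    return woken;
}

/* Mirrors unqueue_me(): nonzero if we removed ourselves, 0 if already gone. */
static int model_unqueue_me(struct model *m)
{
    int was_queued = m->a_queued;

    m->a_queued = false;
    return was_queued;
}

/* wait_A, second half: the "out:" label (line 373). */
static int wait_a_finish(struct model *m, int ret)
{
    if (!model_unqueue_me(m))
        ret = 0;                /* "Were we woken up anyway?" */
    return ret;
}

/* Replays wait_A with a mismatching value; wake_in_window selects whether
 * wake_Y runs in the gap between line 343 and line 373. */
static int replay(bool wake_in_window, int *woken_by_y)
{
    struct model m = { false, false };
    int ret = wait_a_check(&m, 1, 0);   /* curval = 1, val = 0: mismatch */

    *woken_by_y = wake_in_window ? wake_y(&m) : 0;
    return wait_a_finish(&m, ret);
}
```

With wake_in_window set, replay() returns 0 and wake_Y reports one waiter
woken although the value never matched, so the wakeup is absorbed by a
thread that was not going to sleep. Without it, replay() returns
-EWOULDBLOCK.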

Of course, as you know, this causes the bug that I mentioned.
(I don't know how many distributions ship a 2.4 futex, but)
at least the 2.4 futex in RHEL3U2 is buggy.

#

I regret that I did not notice this earlier.
I'm sorry... I hope you'll accept my apology.


Thanks,
H.Seto

